Ascend / pytorch

Ascend PyTorch adapter (torch_npu). Mirror of https://gitee.com/ascend/pytorch
https://ascend.github.io/docs/

Convergence on SFT is too slow and the performance is bad #17

Open Kunhao18 opened 11 months ago

Kunhao18 commented 11 months ago

1. Description

We are doing supervised fine-tuning on large language models with the peft and trl packages. Convergence is far slower on Ascend NPUs than on GPUs: the loss started at 1.3 and dropped to 0.3 within the first half epoch on a V100, but remained around 0.8 even after 5 epochs on an Ascend 910B.

We are using accelerate launch for distributed training. The training scripts and arguments are identical across devices except for the cuda- and npu-specific parts.

There were warnings that could be the cause of the problem:

.../python3.9/site-packages/torch/autograd/__init__.py:251: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed.  This is not an error, but may impair performance.
grad.sizes() = [1024, 64], strides() = [64, 1]
bucket_view.sizes() = [65536], strides() = [1] (Triggered internally at /usr1/02/workspace/j_yxiCvvHE/pytorch/torch_npu/csrc/distributed/reducer.cpp:314.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
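To illustrate what the warning is comparing, here is a small NumPy sketch (element strides, mirroring PyTorch's stride convention; NumPy reports strides in bytes, so we divide by the item size). The grad in the warning is a contiguous [1024, 64] tensor with strides [64, 1], while the reducer's bucket view is exposed as a flat 1-D slice with sizes [65536] and strides [1]; because the shape/stride tuples differ, DDP's layout check reports a mismatch and copies the grad into the bucket on every backward pass:

```python
import numpy as np

# The parameter's grad: contiguous [1024, 64], element strides (64, 1),
# exactly as in the warning's grad.sizes()/strides().
grad = np.zeros((1024, 64), dtype=np.float32)
grad_strides = tuple(s // grad.itemsize for s in grad.strides)

# The bucket view from the warning: the same 65536 contiguous elements,
# but exposed as a flat 1-D slice rather than a view shaped like the param.
bucket_view = np.zeros(1024 * 64, dtype=np.float32)
view_strides = tuple(s // bucket_view.itemsize for s in bucket_view.strides)

# Both layouts are contiguous, yet the shape/stride tuples differ, so a
# strict layout comparison fails and the grad must be copied into the
# bucket instead of being written there in place.
print(grad.shape, grad_strides)          # (1024, 64) (64, 1)
print(bucket_view.shape, view_strides)   # (65536,) (1,)
```

This is consistent with the warning being a performance issue (an extra copy per backward pass) rather than a correctness issue, which matches the message's own wording.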

We checked the source code referenced by the message and found a difference in reducer.cpp between upstream torch and torch_npu:

2. Environment

fakeYan commented 7 months ago

First of all, there are differences in compute capability between NPU and GPU: for example, the NPU requires tensors to be contiguous during computation, and torch_npu uses a private (internal) format to represent tensors. The bucketing and transmission of data are indeed implemented differently from CUDA, but there is currently no known functional problem. Have you tried fixing the inputs and random seeds, and enabling deterministic computation, to compare the loss at each step?
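The fixed-seed, deterministic comparison suggested above can be sketched as follows. This is a minimal sketch using stock PyTorch APIs (`torch.manual_seed`, `torch.use_deterministic_algorithms`); whether torch_npu honors these on the NPU backend, or needs additional device-specific switches, is an assumption to verify:

```python
import random

import numpy as np
import torch


def seed_everything(seed: int = 42) -> None:
    """Fix all RNGs so two runs see identical inputs and initialization."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # seeds CPU and device generators


def enable_determinism() -> None:
    # Prefer deterministic kernel implementations; ops without a
    # deterministic variant will raise instead of silently diverging.
    torch.use_deterministic_algorithms(True)


# Sanity check that reseeding reproduces the same draws,
# which is the precondition for comparing per-step losses.
seed_everything(42)
a = torch.randn(4)
seed_everything(42)
b = torch.randn(4)
print(torch.equal(a, b))  # True
```

With seeds and determinism fixed on both a GPU and an NPU run, the per-step losses can be diffed directly; the first step where they diverge points at the operator or layout responsible.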