Ascend / pytorch

Ascend PyTorch adapter (torch_npu). Mirror of https://gitee.com/ascend/pytorch
https://ascend.github.io/docs/

Convergence on SFT is too slow and the performance is bad #17

Open Kunhao18 opened 11 months ago

Kunhao18 commented 11 months ago

1. Description

We are doing supervised fine-tuning on large language models with the peft and trl packages. Convergence is far slower on Ascend NPUs than on GPUs: the loss started at 1.3 and dropped to 0.3 within the first half epoch on a V100, but remained around 0.8 even after 5 epochs on an Ascend 910B.

We are using accelerate launch for distributed training. The training scripts and arguments are identical across devices except for the cuda- and npu-specific parts.

There were warnings that could be the cause of the problem:

.../python3.9/site-packages/torch/autograd/__init__.py:251: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed.  This is not an error, but may impair performance.
grad.sizes() = [1024, 64], strides() = [64, 1]
bucket_view.sizes() = [65536], strides() = [1] (Triggered internally at /usr1/02/workspace/j_yxiCvvHE/pytorch/torch_npu/csrc/distributed/reducer.cpp:314.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
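To illustrate what the warning is comparing, here is a small NumPy sketch (element strides, mirroring PyTorch's stride convention; NumPy reports strides in bytes, so we divide by the item size). The grad in the warning is a contiguous [1024, 64] tensor with strides [64, 1], while the reducer's bucket view is exposed as a flat 1-D slice with sizes [65536] and strides [1]; because the shape/stride tuples differ, DDP's layout check reports a mismatch and copies the grad into the bucket on every backward pass:

```python
import numpy as np

# The parameter's grad: contiguous [1024, 64], element strides (64, 1),
# exactly as in the warning's grad.sizes()/strides().
grad = np.zeros((1024, 64), dtype=np.float32)
grad_strides = tuple(s // grad.itemsize for s in grad.strides)

# The bucket view from the warning: the same 65536 contiguous elements,
# but exposed as a flat 1-D slice rather than a view shaped like the param.
bucket_view = np.zeros(1024 * 64, dtype=np.float32)
view_strides = tuple(s // bucket_view.itemsize for s in bucket_view.strides)

# Both layouts are contiguous, yet the shape/stride tuples differ, so a
# strict layout comparison fails and the grad must be copied into the
# bucket instead of being written there in place.
print(grad.shape, grad_strides)          # (1024, 64) (64, 1)
print(bucket_view.shape, view_strides)   # (65536,) (1,)
```

This is consistent with the warning being a performance issue (an extra copy per backward pass) rather than a correctness issue, which matches the message's own wording.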

We checked the source code referenced by the message and found a difference in reducer.cpp between upstream torch and torch_npu:

2. Environment

fakeYan commented 7 months ago

First of all, there are differences in compute capability between NPU and GPU: for example, the NPU requires tensors to be contiguous during computation, and torch_npu uses a private (internal) format to represent tensors. The bucketing and transmission of data are indeed implemented differently from CUDA, but there is currently no known functional problem. Have you tried fixing the inputs and random seeds, and enabling deterministic computation, to compare the loss at each step?
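The fixed-seed, deterministic comparison suggested above can be sketched as follows. This is a minimal sketch using stock PyTorch APIs (`torch.manual_seed`, `torch.use_deterministic_algorithms`); whether torch_npu honors these on the NPU backend, or needs additional device-specific switches, is an assumption to verify:

```python
import random

import numpy as np
import torch


def seed_everything(seed: int = 42) -> None:
    """Fix all RNGs so two runs see identical inputs and initialization."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # seeds CPU and device generators


def enable_determinism() -> None:
    # Prefer deterministic kernel implementations; ops without a
    # deterministic variant will raise instead of silently diverging.
    torch.use_deterministic_algorithms(True)


# Sanity check that reseeding reproduces the same draws,
# which is the precondition for comparing per-step losses.
seed_everything(42)
a = torch.randn(4)
seed_everything(42)
b = torch.randn(4)
print(torch.equal(a, b))  # True
```

With seeds and determinism fixed on both a GPU and an NPU run, the per-step losses can be diffed directly; the first step where they diverge points at the operator or layout responsible.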