Kunhao18 opened 11 months ago
First of all, there are differences in computing capabilities between NPU and GPU. For example, NPU computation requires data to be contiguous, and in `torch_npu` we use a private format to represent tensors. The bucketing and transmission of data are indeed different from the CUDA implementation, but there is currently no problem in terms of functionality. Have you tried fixing the inputs and random seeds and using deterministic computation to compare the loss at each step?
1. Description
We are doing supervised fine-tuning on large language models with the `peft` and `trl` packages. Convergence is far slower on Ascend NPUs than on GPUs: the loss starts at 1.3 and drops to 0.3 within the first half epoch on a V100, but it stays around 0.8 even after 5 epochs on an Ascend 910B.
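For context, a rough sketch of the kind of LoRA-based SFT setup described above (model id, dtype, and LoRA hyperparameters are placeholders, not the author's actual script, which also uses `trl` and `accelerate`):

```python
# Illustrative only; the names and hyperparameters below are placeholders.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder model id
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Wrap the base model with LoRA adapters; only the adapter weights are trained.
lora = LoraConfig(task_type="CAUSAL_LM", r=16, lora_alpha=32, lora_dropout=0.05)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```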
We are using `accelerate launch` for distributed training. The training scripts and arguments are identical across devices except for the `cuda` and `npu` parts.

There were warnings during training that could be the cause of the problem:
We've checked the source code based on those messages and found a difference in `reducer.cpp` between the original `torch` and `torch_npu`:

`torch`:

`torch_npu`:
It seems `torch_npu` does not support matching the `bucket_view` strides with the gradient strides; see the sketch below.
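As a quick way to see whether such a stride mismatch applies to a given model, one could compare parameter and gradient strides after a backward pass (a hypothetical check, not part of the issue; DDP's reducer warns when the gradient layout does not match its bucket view):

```python
# Hypothetical stride check, not taken from the issue.
import torch

model = torch.nn.Linear(8, 8)
model(torch.randn(4, 8)).sum().backward()

for name, p in model.named_parameters():
    print(f"{name}: param stride {p.stride()}, grad stride {p.grad.stride()}, "
          f"match={p.stride() == p.grad.stride()}")
```

2. Environment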