fgdfgfthgr-fox opened this issue 5 months ago
I think I get the same error:
[rank22]: File "/opt/conda/lib/python3.11/site-packages/flash_attn/bert_padding.py", line 212, in pad_input
[rank22]: output = index_put_first_axis(hidden_states, indices, batch * seqlen)
[rank22]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank22]: File "/opt/conda/lib/python3.11/site-packages/torch/autograd/function.py", line 598, in apply
[rank22]: return super().apply(*args, **kwargs) # type: ignore[misc]
[rank22]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank22]: File "/opt/conda/lib/python3.11/site-packages/flash_attn/bert_padding.py", line 51, in forward
[rank22]: output[indices] = values
[rank22]: ~~~~~~^^^^^^^^^
[rank22]: RuntimeError: CUDA error: an illegal memory access was encountered
[rank22]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank22]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank22]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f39d2711897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
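The failing line is output[indices] = values inside index_put_first_axis, so the illegal access looks consistent with indices pointing past batch * seqlen. A minimal standalone check of that invariant, with shapes and tensors made up purely for illustration (not taken from the actual run):

import torch

# Hypothetical shapes for illustration: pad_input requires every entry of
# `indices` to be < batch * seqlen, otherwise output[indices] = values writes
# out of bounds and can surface as an illegal memory access.
batch, seqlen, hidden = 2, 8, 16
attention_mask = torch.tensor([[1] * 5 + [0] * 3, [1] * 8], device="cuda")
indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
values = torch.randn(indices.numel(), hidden, device="cuda", dtype=torch.float16)

assert indices.max().item() < batch * seqlen, "indices out of range for pad_input"

# Same operation that flash_attn.bert_padding.index_put_first_axis performs.
output = torch.zeros(batch * seqlen, hidden, device="cuda", dtype=values.dtype)
output[indices] = values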
Please check that this issue hasn't been reported before.
Expected Behavior
I expect similar loss and grad_norm when training a model with the same settings, regardless of whether flash attention is enabled or not.
Current behaviour
Currently, during the training steps (right from the start), I see messages like
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 6.545084971874738e-06, 'epoch': 0.4}
for a few steps, before an error appears and the training stops.
However, if flash attention is disabled with
flash_attention: false
then the network trains normally:
{'loss': 3.0972, 'grad_norm': 0.76171875, 'learning_rate': 3.4549150281252635e-06, 'epoch': 0.6}
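As a sanity check on the kernel itself (separate from the axolotl run), comparing flash attention against PyTorch's scaled_dot_product_attention on random tensors should give closely matching, NaN-free outputs. A rough sketch, with arbitrary shapes that are not taken from my config:

import torch
from flash_attn import flash_attn_func

# Arbitrary small shapes for illustration. flash_attn_func takes
# (batch, seqlen, nheads, headdim) in fp16/bf16 on CUDA, while
# F.scaled_dot_product_attention takes (batch, nheads, seqlen, headdim).
b, s, h, d = 2, 128, 8, 64
q = torch.randn(b, s, h, d, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

out_flash = flash_attn_func(q, k, v, causal=True)
out_ref = torch.nn.functional.scaled_dot_product_attention(
    q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2), is_causal=True
).transpose(1, 2)

print("max abs diff:", (out_flash - out_ref).abs().max().item())
print("flash output has nan:", torch.isnan(out_flash).any().item())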
Steps to reproduce
echo "Setting up python venv..." python -m venv venv source venv/bin/activate python -m pip install --upgrade pip pip install -U wheel pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124 -I pip install ninja export TORCH_CUDA_ARCH_LIST="8.6;8.9" export CUDA_VISIBLE_DEVICES=2 export LD_LIBRARY_PATH=/home/huada524/ondemand/data/sys/myjobs/projects/default/1/venv/lib64/python3.10/site-packages/nvidia/nvjitlink/lib:$LD_LIBRARY_PATH pip install -v -U "git+https://github.com/facebookresearch/xformers.git@main#egg=xformers"
cd axolotl
git pull
pip install packaging
pip install -e '.[flash-attn,deepspeed]'
I manually disabled the xformers entry in axolotl/requirements.txt so the install would not override the version I just compiled.
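For reference, the edit amounts to roughly the following (a sketch only; the exact requirement line may differ):

from pathlib import Path

# Comment out the xformers pin in axolotl's requirements.txt so that
# `pip install -e .` keeps the locally built xformers (path/line format assumed).
req = Path("axolotl/requirements.txt")
patched = "\n".join(
    "# " + line if line.strip().startswith("xformers") else line
    for line in req.read_text().splitlines()
)
req.write_text(patched + "\n")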
I also had to apply the patch from https://github.com/microsoft/DeepSpeed/issues/5603 to make sure axolotl would launch.
cd ..
module load python
module load cuda
source venv/bin/activate
export CUDA_VISIBLE_DEVICES=2
export WANDB_API_KEY=xxxxxxx
export LD_LIBRARY_PATH=/home/huada524/ondemand/data/sys/myjobs/projects/default/1/venv/lib64/python3.10/site-packages/nvidia/nvjitlink/lib:$LD_LIBRARY_PATH
accelerate launch -m axolotl.cli.train config_llama3_40B_dora.yaml
Possible solution
No response
Which Operating Systems are you using?
Python Version
3.10
axolotl branch-commit
5783839c6e29bb148041338772040c85aaae4646
Acknowledgements