THUDM / ChatGLM2-6B

ChatGLM2-6B: An Open Bilingual Chat LLM | 开源双语对话语言模型

RuntimeError: CUDA error: an illegal memory access was encountered #602

Open leoluopy opened 10 months ago

leoluopy commented 10 months ago

Is there an existing issue for this?

Current Behavior

Traceback (most recent call last):
  File "main.py", line 411, in <module>
    main()
  File "main.py", line 350, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/leo/.pyenv/versions/anaconda3-2021.05/lib/python3.8/site-packages/transformers/trainer.py", line 1645, in train
    return inner_training_loop(
  File "/home/leo/.pyenv/versions/anaconda3-2021.05/lib/python3.8/site-packages/transformers/trainer.py", line 1938, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/leo/.pyenv/versions/anaconda3-2021.05/lib/python3.8/site-packages/transformers/trainer.py", line 2770, in training_step
    self.accelerator.backward(loss)
  File "/home/leo/.pyenv/versions/anaconda3-2021.05/lib/python3.8/site-packages/accelerate/accelerator.py", line 1821, in backward
    loss.backward(**kwargs)
  File "/home/leo/.pyenv/versions/anaconda3-2021.05/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/leo/.pyenv/versions/anaconda3-2021.05/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/leo/.pyenv/versions/anaconda3-2021.05/lib/python3.8/site-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/home/leo/.pyenv/versions/anaconda3-2021.05/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 157, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/home/leo/.pyenv/versions/anaconda3-2021.05/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
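The error message itself suggests rerunning with `CUDA_LAUNCH_BLOCKING=1` so that the stack trace points at the kernel that actually failed. A minimal sketch of one way to do that from Python is below; the helper names `blocking_env` and `run_with_blocking` are made up for illustration and are not part of the ChatGLM2-6B repo or of PyTorch. It only sets the environment variable for a child process, so it should work with whatever launch command you normally use.

```python
import os
import subprocess
import sys


def blocking_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base is None else base)
    # With this set, CUDA kernel launches run synchronously, so the error
    # is raised at the call that caused it instead of at a later API call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_with_blocking(cmd):
    """Run cmd (a list of strings) with CUDA_LAUNCH_BLOCKING=1; return its exit code."""
    return subprocess.run(cmd, env=blocking_env()).returncode


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python debug_run.py main.py <your usual training arguments>
    sys.exit(run_with_blocking([sys.executable, *sys.argv[1:]]))
```

Training will be noticeably slower in this mode, so it is probably only worth using for the run that reproduces the crash.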

Expected Behavior

No response

Steps To Reproduce

Install the requirements and start P-Tuning. After several training steps, it crashes with the error above.

Environment

- OS: ubuntu 18.04
- Python: 3.8
- Transformers: 4.30.2
- PyTorch: 2.0
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) : True

Anything else?

No response

leoluopy commented 10 months ago

I found a solution, which takes a few steps:

- Remove the old Python environment entirely and reinstall it. I am now using anaconda3-2023.09 with Python 3.11.
- Then update the CUDA driver. I am now using nvidia-driver-535. Here's the install guide: https://blog.csdn.net/m0_59023219/article/details/131000872
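To confirm which versions are actually in use before and after the reinstall, something like the sketch below may help. It assumes `nvidia-smi` is on the `PATH` when a driver is installed, and the function names (`package_version`, `driver_version`, `environment_report`) are hypothetical helpers, not part of any library. Missing packages or a missing driver are reported as `None` rather than raising.

```python
import shutil
import subprocess
import sys
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def driver_version():
    """Return the NVIDIA driver version reported by nvidia-smi, or None."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    lines = result.stdout.strip().splitlines()
    return lines[0].strip() if lines else None


def environment_report():
    """Collect the versions that matter for this issue into one dict."""
    return {
        "python": ".".join(map(str, sys.version_info[:3])),
        "torch": package_version("torch"),
        "transformers": package_version("transformers"),
        "nvidia_driver": driver_version(),
    }


print(environment_report())
```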

Hope this helps someone later.