BlinkDL / RWKV-LM

RWKV is an RNN with transformer-level LLM performance. It can be trained directly like a GPT (parallelizable). So it combines the best of RNN and transformer: great performance, fast inference, low VRAM usage, fast training, "infinite" ctx_len, and free sentence embedding.
Apache License 2.0
12.39k stars · 843 forks

RuntimeError: CUDA error: an illegal memory access was encountered #79

Closed · cahya-wirawan closed this issue 1 year ago

cahya-wirawan commented 1 year ago

Hi, when I try to fine-tune the model using the following command:

python train.py --load_model /home/cahya/Work/models/rwkv/rwkv-4-pile-3b/RWKV-4-Pile-3B-20221110-ctx4096.pth --wandb "rwkv-3b" --proj_dir "out-3b" \
        --data_file "./train.npy" --data_type "numpy" --vocab_size 50277 \
        --ctx_len 4096 --epoch_steps 1000 --epoch_count 100 --epoch_begin 0 --epoch_save 5 \
        --micro_bsz 1 --n_layer 32 --n_embd 2560 --pre_ffn 0 --head_qk 0 \
        --lr_init 4e-4 --lr_final 1e-5 --warmup_steps 0 --beta1 0.9 --beta2 0.999 --adam_eps 1e-8 \
        --accelerator gpu --devices 1 --precision bf16 --strategy deepspeed_stage_2 --grad_cp 0

I get the following error message:

Traceback (most recent call last):
  File "/home/cahya/_Work/RWKV-LM/RWKV-v4neo/train.py", line 350, in <module>
    trainer.fit(model, data_loader)
  File "/home/cahya/miniconda3/envs/rwkv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/home/cahya/miniconda3/envs/rwkv/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 63, in _call_and_handle_interrupt
    trainer._teardown()
  File "/home/cahya/miniconda3/envs/rwkv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1175, in _teardown
    self.strategy.teardown()
  File "/home/cahya/miniconda3/envs/rwkv/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 490, in teardown
    super().teardown()
  File "/home/cahya/miniconda3/envs/rwkv/lib/python3.9/site-packages/pytorch_lightning/strategies/parallel.py", line 125, in teardown
    super().teardown()
  File "/home/cahya/miniconda3/envs/rwkv/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 496, in teardown
    self.lightning_module.cpu()
  File "/home/cahya/miniconda3/envs/rwkv/lib/python3.9/site-packages/lightning_fabric/utilities/device_dtype_mixin.py", line 78, in cpu
    return super().cpu()
  File "/home/cahya/miniconda3/envs/rwkv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 954, in cpu
    return self._apply(lambda t: t.cpu())
  File "/home/cahya/miniconda3/envs/rwkv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/home/cahya/miniconda3/envs/rwkv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/home/cahya/miniconda3/envs/rwkv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 954, in <lambda>
    return self._apply(lambda t: t.cpu())
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

What could be wrong here? It worked when I fine-tuned the 169M model. The GPU is an A100. Thanks!
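A side note on reading this kind of traceback: CUDA reports many errors asynchronously, so the Python frame that raises (here the `.cpu()` call during trainer teardown) is often not where the kernel actually failed. A common first debugging step, sketched below and not specific to RWKV, is to force synchronous kernel launches so the failing op raises at its real call site:

```python
import os

# CUDA launches kernels asynchronously, so an illegal memory access can be
# reported from a later, unrelated call. Setting this makes every launch
# synchronous, so the traceback points at the kernel that actually failed.
# It must be set before torch initializes CUDA (e.g. at the top of train.py),
# and it slows training down, so use it only while debugging.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```

With this set, rerunning the same `python train.py ...` command should produce a traceback that identifies the failing operation instead of the teardown path.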

cahya-wirawan commented 1 year ago

It seems it needs more than one GPU: after I ran it on all 8 GPUs, it now works.
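That outcome is consistent with a memory problem (out-of-bounds or out-of-memory conditions inside custom CUDA kernels can surface as "illegal memory access" rather than a clean OOM). A rough back-of-the-envelope, under assumptions not stated in the issue (Adam with fp32 master weights and moments, as in a typical DeepSpeed setup; activations ignored), shows why a 3B-parameter fine-tune is tight on a single 40 GB A100 but comfortable sharded across 8:

```python
# Hypothetical memory estimate for training a 3B-parameter model in bf16.
N = 3_000_000_000            # parameters (RWKV-4-Pile-3B)
weights = 2 * N              # bf16 weights, 2 bytes each
grads   = 2 * N              # bf16 gradients
adam    = (4 + 4 + 4) * N    # fp32 master weights + two Adam moment buffers

# Single GPU: everything resident on one device.
total_gb = (weights + grads + adam) / 2**30

# DeepSpeed ZeRO stage 2 (the --strategy deepspeed_stage_2 flag) shards
# gradients and optimizer states across ranks; weights stay replicated.
per_gpu_zero2_gb = (weights + (grads + adam) / 8) / 2**30

print(f"single GPU: ~{total_gb:.0f} GiB")
print(f"per GPU, ZeRO-2 on 8 GPUs: ~{per_gpu_zero2_gb:.0f} GiB")
```

Under these assumptions the single-GPU figure exceeds a 40 GB A100 before activations are even counted, while the 8-GPU ZeRO-2 figure fits easily, matching the observed behavior. An 80 GB A100, `--grad_cp 1` (gradient checkpointing), or a CPU-offload DeepSpeed stage may also make single-GPU fine-tuning feasible.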