BlinkDL / RWKV-LM

RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it's combining the best of RNN and transformer - great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding.
Apache License 2.0

train.py: "ninja build stopped" (lcudart, cicc, etc. not found) #104

Closed by leoplusx 1 year ago

leoplusx commented 1 year ago

I'm having trouble running the train.py file.

I keep getting errors when building the CUDA extensions:

ninja: build stopped: subcommand failed

Errors like this:

/usr/bin/ld: cannot find -lcudart

sh: 1: cicc: not found

It looks like torch can't find the CUDA library files.

So I searched for those files manually and then set the LD_LIBRARY_PATH environment variable accordingly. The search looked like this:

find / -name "libcuda.so"
/opt/conda/lib/stubs/libcuda.so
/opt/conda/pkgs/cuda-driver-dev-11.6.55-0/lib/stubs/libcuda.so
/opt/conda/lib64/libcuda.so
/usr/lib/x86_64-linux-gnu/libcuda.so

With some tweaking of env variables and creating symlinks, I was able to get past one error - but only to encounter the next one.
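
One detail worth checking here: the linker error names libcudart (the CUDA runtime library), while the search above was for libcuda.so (the driver library), and cicc is a compiler component that nvcc invokes rather than a shared library. The sketch below is a rough way to see which of these a given toolkit directory actually contains. It assumes the conventional toolkit layout (bin/nvcc, nvvm/bin/cicc, lib64/libcudart.so*); the function name and the default list of candidate roots are hypothetical and not part of RWKV-LM.

```python
import os
import shutil
from pathlib import Path


def locate_cuda_toolkit(candidates=None):
    """For each candidate toolkit root, report whether nvcc, cicc and
    libcudart are present. The layout checked here is an assumption."""
    if candidates is None:
        # Hypothetical defaults: env vars first, then common install roots.
        candidates = [
            os.environ.get("CUDA_HOME"),
            os.environ.get("CUDA_PATH"),
            "/usr/local/cuda",
            "/opt/conda",
        ]
    report = {}
    for root in candidates:
        if not root:
            continue  # skip unset environment variables
        root = Path(root)
        report[str(root)] = {
            "nvcc": (root / "bin" / "nvcc").is_file(),
            "cicc": (root / "nvvm" / "bin" / "cicc").is_file(),
            "libcudart": any(
                (root / d).is_dir() and any((root / d).glob("libcudart.so*"))
                for d in ("lib64", "lib")
            ),
        }
    return report


if __name__ == "__main__":
    print("nvcc on PATH:", shutil.which("nvcc"))
    for root, found in locate_cuda_toolkit().items():
        print(root, found)
```

If no root reports all three as present, the toolkit is probably incomplete or installed somewhere else, and symlinking individual files is unlikely to get the whole build through.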

The server has an NVIDIA GeForce RTX 4090.

I used this image:

nvidia/cuda:11.4.3-devel-ubuntu20.04

I installed these versions:

pip install torch -f https://download.pytorch.org/whl/cu111/torch_stable.html deepspeed==0.7.0 pytorch-lightning==1.9.2
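
Since the image ships CUDA 11.4.3 while the wheel index above is cu111, it can help to print what actually ended up installed before debugging the build. The following is a minimal sketch, not an official diagnostic: the function name is hypothetical, and it only uses torch.version.cuda, torch.cuda.is_available(), torch.cuda.get_device_capability() and the output of nvcc --version. It returns None entries for anything that is missing.

```python
import shutil
import subprocess


def cuda_versions():
    """Collect the CUDA version torch was built with, the GPU's compute
    capability, and the nvcc release line, where each is available."""
    info = {"torch_cuda": None, "device_capability": None, "nvcc": None}

    try:
        import torch
    except ImportError:
        torch = None  # torch not installed in this environment

    if torch is not None:
        info["torch_cuda"] = torch.version.cuda  # None for CPU-only builds
        if torch.cuda.is_available():
            info["device_capability"] = torch.cuda.get_device_capability(0)

    nvcc = shutil.which("nvcc")
    if nvcc:
        out = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
        lines = out.stdout.strip().splitlines()
        info["nvcc"] = lines[-1] if lines else None  # last line names the build

    return info


if __name__ == "__main__":
    for key, value in cuda_versions().items():
        print(f"{key}: {value}")
```

If the version torch reports and the version nvcc reports differ, or nvcc is missing entirely, that mismatch is a likely place to start looking.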

And ran the train file like this:

python train.py --load_model "" --wandb "" --proj_dir "out" \
--data_file "test.txt" --data_type "utf-8" --vocab_size 0 \
--ctx_len 512 --epoch_steps 5000 --epoch_count 500 --epoch_begin 0 --epoch_save 5 \
--micro_bsz 12 --n_layer 6 --n_embd 512 --pre_ffn 0 --head_qk 0 \
--lr_init 8e-4 --lr_final 1e-5 --warmup_steps 0 --beta1 0.9 --beta2 0.99 --adam_eps 1e-8 \
--accelerator gpu --devices 1 --precision bf16 --strategy ddp_find_unused_parameters_false --grad_cp 0

Questions:

Thank you.

BlinkDL commented 1 year ago

You need cuda 12+ for 4090

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
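
To confirm that the two exports took effect in the shell that launches train.py, something like the following can be run first. It is a small sketch under the assumption that the toolkit lives in /usr/local/cuda, as in the exports above; the helper names are hypothetical.

```python
import os


def has_dir(var_value, directory):
    """Return True if `directory` is one of the entries in a PATH-style string."""
    want = os.path.normpath(directory)
    return any(
        os.path.normpath(entry) == want
        for entry in var_value.split(os.pathsep)
        if entry  # ignore empty entries such as a trailing ':'
    )


def check_cuda_env(env=None, cuda_root="/usr/local/cuda"):
    """Check that PATH and LD_LIBRARY_PATH include the toolkit's bin and lib64."""
    env = os.environ if env is None else env
    return {
        "PATH": has_dir(env.get("PATH", ""), os.path.join(cuda_root, "bin")),
        "LD_LIBRARY_PATH": has_dir(
            env.get("LD_LIBRARY_PATH", ""), os.path.join(cuda_root, "lib64")
        ),
    }


if __name__ == "__main__":
    print(check_cuda_env())
```

Both entries should be True. Note that the exports only apply to the current shell session, so they need to be repeated (or added to a shell profile) in every new session or container.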