Can't start PPO_finetuning example with 1 machine and 1 GPU #14

flowersteam / lamorel

Lamorel is a Python library designed for RL practitioners eager to use Large Language Models (LLMs).

MIT License


RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Hi @tokarev-i-v,

Thanks for reaching out!

I updated the README, which was misleading (see PR #15). When GPUs are available, Accelerate automatically tries to assign a different device to each process. In your case, the lamorel launcher starts two processes, but only one GPU is available, so one of them ends up asking for a device that does not exist. To avoid this, launch the two processes by hand, so that Accelerate sees each of them as a single process:

RL script => python -m lamorel_launcher.launch --config-path absolute/path/to/project/examples/configs --config-name local_gpu_config rl_script_args.path=absolute/path/to/project/examples/example_script.py lamorel_args.accelerate_args.machine_rank=0

LLM server => python -m lamorel_launcher.launch --config-path absolute/path/to/project/examples/configs --config-name local_gpu_config rl_script_args.path=absolute/path/to/project/examples/example_script.py lamorel_args.accelerate_args.machine_rank=1
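
If it helps, here is a small sketch that builds both command lines from one place, so the only difference between them is `machine_rank`. The helper names (`build_launch_command`, `build_both_commands`) are hypothetical and not part of lamorel, and the paths are the same placeholders as above, so replace them with your own. It only prints the commands; you still run each one in its own terminal.

```python
import shlex
import sys


def build_launch_command(machine_rank, config_path, config_name, script_path):
    """Return the launcher command for one process as a list of arguments.

    Hypothetical helper: it only assembles the arguments shown in the
    commands above and does not call into lamorel itself.
    """
    return [
        sys.executable, "-m", "lamorel_launcher.launch",
        "--config-path", config_path,
        "--config-name", config_name,
        f"rl_script_args.path={script_path}",
        f"lamorel_args.accelerate_args.machine_rank={machine_rank}",
    ]


def build_both_commands(config_path, config_name, script_path):
    """Build one command per role: rank 0 for the RL script, rank 1 for the LLM server."""
    return {
        "rl_script": build_launch_command(0, config_path, config_name, script_path),
        "llm_server": build_launch_command(1, config_path, config_name, script_path),
    }


if __name__ == "__main__":
    # Placeholder paths copied from the commands above; replace with real absolute paths.
    commands = build_both_commands(
        config_path="absolute/path/to/project/examples/configs",
        config_name="local_gpu_config",
        script_path="absolute/path/to/project/examples/example_script.py",
    )
    for role, cmd in commands.items():
        # shlex.join quotes arguments so the line can be pasted into a shell.
        print(f"{role}: {shlex.join(cmd)}")
```

This uses `sys.executable` rather than a bare `python` so that both processes use the interpreter of the current environment.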
