OpenLLMAI / OpenRLHF

An Easy-to-use, Scalable and High-performance RLHF Framework (70B+ PPO Full Tuning & Iterative DPO & LoRA & Mixtral)
https://openrlhf.readthedocs.io/
Apache License 2.0

cuda.is_available is False in LLMRayActor #233

Closed: THINK2TRY closed this issue 3 months ago

THINK2TRY commented 4 months ago

Hi, thanks for the great work. I encountered a problem when initializing the vLLM engine during PPO training. It seems that the program cannot find any available GPUs during initialization.

  File "/workspace/code/OpenRLHF/openrlhf/trainer/ray/vllm_engine.py", line 37, in __init__
    self.llm = vllm.LLM(*args, **kwargs)
  File "/workspace/code/vllm/vllm/entrypoints/llm.py", line 109, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/workspace/zhenyu/code/vllm/vllm/engine/llm_engine.py", line 385, in from_engine_args
    engine_configs = engine_args.create_engine_configs()
  File "/workspace/code/vllm/vllm/engine/arg_utils.py", line 286, in create_engine_configs
    device_config = DeviceConfig(self.device)
  File "/workspace/code/vllm/vllm/config.py", line 484, in __init__
    raise RuntimeError("No supported device detected.")
RuntimeError: No supported device detected.

My environment:

vllm==0.3.2+cu123
transformers==4.38.1
accelerate==0.27.2
wuxibin89 commented 4 months ago

@THINK2TRY I can't reproduce this with vllm==0.3.2; can you post your run script? By the way, I fixed a vLLM version compatibility problem in https://github.com/OpenLLMAI/OpenRLHF/pull/215, so please make sure you're using the latest OpenRLHF.

THINK2TRY commented 4 months ago

Hi @wuxibin89, within the full training pipeline I hit this problem when setting tensor_parallel_size > 1, since I want to train a 30B+ model. The script is as follows:

ray job submit --address="http://127.0.0.1:8265" \
    --runtime-env-json='{"working_dir": "/workspace/code/OpenRLHF", "pip": "/workspace/code/OpenRLHF/requirements.txt"}' \
    -- python examples/train_ppo_ray.py \
    --ref_num_nodes 1 \
    --ref_num_gpus_per_node 2 \
    --reward_num_nodes 1 \
    --reward_num_gpus_per_node 2 \
    --critic_num_nodes 1 \
    --critic_num_gpus_per_node 4 \
    --actor_num_nodes 1 \
    --actor_num_gpus_per_node 4 \
    --vllm_num_engines 2 \
    --vllm_tensor_parallel_size 2 \
    --pretrain /workspace/checkpoints/model \
    --reward_pretrain /workspace/checkpoints/openrlhf/reward/ \
    --save_path /workspace/checkpoints/openrlhf/ppo \
    --micro_train_batch_size 4 \
    --train_batch_size 128 \
    --micro_rollout_batch_size 8 \
    --rollout_batch_size 1024 \
    --max_epochs 1 \
    --prompt_max_len 1024 \
    --generate_max_len 1024 \
    --zero_stage 2 \
    --bf16 \
    --actor_learning_rate 5e-7 \
    --critic_learning_rate 9e-6 \
    --init_kl_coef 0.01 \
    --prompt_data /workspace/data/prompt_data/prompt.jsonl \
    --prompt_data_probs 1 \
    --max_samples 80000 \
    --normalize_reward \
    --actor_init_on_gpu \
    --adam_offload \
    --gradient_checkpointing \

Then I tried to run vllm_engine.py directly and hit the same error.
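
The failure should be reproducible outside the training script with a minimal actor along these lines. This is only a sketch, not the real LLMRayActor: it assumes the engine actor itself holds no GPU, which is roughly the situation with tensor_parallel_size > 1, where vLLM is expected to place its own Ray workers on the GPUs. The model path is just the one from the script above.

import ray
import torch
import vllm

@ray.remote(num_gpus=0)
class MiniVLLMActor:
    """Minimal stand-in for LLMRayActor: build a vLLM engine inside a Ray actor."""

    def __init__(self, model_path: str):
        # Ray assigns no GPU to this actor, so CUDA_VISIBLE_DEVICES is empty here
        # and torch.cuda.is_available() returns False.
        print("cuda available in actor:", torch.cuda.is_available())
        # On current vLLM main, device auto-detection runs in this process and
        # raises RuntimeError("No supported device detected.") before vLLM can
        # schedule its tensor-parallel workers onto the GPUs.
        self.llm = vllm.LLM(model=model_path, tensor_parallel_size=2)

    def generate(self, prompts):
        return self.llm.generate(prompts)

ray.init()
engine = MiniVLLMActor.remote("/workspace/checkpoints/model")
print(ray.get(engine.generate.remote(["Hello"])))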

And ray status can see the available GPUs:

Resources
---------------------------------------------------------------
Usage:
 0.0/128.0 CPU
 0.0/8.0 GPU
 0B/703.08GiB memory
 0B/186.26GiB object_store_memory
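
The same numbers can also be checked from Python, e.g. with a quick script (nothing OpenRLHF-specific, just the standard Ray API):

import ray

# Attach to the already-running cluster (the one `ray status` talks to).
ray.init(address="auto")

# Total and currently free resources, matching the ray status output above.
print(ray.cluster_resources())     # e.g. {'CPU': 128.0, 'GPU': 8.0, ...}
print(ray.available_resources())
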
wuxibin89 commented 4 months ago

According to your script, you need 16 GPUs in total (2 reference + 2 reward + 4 actor + 4 critic + 4 vLLM; see the tally below), but your ray cluster only has 8 GPUs. Do you have 2 nodes with 8 GPUs each?

Resources
---------------------------------------------------------------
Usage:
 0.0/128.0 CPU
 0.0/8.0 GPU
 0B/703.08GiB memory
 0B/186.26GiB object_store_memory
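
Just to restate the tally from the script's flags:

# GPUs requested per role: nodes * gpus_per_node (or engines * tensor_parallel_size).
ref_gpus    = 1 * 2   # --ref_num_nodes 1     --ref_num_gpus_per_node 2
reward_gpus = 1 * 2   # --reward_num_nodes 1  --reward_num_gpus_per_node 2
actor_gpus  = 1 * 4   # --actor_num_nodes 1   --actor_num_gpus_per_node 4
critic_gpus = 1 * 4   # --critic_num_nodes 1  --critic_num_gpus_per_node 4
vllm_gpus   = 2 * 2   # --vllm_num_engines 2  --vllm_tensor_parallel_size 2

print(ref_gpus + reward_gpus + actor_gpus + critic_gpus + vllm_gpus)  # 16, vs. 8 GPUs in ray status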
wuxibin89 commented 4 months ago

Did you customize vLLM? I can't find this RuntimeError("No supported device detected.") in vllm==0.3.2:

  File "/workspace/code/OpenRLHF/openrlhf/trainer/ray/vllm_engine.py", line 37, in __init__
    self.llm = vllm.LLM(*args, **kwargs)
  File "/workspace/code/vllm/vllm/entrypoints/llm.py", line 109, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/workspace/zhenyu/code/vllm/vllm/engine/llm_engine.py", line 385, in from_engine_args
    engine_configs = engine_args.create_engine_configs()
  File "/workspace/code/vllm/vllm/engine/arg_utils.py", line 286, in create_engine_configs
    device_config = DeviceConfig(self.device)
  File "/workspace/code/vllm/vllm/config.py", line 484, in __init__
    raise RuntimeError("No supported device detected.")
RuntimeError: No supported device detected.
THINK2TRY commented 4 months ago

I'm using the latest vLLM from GitHub, installed via pip install -e . rather than pip install vllm==0.3.2. The error seems to come from a recent change: https://github.com/vllm-project/vllm/blob/05af6da8d927f70d15ab1ed25b01df3c967ad961/vllm/config.py#L506
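
The check that fails boils down to something like this (a paraphrased sketch of the linked DeviceConfig logic, not the exact vLLM source):

import torch

def detect_device_type() -> str:
    """Rough paraphrase of vLLM's device auto-detection on current main."""
    # Inside a Ray actor that holds no GPU, CUDA_VISIBLE_DEVICES is empty,
    # so torch.cuda.is_available() is False and we fall through to the error.
    if torch.cuda.is_available():
        return "cuda"
    # (the real code also probes other backends, e.g. Neuron, before giving up)
    raise RuntimeError("No supported device detected.")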

THINK2TRY commented 4 months ago

According to your script, you need 16 GPUs in total (2 reference + 2 reward + 4 actor + 4 critic + 4 vLLM), but your ray cluster only has 8 GPUs. Do you have 2 nodes with 8 GPUs each?

Resources
---------------------------------------------------------------
Usage:
 0.0/128.0 CPU
 0.0/8.0 GPU
 0B/703.08GiB memory
 0B/186.26GiB object_store_memory

Yes, 16 GPUs are used when I run the script.

wuxibin89 commented 4 months ago

I'm using the latest vLLM from GitHub, installed via pip install -e . rather than pip install vllm==0.3.2. The error seems to come from a recent change: https://github.com/vllm-project/vllm/blob/05af6da8d927f70d15ab1ed25b01df3c967ad961/vllm/config.py#L506

Ah, I think there's a bug in vLLM when LLMEngine is initialized inside a Ray actor. Please run git checkout -b v0.3.2 v0.3.2 and pip install -e . again. I will file an issue with them.

THINK2TRY commented 4 months ago

I will give it a try. Many thanks for your reply!

wuxibin89 commented 4 months ago

I submitted a fix PR: https://github.com/vllm-project/vllm/pull/3198