@THINK2TRY I can't reproduce this with vllm==0.3.2; can you post your run script? BTW, I fixed a vLLM version compatibility problem in https://github.com/OpenLLMAI/OpenRLHF/pull/215, so please make sure you're using the latest OpenRLHF.
Hi @wuxibin89, I encountered this problem in the whole training framework when setting tensor_parallel_size > 1, as I want to train a 30B+ model. The script is as follows:
ray job submit --address="http://127.0.0.1:8265" \
--runtime-env-json='{"working_dir": "/workspace/code/OpenRLHF", "pip": "/workspace/code/OpenRLHF/requirements.txt"}' \
-- python examples/train_ppo_ray.py \
--ref_num_nodes 1 \
--ref_num_gpus_per_node 2 \
--reward_num_nodes 1 \
--reward_num_gpus_per_node 2 \
--critic_num_nodes 1 \
--critic_num_gpus_per_node 4 \
--actor_num_nodes 1 \
--actor_num_gpus_per_node 4 \
--vllm_num_engines 2 \
--vllm_tensor_parallel_size 2 \
--pretrain /workspace/checkpoints/model \
--reward_pretrain /workspace/checkpoints/openrlhf/reward/ \
--save_path /workspace/checkpoints/openrlhf/ppo \
--micro_train_batch_size 4 \
--train_batch_size 128 \
--micro_rollout_batch_size 8 \
--rollout_batch_size 1024 \
--max_epochs 1 \
--prompt_max_len 1024 \
--generate_max_len 1024 \
--zero_stage 2 \
--bf16 \
--actor_learning_rate 5e-7 \
--critic_learning_rate 9e-6 \
--init_kl_coef 0.01 \
--prompt_data /workspace/data/prompt_data/prompt.jsonl \
--prompt_data_probs 1 \
--max_samples 80000 \
--normalize_reward \
--actor_init_on_gpu \
--adam_offload \
--gradient_checkpointing
Then I tried to run vllm_engine.py directly and hit the same error. And ray status shows that GPUs are available:
Resources
---------------------------------------------------------------
Usage:
0.0/128.0 CPU
0.0/8.0 GPU
0B/703.08GiB memory
0B/186.26GiB object_store_memory
According to your script, you need 16 GPUs in total (2 reference + 2 reward + 4 actor + 4 critic + 4 vLLM), but your Ray cluster only has 8 GPUs. Do you have 2 nodes with 8 GPUs each?
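For reference, the accounting works out as follows (a quick sketch; the names just mirror the flags in the submitted script):

ref    = 1 * 2  # ref_num_nodes * ref_num_gpus_per_node
reward = 1 * 2  # reward_num_nodes * reward_num_gpus_per_node
actor  = 1 * 4  # actor_num_nodes * actor_num_gpus_per_node
critic = 1 * 4  # critic_num_nodes * critic_num_gpus_per_node
vllm   = 2 * 2  # vllm_num_engines * vllm_tensor_parallel_size
print(ref + reward + actor + critic + vllm)  # 16, vs. the 8 GPUs ray status reports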
Do you use a customized vLLM? I can't find this RuntimeError("No supported device detected.") in vllm==0.3.2:
File "/workspace/code/OpenRLHF/openrlhf/trainer/ray/vllm_engine.py", line 37, in __init__
self.llm = vllm.LLM(*args, **kwargs)
File "/workspace/code/vllm/vllm/entrypoints/llm.py", line 109, in __init__
self.llm_engine = LLMEngine.from_engine_args(engine_args)
File "/workspace/zhenyu/code/vllm/vllm/engine/llm_engine.py", line 385, in from_engine_args
engine_configs = engine_args.create_engine_configs()
File "/workspace/code/vllm/vllm/engine/arg_utils.py", line 286, in create_engine_configs
device_config = DeviceConfig(self.device)
File "/workspace/code/vllm/vllm/config.py", line 484, in __init__
raise RuntimeError("No supported device detected.")
RuntimeError: No supported device detected.
I use the latest vLLM from GitHub and installed it via pip install -e . rather than pip install vllm==0.3.2. It seems that the error comes from a recent update: https://github.com/vllm-project/vllm/blob/05af6da8d927f70d15ab1ed25b01df3c967ad961/vllm/config.py#L506
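For context, that check roughly looks like this (a paraphrased sketch of the linked DeviceConfig, not a verbatim copy of vllm/config.py):

import torch

class DeviceConfig:
    def __init__(self, device: str = "auto") -> None:
        if device == "auto":
            # Auto-detection: the constructor probes the current process for
            # a usable device; if none is visible, it gives up.
            if torch.cuda.is_available():
                self.device_type = "cuda"
            else:
                raise RuntimeError("No supported device detected.")
        else:
            self.device_type = device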
Yes, I have 2 nodes with 8 GPUs each; 16 GPUs are used when I run the script.
Ah, I think there's a bug in vLLM when LLMEngine is initialized in a Ray actor. Please git checkout -b v0.3.2 v0.3.2 and pip install -e . again. I will file an issue with them.
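To illustrate the failure mode, here is a minimal sketch (assumptions: as in OpenRLHF, the engine actor itself is created with num_gpus=0 when tensor_parallel_size > 1 so that vLLM's own Ray workers claim the GPUs; the model name is only a placeholder):

import ray
import vllm

# The actor process itself reserves no GPU; vLLM is expected to spawn its
# own Ray workers that hold the GPUs for tensor parallelism.
@ray.remote(num_gpus=0)
class LLMRayActor:
    def __init__(self, *args, **kwargs):
        # DeviceConfig("auto") runs in this process, where
        # torch.cuda.is_available() is False, so recent vLLM raises
        # "No supported device detected." before any worker starts.
        self.llm = vllm.LLM(*args, **kwargs)

    def generate(self, *args, **kwargs):
        return self.llm.generate(*args, **kwargs)

ray.init()
engine = LLMRayActor.remote("facebook/opt-125m", tensor_parallel_size=2)
ray.get(engine.generate.remote("Hello"))  # surfaces the RuntimeError from __init__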
I will give it a try. Many thanks for your reply!
I submitted a fix PR: https://github.com/vllm-project/vllm/pull/3198
Hi, thanks for the great work. I encountered a problem when initializing the vLLM engine in PPO training. It seems that the program cannot find available GPUs during initialization.
For my environment