wukaixingxp opened 1 month ago
Hi! I cannot reproduce this (on an A40 node). Are you already using a Ray instance? I think that might be the issue, as I don't get the autoscaler messages shown in your log. I also haven't been able to initialize multiple models inside a multiprocessing context, as vLLM wants to create child processes and that's not allowed.
cc: @mgoin in case they have any tips!
My command: `lm_eval --model vllm --model_args pretrained=meta-llama/Meta-Llama-3.1-8B,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8,data_parallel_size=4 --tasks lambada_openai --batch_size auto`
Log:
lm_eval --model vllm --model_args pretrained=meta-llama/Meta-Llama-3.1-8B,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8,data_parallel_size=4 --tasks lambada_openai --batch_size auto
2024-10-04:14:46:18,784 INFO [__main__.py:279] Verbosity set to INFO
2024-10-04:14:46:18,815 INFO [__init__.py:491] `group` and `group_alias` keys in TaskConfigs are deprecated and will be removed in v0.4.5 of lm_eval. The new `tag` field will be used to allow for a shortcut to a group of tasks one does not wish to aggregate metrics across. `group`s which aggregate across subtasks must be only defined in a separate group config file, which will be the official way to create groups that support cross-task aggregation as in `mmlu`. Please see the v0.4.4 patch notes and our documentation: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#advanced-group-configs for more information.
2024-10-04:14:46:23,062 INFO [__main__.py:376] Selected Tasks: ['lambada_openai']
2024-10-04:14:46:23,152 INFO [evaluator.py:161] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-10-04:14:46:23,152 INFO [evaluator.py:198] Initializing vllm model, with arguments: {'pretrained': 'meta-llama/Meta-Llama-3.1-8B', 'tensor_parallel_size': 1, 'dtype': 'auto', 'gpu_memory_utilization': 0.8, 'data_parallel_size': 4}
2024-10-04:14:46:23,152 WARNING [vllm_causallms.py:105] You might experience occasional issues with model weight downloading when data_parallel is in use. To ensure stable performance, run with data_parallel_size=1 until the weights are downloaded and cached.
2024-10-04:14:46:23,152 INFO [vllm_causallms.py:110] Manual batching is not compatible with data parallelism.
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.16M/1.16M [00:00<00:00, 5.40MB/s]
Generating test split: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 5153/5153 [00:00<00:00, 654272.83 examples/s]
2024-10-04:14:46:27,058 WARNING [task.py:337] [Task: lambada_openai] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-10-04:14:46:27,059 WARNING [task.py:337] [Task: lambada_openai] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-10-04:14:46:27,099 INFO [evaluator.py:279] Setting fewshot random generator seed to 1234
2024-10-04:14:46:27,099 WARNING [model.py:422] model.chat_template was called with the chat_template set to False or None. Therefore no chat template will be applied. Make sure this is an intended behavior.
2024-10-04:14:46:27,100 INFO [task.py:423] Building contexts for lambada_openai on rank 0...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5153/5153 [00:05<00:00, 969.30it/s]
2024-10-04:14:46:32,456 INFO [evaluator.py:465] Running loglikelihood requests
Running loglikelihood requests: 0%| | 0/5153 [00:00<?, ?it/s]2024-10-04 14:46:35,488 INFO worker.py:1783 -- Started a local Ray instance.
(run_inference_one_model pid=70741) WARNING 10-04 14:46:42 arg_utils.py:930] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
(run_inference_one_model pid=70741) INFO 10-04 14:46:42 config.py:1010] Chunked prefill is enabled with max_num_batched_tokens=512.
(run_inference_one_model pid=70741) Calling ray.init() again after it has already been called.
(autoscaler +26s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +26s) Error: No available node types can fulfill resource request {'node:2401:db00:23c:1314:face:0:34f:0': 0.001, 'GPU': 1.0}. Add suitable node types to this cluster to resolve this issue.
(run_inference_one_model pid=70951) INFO 10-04 14:46:52 ray_utils.py:183] Waiting for creating a placement group of specs for 10 seconds. specs=[{'node:2401:db00:23c:1314:face:0:34f:0': 0.001, 'GPU': 1.0}]. Check `ray status` to see if you have enough resources.
(run_inference_one_model pid=70814) WARNING 10-04 14:46:43 arg_utils.py:930] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False. [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(run_inference_one_model pid=70814) INFO 10-04 14:46:43 config.py:1010] Chunked prefill is enabled with max_num_batched_tokens=512. [repeated 3x across cluster]
(run_inference_one_model pid=70951) INFO 10-04 14:47:12 ray_utils.py:183] Waiting for creating a placement group of specs for 30 seconds. specs=[{'node:2401:db00:23c:1314:face:0:34f:0': 0.001, 'GPU': 1.0}]. Check `ray status` to see if you have enough resources. [repeated 4x across cluster]
(autoscaler +1m1s) Error: No available node types can fulfill resource request {'node:2401:db00:23c:1314:face:0:34f:0': 0.001, 'GPU': 1.0}. Add suitable node types to this cluster to resolve this issue.
(run_inference_one_model pid=70951) INFO 10-04 14:47:52 ray_utils.py:183] Waiting for creating a placement group of specs for 70 seconds. specs=[{'node:2401:db00:23c:1314:face:0:34f:0': 0.001, 'GPU': 1.0}]. Check `ray status` to see if you have enough resources. [repeated 4x across cluster]
(autoscaler +1m37s) Error: No available node types can fulfill resource request {'node:2401:db00:23c:1314:face:0:34f:0': 0.001, 'GPU': 1.0}. Add suitable node types to this cluster to resolve this issue.
I think vLLM will use `mp` (multiprocessing) for single-GPU inference by default, as stated in this line, but lm_eval is still using Ray. Correct me if I am wrong.
It should use `mp` when `data_parallel_size=1`. Otherwise we need to initialize multiple vLLM instances, and I haven't found a way to do that outside of a Ray context.
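For context, vLLM's executor choice can also be set explicitly through the `distributed_executor_backend` engine argument (`"ray"` or `"mp"`) in recent releases. A minimal sketch at the vLLM API level, assuming a recent vLLM version; whether forcing it matters here depends on your setup:

```python
# Sketch only: a single-GPU vLLM engine with the multiprocessing executor backend,
# so Ray is never involved. With tensor_parallel_size=1 the default is already a
# non-Ray executor, so this is mostly an explicit illustration of the knob.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.8,
    distributed_executor_backend="mp",  # accepts "ray" or "mp" in recent vLLM versions
)
```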
You could try `@ray.remote(num_gpus=1, num_cpus=1)` here, and maybe also calling `ray.init(...)` with the appropriate args beforehand, but IIRC this didn't work properly with `tensor_parallel_size > 1`.
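Roughly, a data-parallel run boils down to something like the sketch below. This is illustrative only, loosely mirroring the harness's `run_inference_one_model` Ray task; the function body, prompts, and sampling parameters are made up for the example:

```python
# Illustrative sketch: one single-GPU vLLM engine per Ray task (data parallelism).
import ray
from vllm import LLM, SamplingParams

ray.init(num_gpus=4)  # or ray.init(address="auto") to attach to an existing cluster

@ray.remote(num_gpus=1, num_cpus=1)
def run_inference_one_model(model_name: str, prompts: list[str]):
    # Each task builds its own engine on the single GPU that Ray assigned to it.
    llm = LLM(model=model_name, tensor_parallel_size=1, gpu_memory_utilization=0.8)
    return llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=32))

prompts = ["Ray is", "vLLM is", "Data parallelism means", "Tensor parallelism means"]
shards = [prompts[i::4] for i in range(4)]  # split the requests across the 4 replicas
outputs = ray.get([
    run_inference_one_model.remote("meta-llama/Meta-Llama-3.1-8B", shard)
    for shard in shards
])
```

If a Ray cluster is already attached (note the `Calling ray.init() again after it has already been called.` line in the log above), the placement groups requested inside these tasks may be asking that cluster for resources it can't provide, which would match the autoscaler errors shown earlier.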
Alternatively, you could serve the models separately and use `local-completions` to send in the requests.
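For example, something along these lines (ports, concurrency, and the per-GPU setup are placeholders; `vllm serve` and the `local-completions` backend each have their own documented options):

```bash
# Start one OpenAI-compatible vLLM server per GPU, each on its own port
# (repeat with CUDA_VISIBLE_DEVICES=1,2,3 and --port 8001/8002/8003).
CUDA_VISIBLE_DEVICES=0 vllm serve meta-llama/Meta-Llama-3.1-8B --port 8000

# Point lm_eval at one of the servers via the local-completions backend.
lm_eval --model local-completions \
  --tasks lambada_openai \
  --model_args model=meta-llama/Meta-Llama-3.1-8B,base_url=http://localhost:8000/v1/completions,num_concurrent=8,max_retries=3,tokenized_requests=False
```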
We noticed that `lm_eval --model vllm` did not work when `data_parallel_size > 1` and got `Error: No available node types can fulfill resource request` from Ray. After some research, I believe that when `tensor_parallel_size=1` we should use multiprocessing instead of Ray (in this line) for the latest vLLM. My code works with `data_parallel_size=1` but got the following error when `data_parallel_size > 1`; the logs are below, please help!
Log:
Meanwhile, `ray status` shows 4 GPUs available: