wukaixingxp opened 1 month ago
Hi! I cannot reproduce this (on an A40 node). Are you already using a Ray instance? I think that might be the issue, as I don't get the autoscaler messages shown in your log. I also haven't been able to initialize multiple models inside a multiprocessing context, as vLLM wants to create child processes and that's not allowed.
cc: @mgoin in case they have any tips!
My command: `lm_eval --model vllm --model_args pretrained=meta-llama/Meta-Llama-3.1-8B,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8,data_parallel_size=4 --tasks lambada_openai --batch_size auto`
Log:
lm_eval --model vllm --model_args pretrained=meta-llama/Meta-Llama-3.1-8B,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8,data_parallel_size=4 --tasks lambada_openai --batch_size auto
2024-10-04:14:46:18,784 INFO [__main__.py:279] Verbosity set to INFO
2024-10-04:14:46:18,815 INFO [__init__.py:491] `group` and `group_alias` keys in TaskConfigs are deprecated and will be removed in v0.4.5 of lm_eval. The new `tag` field will be used to allow for a shortcut to a group of tasks one does not wish to aggregate metrics across. `group`s which aggregate across subtasks must be only defined in a separate group config file, which will be the official way to create groups that support cross-task aggregation as in `mmlu`. Please see the v0.4.4 patch notes and our documentation: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#advanced-group-configs for more information.
2024-10-04:14:46:23,062 INFO [__main__.py:376] Selected Tasks: ['lambada_openai']
2024-10-04:14:46:23,152 INFO [evaluator.py:161] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-10-04:14:46:23,152 INFO [evaluator.py:198] Initializing vllm model, with arguments: {'pretrained': 'meta-llama/Meta-Llama-3.1-8B', 'tensor_parallel_size': 1, 'dtype': 'auto', 'gpu_memory_utilization': 0.8, 'data_parallel_size': 4}
2024-10-04:14:46:23,152 WARNING [vllm_causallms.py:105] You might experience occasional issues with model weight downloading when data_parallel is in use. To ensure stable performance, run with data_parallel_size=1 until the weights are downloaded and cached.
2024-10-04:14:46:23,152 INFO [vllm_causallms.py:110] Manual batching is not compatible with data parallelism.
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.16M/1.16M [00:00<00:00, 5.40MB/s]
Generating test split: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 5153/5153 [00:00<00:00, 654272.83 examples/s]
2024-10-04:14:46:27,058 WARNING [task.py:337] [Task: lambada_openai] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-10-04:14:46:27,059 WARNING [task.py:337] [Task: lambada_openai] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-10-04:14:46:27,099 INFO [evaluator.py:279] Setting fewshot random generator seed to 1234
2024-10-04:14:46:27,099 WARNING [model.py:422] model.chat_template was called with the chat_template set to False or None. Therefore no chat template will be applied. Make sure this is an intended behavior.
2024-10-04:14:46:27,100 INFO [task.py:423] Building contexts for lambada_openai on rank 0...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5153/5153 [00:05<00:00, 969.30it/s]
2024-10-04:14:46:32,456 INFO [evaluator.py:465] Running loglikelihood requests
Running loglikelihood requests: 0%| | 0/5153 [00:00<?, ?it/s]2024-10-04 14:46:35,488 INFO worker.py:1783 -- Started a local Ray instance.
(run_inference_one_model pid=70741) WARNING 10-04 14:46:42 arg_utils.py:930] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
(run_inference_one_model pid=70741) INFO 10-04 14:46:42 config.py:1010] Chunked prefill is enabled with max_num_batched_tokens=512.
(run_inference_one_model pid=70741) Calling ray.init() again after it has already been called.
(autoscaler +26s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +26s) Error: No available node types can fulfill resource request {'node:2401:db00:23c:1314:face:0:34f:0': 0.001, 'GPU': 1.0}. Add suitable node types to this cluster to resolve this issue.
(run_inference_one_model pid=70951) INFO 10-04 14:46:52 ray_utils.py:183] Waiting for creating a placement group of specs for 10 seconds. specs=[{'node:2401:db00:23c:1314:face:0:34f:0': 0.001, 'GPU': 1.0}]. Check `ray status` to see if you have enough resources.
(run_inference_one_model pid=70814) WARNING 10-04 14:46:43 arg_utils.py:930] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False. [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(run_inference_one_model pid=70814) INFO 10-04 14:46:43 config.py:1010] Chunked prefill is enabled with max_num_batched_tokens=512. [repeated 3x across cluster]
(run_inference_one_model pid=70951) INFO 10-04 14:47:12 ray_utils.py:183] Waiting for creating a placement group of specs for 30 seconds. specs=[{'node:2401:db00:23c:1314:face:0:34f:0': 0.001, 'GPU': 1.0}]. Check `ray status` to see if you have enough resources. [repeated 4x across cluster]
(autoscaler +1m1s) Error: No available node types can fulfill resource request {'node:2401:db00:23c:1314:face:0:34f:0': 0.001, 'GPU': 1.0}. Add suitable node types to this cluster to resolve this issue.
(run_inference_one_model pid=70951) INFO 10-04 14:47:52 ray_utils.py:183] Waiting for creating a placement group of specs for 70 seconds. specs=[{'node:2401:db00:23c:1314:face:0:34f:0': 0.001, 'GPU': 1.0}]. Check `ray status` to see if you have enough resources. [repeated 4x across cluster]
(autoscaler +1m37s) Error: No available node types can fulfill resource request {'node:2401:db00:23c:1314:face:0:34f:0': 0.001, 'GPU': 1.0}. Add suitable node types to this cluster to resolve this issue.
I think vLLM will use `mp` (multiprocessing) for single-GPU inference by default, as stated in this line, but lm_eval is still using Ray. Correct me if I am wrong.
It should use `mp` when `data_parallel_size=1`. Otherwise we need to initialize multiple vLLM instances, and I haven't found a way to do that outside of a Ray context.
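For context, vLLM's executor choice can also be set explicitly through the `distributed_executor_backend` engine argument (`"ray"` or `"mp"`) in recent releases. A minimal sketch at the vLLM API level, assuming a recent vLLM version; whether forcing it matters here depends on your setup:

```python
# Sketch only: a single-GPU vLLM engine with the multiprocessing executor backend,
# so Ray is never involved. With tensor_parallel_size=1 the default is already a
# non-Ray executor, so this is mostly an explicit illustration of the knob.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.8,
    distributed_executor_backend="mp",  # accepts "ray" or "mp" in recent vLLM versions
)
```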
You could try `@ray.remote(num_gpus=1, num_cpus=1)` here, and maybe also calling `ray.init(...)` with the appropriate args beforehand, but IIRC this didn't work properly with `tensor_parallel_size > 1`.
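Roughly, a data-parallel run boils down to something like the sketch below. This is illustrative only, loosely mirroring the harness's `run_inference_one_model` Ray task; the function body, prompts, and sampling parameters are made up for the example:

```python
# Illustrative sketch: one single-GPU vLLM engine per Ray task (data parallelism).
import ray
from vllm import LLM, SamplingParams

ray.init(num_gpus=4)  # or ray.init(address="auto") to attach to an existing cluster

@ray.remote(num_gpus=1, num_cpus=1)
def run_inference_one_model(model_name: str, prompts: list[str]):
    # Each task builds its own engine on the single GPU that Ray assigned to it.
    llm = LLM(model=model_name, tensor_parallel_size=1, gpu_memory_utilization=0.8)
    return llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=32))

prompts = ["Ray is", "vLLM is", "Data parallelism means", "Tensor parallelism means"]
shards = [prompts[i::4] for i in range(4)]  # split the requests across the 4 replicas
outputs = ray.get([
    run_inference_one_model.remote("meta-llama/Meta-Llama-3.1-8B", shard)
    for shard in shards
])
```

If a Ray cluster is already attached (note the `Calling ray.init() again after it has already been called.` line in the log above), the placement groups requested inside these tasks may be asking that cluster for resources it can't provide, which would match the autoscaler errors shown earlier.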
Alternatively, you could serve the models separately and use `local-completions` to send in the requests.
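For example, something along these lines (ports, concurrency, and the per-GPU setup are placeholders; `vllm serve` and the `local-completions` backend each have their own documented options):

```bash
# Start one OpenAI-compatible vLLM server per GPU, each on its own port
# (repeat with CUDA_VISIBLE_DEVICES=1,2,3 and --port 8001/8002/8003).
CUDA_VISIBLE_DEVICES=0 vllm serve meta-llama/Meta-Llama-3.1-8B --port 8000

# Point lm_eval at one of the servers via the local-completions backend.
lm_eval --model local-completions \
  --tasks lambada_openai \
  --model_args model=meta-llama/Meta-Llama-3.1-8B,base_url=http://localhost:8000/v1/completions,num_concurrent=8,max_retries=3,tokenized_requests=False
```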
We noticed that `lm_eval --model vllm` did not work when `data_parallel_size > 1` and got `Error: No available node types can fulfill resource request` from Ray. After some research, I believe that when `tensor_parallel_size=1` we should use multiprocessing instead of Ray (in this line) for the latest vLLM. My code works with `data_parallel_size=1` but got the following error when `data_parallel_size > 1`; the logs are below, please help!
Log:
Meanwhile, `ray status` shows 4 GPUs available: