EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.

Process hangs when using `tensor_parallel_size` and `data_parallel_size` together #1734

Open harshakokel opened 2 months ago

harshakokel commented 2 months ago

Hello,

I noticed that my process hangs at `results = ray.get(object_refs)` when I use `data_parallel_size` as well as `tensor_parallel_size` for vllm models.

For example, this call would hang:

lm_eval --model vllm --model_args pretrained=gpt2,data_parallel_size=2,tensor_parallel_size=2 --tasks arc_easy --output ./trial/  --log_samples --limit 10

These would not:

lm_eval  --model vllm --model_args pretrained=gpt2,data_parallel_size=1,tensor_parallel_size=2 --tasks arc_easy --output ./trial/  --log_samples --limit 10
lm_eval  --model vllm --model_args pretrained=gpt2,data_parallel_size=2,tensor_parallel_size=1 --tasks arc_easy --output ./trial/  --log_samples --limit 10

Does anyone else face a similar problem?
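
For context, the data-parallel path looks roughly like the sketch below (simplified, not the harness's exact implementation; `model_args`, `request_chunks`, and the prompt values are just illustrative):

```python
import ray
from vllm import LLM, SamplingParams

@ray.remote
def run_inference_one_model(model_args, sampling_params, prompts):
    # Each data-parallel replica builds its own vLLM engine. With
    # tensor_parallel_size > 1 that engine also launches Ray workers to
    # shard the model across GPUs (at least on vLLM 0.3.x), so Ray work
    # ends up nested inside Ray tasks.
    llm = LLM(**model_args)
    return llm.generate(prompts, sampling_params=sampling_params)

# Illustrative values; the harness derives these from --model_args and the task.
model_args = {"model": "gpt2", "tensor_parallel_size": 2}
sampling_params = SamplingParams(max_tokens=16)
request_chunks = [["Hello"], ["World"]]  # one chunk per data-parallel replica

object_refs = [
    run_inference_one_model.remote(model_args, sampling_params, chunk)
    for chunk in request_chunks
]
results = ray.get(object_refs)  # <- this is where my run hangs
ray.shutdown()
```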

haileyschoelkopf commented 2 months ago

Hi! What version of vLLM are you running with?

@baberabb has observed problems like this before with later versions of vLLM (>v0.3.3, I believe).

harshakokel commented 2 months ago

I am on vllm 0.3.2.

harshakokel commented 2 months ago

Is this a vllm problem? Should I be raising an issue on that repo?

baberabb commented 2 months ago

Hey. Have you tried caching the weights by running with DP=1 first, until they are downloaded? I've found DP runs prone to hanging otherwise.
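
For example, something like this pre-populates the local HF cache for the `gpt2` checkpoint used above (just a sketch; running the DP=1 command from earlier once has the same effect):

```python
from huggingface_hub import snapshot_download

# Pre-download the checkpoint once so the DP replicas don't all try to
# fetch it at the same time; this only fills the local HF cache.
snapshot_download("gpt2")
```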

harshakokel commented 2 months ago

Yes, the weights are cached. The process is hanging after `llm.generate` returns results.

baberabb commented 2 months ago

> Yes, the weights are cached. The process is hanging after `llm.generate` returns results.

Hmm. It's working for me with 0.3.2. Have you tried running in a fresh virtual environment?

harshakokel commented 2 months ago

Just tried it on a separate server with a new env and I still face the same issue. What version of ray do you have? Mine is `ray==2.10.0`.

baberabb commented 2 months ago

> Just tried it on a separate server with a new env and I still face the same issue. What version of ray do you have? Mine is `ray==2.10.0`.

Probably the latest one. I installed it with `pip install -e ".[vllm]"` on RunPod with 4 GPUs.
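
If you want to compare environments quickly, something like this (just a sketch) prints the relevant versions:

```python
# Print the installed Ray and vLLM versions for comparison across machines.
import ray
import vllm

print("ray:", ray.__version__)
print("vllm:", vllm.__version__)
```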