djstrong opened this issue 8 months ago
Hi!
It's very surprising to me how slow vllm is. What batch size is vllm using? auto would be ideal. I faintly recall some issue mentioning that vllm gets slowed down by having to return logprobs. @baberabb, do you happen to know whether this is the case?
@djstrong Previous issue about batch size affecting predictions: https://github.com/EleutherAI/lm-evaluation-harness/issues/704#issuecomment-1670189773.
It is still a good idea to check whether proper padding and position_ids are applied to models in batch mode. Different HF transformers model classes handle under-specified batch inputs differently in model.generate and model.forward, starting from GPT2LMHeadModel at least.
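For anyone who wants to check this themselves, here is a minimal sketch of the kind of check I mean (my own example; gpt2 is used only because it is small, and the padding length is arbitrary):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # small stand-in model; any causal LM works for the check
tok = AutoTokenizer.from_pretrained(name)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(name).eval()

# Reference: last-token logits of the unpadded sequence.
ids = tok("Hello world", return_tensors="pt").input_ids
ref = model(ids).logits[0, -1]

# Left-pad the same sequence and pass explicit attention_mask / position_ids.
pad = torch.full((1, 3), tok.pad_token_id)
padded = torch.cat([pad, ids], dim=1)
mask = torch.cat([torch.zeros_like(pad), torch.ones_like(ids)], dim=1)
pos = (mask.cumsum(-1) - 1).clamp(min=0)
out = model(padded, attention_mask=mask, position_ids=pos).logits[0, -1]

# Should be close to zero when padding is handled correctly; a large gap
# means the model class ignores or mishandles the mask/position_ids.
print((ref - out).abs().max())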
Those improvements will help others to dive deeper. Does "normal" refer to the standard transformers (HF hub) implementation? @haileyschoelkopf
The score for bs=1 is 0.7033 and for bs=4 it is 0.7111 (with stderr 0.01).
Logprobs are different for bs=1 and bs=4:
Flash attention without compile causes an error on my setup:
RuntimeError: Failed to import transformers.models.mistral.modeling_mistral because of the following error (look up to see its traceback):
venv/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops15sum_IntList_out4callERKNS_6TensorEN3c1016OptionalArrayRefIlEEbSt8optionalINS5_10ScalarTypeEERS2_
vllm was running with bs=4. Loglikelihood tasks with vllm are 4 times slower than with hf (tested with bs=1 and bs=4).
@LSinev hf max_gen_toks: 50
You can replicate the vllm loglikelihood slowness and the different scores with e.g. lm_eval --model hf --model_args "pretrained=mistralai/Mistral-7B-v0.1" --output_path "date/"`date +%s` --tasks belebele_pol_Latn --num_fewshot 0 --device cuda:0 --batch_size 1 --log_samples
hf bs=1: 00:52, 0.3878
hf bs=4: 00:23, 0.39
vllm bs=1: 02:49, 0.39
vllm bs=4: 01:38, 0.3878
Thanks!
I'd recommend trying auto batch size for vllm and seeing if that helps the speed.
Those sorts of differences in logprobs are expected when batch size changes; the issue @LSinev linked is a good one to reference. Unfortunately this can't very easily be "fixed", but the differences in logprobs from it should be very tiny, as you're seeing, and should likely not cause deviations that exceed the stderr.
Like @haileyschoelkopf said, I think for a fair comparison you should use bs auto to take advantage of vLLM's continuous batching. I don't know if it slows down when logprobs are returned, but most of the tweaks in vllm are kv-cache related, so it makes sense that it doesn't do so well on non-generation tasks. They also have experimental support for prefix caching across batches (pass enable_prefix_caching=True to model_args; you might have to add it to the model init to format the boolean correctly), which might speed things up (especially for fewshot prompts).
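Something like this might do it, as an untested sketch (the uncertain part is whether the boolean in model_args gets parsed correctly without a change to the model init):
lm_eval --model vllm --model_args "pretrained=mistralai/Mistral-7B-v0.1,enable_prefix_caching=True" --tasks belebele_pol_Latn --num_fewshot 0 --batch_size auto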
Mistral was particularly sensitive to batch differences: see #1425. Not sure what the reason was. Llama, by comparison, not so much.
Thank you!
bs auto usually doesn't work, and that is the case here too (OOM):
vllm bs=auto: OOM
vllm bs=32: OOM
vllm bs=16: 01:31, 0.3856
Oh, that's probably because of the large sequence length for Mistral-7B (iirc it defaults to ~32000, and vllm preallocates memory according to that). You can set it lower, to 2048 or 4096, with max_model_len. Lowering gpu_memory_utilization from its default of 0.9 also sometimes helps with OOMs (but the former should be enough).
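For example, a sketch of the invocation (the gpu_memory_utilization=0.8 value is just an illustration; both options are passed through model_args):
lm_eval --model vllm --model_args "pretrained=mistralai/Mistral-7B-v0.1,max_model_len=4096,gpu_memory_utilization=0.8" --tasks belebele_pol_Latn --num_fewshot 0 --batch_size auto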
Thanks!
vllm bs=auto max_model_len=4096: 01:33 (+01:30 for "Processed prompts"?), 0.3856
Awesome! Added both these points to the readme in #1633 as this will probably confuse others too.
Using bs auto with vllm adds some extra time for "Processed prompts". I don't know what that is, but in the end it is slower than bs=1.
Remaining issues are the vllm loglikelihood slowness and the different scores with different batch sizes.
About the different scores with different batch sizes: I have run an evaluation with max_length=1, 2 examples, and bs 1 vs. 2.
lm_eval --model hf --model_args "pretrained=mistralai/Mistral-7B-v0.1,max_length=1" --output_path "date/"`date +%s` --tasks belebele_pol_Latn --num_fewshot 0 --device cuda:0 --batch_size 1 --log_samples --limit 2
lm_eval --model hf --model_args "pretrained=mistralai/Mistral-7B-v0.1,max_length=1" --output_path "date/"`date +%s` --tasks belebele_pol_Latn --num_fewshot 0 --device cuda:0 --batch_size 2 --log_samples --limit 2
Logs: https://www.diffchecker.com/CpV3RaDU/ (scores are the same with logits_cache=False)
I have found the exact place: https://github.com/EleutherAI/lm-evaluation-harness/blob/e9d429e105fa95dd4a1b5606b306289d207fcf62/lm_eval/models/huggingface.py#L1049 and replicated it with minimal code (I get the same numbers as in this line).
Model loaded on CPU with bfloat16 gives the same numbers:
>>> import torch
>>> from transformers import AutoModelForCausalLM
>>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype="auto")
>>> model(torch.tensor([[28994]])).logits
tensor([[[-11.5000, -11.2500, 3.2656, ..., -5.7500, -2.7500, -1.4375]]],
grad_fn=<ToCopyBackward0>)
>>> model(torch.tensor([[28994], [28994]])).logits
tensor([[[-11.5000, -11.2500, 3.2656, ..., -5.7500, -2.7500, -1.4375]],
[[-11.5000, -11.2500, 3.2656, ..., -5.7500, -2.7500, -1.4375]]],
grad_fn=<ToCopyBackward0>)
Model on GPU with bfloat16 gives different results:
>>> model.to('cuda')
>>> model(torch.tensor([[28994]]).to('cuda')).logits
tensor([[[-11.5625, -11.3125, 3.3281, ..., -5.7500, -2.6562, -1.3906]]],
device='cuda:0', grad_fn=<ToCopyBackward0>)
>>> model(torch.tensor([[28994], [28994]]).to('cuda')).logits
tensor([[[-11.6250, -11.3125, 3.1406, ..., -5.7812, -2.6562, -1.3516]],
[[-11.6250, -11.3125, 3.1406, ..., -5.7812, -2.6562, -1.3516]]],
device='cuda:0', grad_fn=<ToCopyBackward0>)
Model on GPU with float16 gives the same results:
>>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype=torch.float16, device_map='cuda')
>>> model(torch.tensor([[28994]]).to('cuda')).logits
tensor([[[-11.6406, -11.3828, 3.2852, ..., -5.7617, -2.6348, -1.3047]]],
device='cuda:0', grad_fn=<ToCopyBackward0>)
>>> model(torch.tensor([[28994], [28994]]).to('cuda')).logits
tensor([[[-11.6406, -11.3828, 3.2852, ..., -5.7578, -2.6211, -1.2930]],
[[-11.6406, -11.3828, 3.2852, ..., -5.7578, -2.6211, -1.2930]]],
device='cuda:0', grad_fn=<ToCopyBackward0>)
So the problem is only on GPU with bfloat16. Using:
>>> torch.use_deterministic_algorithms(True)
>>> torch.backends.cudnn.benchmark = False
does not help.
> So the problem is only on GPU with bfloat16
So the problem is actually in the model implementation somewhere in huggingface transformers, or in some specific modules in torch, and could be sorted out in their repos if reported there, not within the lm-evaluation-harness repository?
Why do you think it is a problem with the model implementation? But yes, it is not related to the lm-evaluation-harness repository. Maybe it is some GPU optimization (cuBLAS?).
Nice bug hunting!
I think this is related to the lower precision of bfloat16 compared to float16. So doing adds in a different order somewhere (because a different batch size launches a different kernel) is causing errors due to non-associativity of floating point math.
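A toy illustration of this (my own sketch, not from the harness): the same row multiplied by the same weight matrix can come out slightly different in bfloat16 depending on the batch size, because a different kernel, and thus a different reduction order, may be chosen:

import torch

torch.manual_seed(0)
x = torch.randn(1, 4096, dtype=torch.bfloat16, device="cuda")
w = torch.randn(4096, 4096, dtype=torch.bfloat16, device="cuda")

out_bs1 = x @ w                        # the row on its own ("batch size 1")
out_bs2 = (torch.cat([x, x]) @ w)[:1]  # the same row inside a batch of 2

# Mathematically identical, but on GPU in bfloat16 the difference is often
# nonzero; whether it shows up depends on the hardware and cuBLAS version.
print((out_bs1 - out_bs2).abs().max())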
> lower precision of bfloat16 compared to float16. So doing adds in a different order somewhere (because a different batch size launches a different kernel) is causing errors due to non-associativity of floating point math
If this is the case, it will show up with the same test procedure for several models, not just Mistral. It would be great if someone could confirm that.
UPD.: May be helpful (with sublinks too): https://github.com/huggingface/transformers/issues/28732
The same issue with meta-llama/Llama-2-7b-chat-hf.
Maybe it is resolved in a newer cuBLAS: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cublas-release-12-3-update-1 ? I am using CUDA 12.1, cuBLAS 12.1.3.1.
I am trying to speed up benchmarking on A100. Below are times of tests on one task in two versions using Mistral.
Unfortunately, using torch.compile and flash_attention slows down inference. Also, vllm is very slow for loglikelihood tasks. Another issue is that the scores with batch size 1 and 4 differ - tested with and without logits_cache, and with torch.use_deterministic_algorithms(True). Is it possible to obtain the same results? Maybe there is some problem with padding?