EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

Speed up inference problems #1625

Open djstrong opened 8 months ago

djstrong commented 8 months ago

I am trying to speed up benchmarking on an A100. Below are the times for tests on one task, in two versions, using Mistral.

[image: table of test times for one task in two versions using Mistral]

Unfortunately, using torch.compile and flash_attention slows down inference. Also, vllm is very slow for loglikelihood tasks.

Another issue is that the scores with batch size 1 and 4 differ - tested with and without logits_cache and torch.use_deterministic_algorithms(True). Is it possible to obtain the same results? Maybe there is some problem with padding?

haileyschoelkopf commented 8 months ago

Hi!

It's very surprising to me how slow vllm is. What batch size is vllm using? auto would be ideal. I faintly recall some issue mentioning that vllm gets slowed down by having to return logprobs--@baberabb, do you happen to know whether this is the case?

LSinev commented 8 months ago

@djstrong Previous issue about batch size affecting predictions: https://github.com/EleutherAI/lm-evaluation-harness/issues/704#issuecomment-1670189773.

It is still a good idea to check whether proper padding and position_ids are applied to models in batch mode — different HF transformers model classes handle underspecified batched inputs differently in model.generate and model.forward, starting from GPT2LMHeadModel at least.
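
For illustration, a minimal sketch (not harness code; the model name is just a placeholder) of passing an explicit attention_mask together with position_ids derived from it, so that left-padding does not shift token positions in a batched forward pass:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any HF causal LM follows the same pattern.
tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
tok.padding_side = "left"
model = AutoModelForCausalLM.from_pretrained("gpt2")

batch = tok(["short", "a much longer prompt"], return_tensors="pt", padding=True)
# Derive position_ids from the attention mask so that padding tokens
# do not shift the positions of the real tokens.
position_ids = batch["attention_mask"].cumsum(-1) - 1
position_ids.clamp_(min=0)
out = model(**batch, position_ids=position_ids)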

Those improvements would help others dive deeper:

djstrong commented 8 months ago

@haileyschoelkopf
The score for bs=1 is 0.7033 and for bs=4 it is 0.7111 (with stderr 0.01). The logprobs differ between bs=1 and bs=4: [image: logprobs for bs=1 vs. bs=4]

Flash attention without compile causes an error on my setup:

RuntimeError: Failed to import transformers.models.mistral.modeling_mistral because of the following error (look up to see its traceback):
venv/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops15sum_IntList_out4callERKNS_6TensorEN3c1016OptionalArrayRefIlEEbSt8optionalINS5_10ScalarTypeEERS2_
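
For what it's worth, an undefined-symbol error like this usually means the flash-attn extension was built against a different torch version than the one installed. Assuming that is the case here, rebuilding it against the current environment often resolves it:

pip uninstall -y flash-attn
pip install flash-attn --no-build-isolation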

vllm was running with bs=4. Loglikelihood tasks with vllm are 4 times slower than hf (tested with bs=1 and bs=4).

djstrong commented 8 months ago

@LSinev

  1. I expect that vllm will also be faster for loglikelihood tasks. transformers 4.39.1, vllm 0.3.2; this repo's state is from yesterday, commit cffc1bd3fd69453eaa75da891256682123226f0f
  2. Nothing special. I have bolded the best times, so hf is faster in loglikelihood, but vllm is faster in generate_until.
  3. Times for generate_until are missing because bs is too big.
  4. Normal means hf.
  5. Torch 2.1.2
  6. max_gen_toks: 50
djstrong commented 8 months ago

You can replicate the vllm loglikelihood slowness and the different scores with e.g. lm_eval --model hf --model_args "pretrained=mistralai/Mistral-7B-v0.1" --output_path "date/"`date +%s` --tasks belebele_pol_Latn --num_fewshot 0 --device cuda:0 --batch_size 1 --log_samples

hf   bs=1  00:52  0.3878
hf   bs=4  00:23  0.39
vllm bs=1  02:49  0.39
vllm bs=4  01:38  0.3878

haileyschoelkopf commented 8 months ago

Thanks!

I'd recommend trying auto batch size for vllm and seeing if that helps the speed.

Those sorts of differences in logprobs are expected when batch size changes--the issue @LSinev linked is a good one to reference. Unfortunately this can't very easily be "fixed", but the differences in logprobs from it should be very tiny, as you're seeing, and should likely not cause deviations that exceed stderr.

baberabb commented 8 months ago

Like @haileyschoelkopf said, I think for a fair comparison you should use bs auto to take advantage of vLLM's continuous batching. I don't know if it slows down when logprobs are returned, but most of the tweaks in vllm are kv-cache related, so it makes sense that it doesn't do as well on non-generation tasks. They also have experimental support for prefix caching across batches (pass enable_prefix_caching=True to model_args; might have to add it to the model init to format the boolean correctly), which might speed things up (esp. for fewshot prompts).
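
For reference, a hedged sketch of how that could look on the command line (model and task copied from earlier in this thread; whether enable_prefix_caching is forwarded cleanly depends on the harness and vLLM versions in use):

lm_eval --model vllm --model_args "pretrained=mistralai/Mistral-7B-v0.1,enable_prefix_caching=True" --tasks belebele_pol_Latn --num_fewshot 0 --batch_size auto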

mistral was particularly sensitive to batch differences: see #1425. Not sure what the reason was. Llama by comparison, not so much.

djstrong commented 8 months ago

Thank you!

bs auto usually doesn't work and also this is the case: OOM

vllm bs=auto  OOM
vllm bs=32    OOM
vllm bs=16    01:31  0.3856

baberabb commented 8 months ago

> Thank you!
>
> bs auto usually doesn't work and also this is the case: OOM
>
> vllm bs=auto  OOM
> vllm bs=32    OOM
> vllm bs=16    01:31  0.3856

Oh, that's probably because of the large sequence length for mistral-7B (iirc it defaults to ~32000, and vllm preallocates memory according to that). You can set it lower, to 2048 or 4096, with max_model_len. Lowering gpu_memory_utilization from its default of 0.9 also helps with OOMs sometimes (but the former should be enough).
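
As a sketch, assuming those kwargs are passed straight through model_args to vLLM's engine, that would look something like:

lm_eval --model vllm --model_args "pretrained=mistralai/Mistral-7B-v0.1,max_model_len=4096,gpu_memory_utilization=0.8" --tasks belebele_pol_Latn --num_fewshot 0 --batch_size auto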

djstrong commented 8 months ago

Thanks!

vllm bs=auto max_model_len=4096  01:33 (+01:30 for Processed prompts?)  0.3856

haileyschoelkopf commented 8 months ago

Awesome! Added both these points to the readme in #1633 as this will probably confuse others too.

djstrong commented 8 months ago

Using bs auto with vllm adds some extra time for "Processed prompts" - I don't know what that is, but in the end it is slower than bs=1.

Remaining issues are:

djstrong commented 8 months ago

Regarding the different scores with different batch sizes: I have run an evaluation with max_length=1, 2 examples, and bs=1 vs. bs=2.

lm_eval --model hf --model_args "pretrained=mistralai/Mistral-7B-v0.1,max_length=1" --output_path "date/"`date +%s` --tasks belebele_pol_Latn --num_fewshot 0 --device cuda:0 --batch_size 1 --log_samples --limit 2
lm_eval --model hf --model_args "pretrained=mistralai/Mistral-7B-v0.1,max_length=1" --output_path "date/"`date +%s` --tasks belebele_pol_Latn --num_fewshot 0 --device cuda:0 --batch_size 2 --log_samples --limit 2

Logs: https://www.diffchecker.com/CpV3RaDU/ (scores are the same with logits_cache=False)
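
(For anyone replicating: assuming logits_cache is exposed as an HFLM model arg, as in recent harness versions, it can be disabled by appending logits_cache=False to --model_args, e.g. "pretrained=mistralai/Mistral-7B-v0.1,max_length=1,logits_cache=False".)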

djstrong commented 8 months ago

I have found the exact place: https://github.com/EleutherAI/lm-evaluation-harness/blob/e9d429e105fa95dd4a1b5606b306289d207fcf62/lm_eval/models/huggingface.py#L1049 and replicated it with minimal code (I get the same numbers at this line).

Model loaded on CPU with bfloat16 gives the same numbers:

>>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype="auto")
>>> model(torch.tensor([[28994]])).logits
tensor([[[-11.5000, -11.2500,   3.2656,  ...,  -5.7500,  -2.7500,  -1.4375]]],
       grad_fn=<ToCopyBackward0>)
>>> model(torch.tensor([[28994], [28994]])).logits
tensor([[[-11.5000, -11.2500,   3.2656,  ...,  -5.7500,  -2.7500,  -1.4375]],

        [[-11.5000, -11.2500,   3.2656,  ...,  -5.7500,  -2.7500,  -1.4375]]],
       grad_fn=<ToCopyBackward0>)

Model on GPU with bfloat16 gives different results:

>>> model.to('cuda')
>>> model(torch.tensor([[28994]]).to('cuda')).logits
tensor([[[-11.5625, -11.3125,   3.3281,  ...,  -5.7500,  -2.6562,  -1.3906]]],
       device='cuda:0', grad_fn=<ToCopyBackward0>)
>>> model(torch.tensor([[28994], [28994]]).to('cuda')).logits
tensor([[[-11.6250, -11.3125,   3.1406,  ...,  -5.7812,  -2.6562,  -1.3516]],

        [[-11.6250, -11.3125,   3.1406,  ...,  -5.7812,  -2.6562,  -1.3516]]],
       device='cuda:0', grad_fn=<ToCopyBackward0>)

Model on GPU with float16 gives the same results:

>>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype=torch.float16, device_map='cuda')
>>> model(torch.tensor([[28994]]).to('cuda')).logits
tensor([[[-11.6406, -11.3828,   3.2852,  ...,  -5.7617,  -2.6348,  -1.3047]]],
       device='cuda:0', grad_fn=<ToCopyBackward0>)
>>> model(torch.tensor([[28994], [28994]]).to('cuda')).logits
tensor([[[-11.6406, -11.3828,   3.2852,  ...,  -5.7578,  -2.6211,  -1.2930]],

        [[-11.6406, -11.3828,   3.2852,  ...,  -5.7578,  -2.6211,  -1.2930]]],
       device='cuda:0', grad_fn=<ToCopyBackward0>)

So the problem is only on GPU with bfloat16. Using:

>>> torch.use_deterministic_algorithms(True)
>>> torch.backends.cudnn.benchmark = False

does not help.
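
(An aside, in case someone retries the deterministic route: on CUDA >= 10.2, torch.use_deterministic_algorithms(True) also requires the cuBLAS workspace to be pinned, e.g.

>>> import os
>>> os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # needed for deterministic cuBLAS ops
>>> import torch
>>> torch.use_deterministic_algorithms(True)

but even then determinism is only run-to-run; it does not make results identical across batch sizes, since a different batch shape can dispatch a different kernel.)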

LSinev commented 8 months ago

> So the problem is only on GPU with bfloat16

So the problem actually lies in the model implementation somewhere in huggingface transformers, or in some specific torch modules, and might be sorted out in their repos if an issue is filed, not within the lm-evaluation-harness repository?

djstrong commented 8 months ago

Why do you think it is a problem with the model implementation? But yes, it is not related to the lm-evaluation-harness repository. Maybe it is some GPU optimization (cuBLAS?).

haileyschoelkopf commented 8 months ago

Nice bug hunting!

I think this is related to the lower precision of bfloat16 compared to float16. So doing adds in a different order somewhere (because a different batch size launches a different kernel) causes errors due to the non-associativity of floating-point math.
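
A minimal sketch of that point, using scalar bfloat16 adds on CPU (just an illustration of ordering effects, not the actual kernels involved):

>>> import torch
>>> a = torch.tensor(256.0, dtype=torch.bfloat16)
>>> b = torch.tensor(1.0, dtype=torch.bfloat16)
>>> (a + b) + b   # 256 + 1 rounds back to 256 (the spacing between values here is 2), then again
tensor(256., dtype=torch.bfloat16)
>>> a + (b + b)   # 1 + 1 = 2 is exact, and 256 + 2 = 258 is representable
tensor(258., dtype=torch.bfloat16)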

LSinev commented 8 months ago

> lower precision of bfloat16 compared to float16. So doing adds in a different order somewhere (because a different batch size launches a different kernel) causes errors due to the non-associativity of floating-point math

If this is the case, it should show up in the same test procedure with several models, not just Mistral. It would be great if someone could confirm that.

Update: this may be helpful (with its sublinks too): https://github.com/huggingface/transformers/issues/28732

djstrong commented 8 months ago

I see the same issue with meta-llama/Llama-2-7b-chat-hf.

Maybe it is resolved in a newer cuBLAS: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cublas-release-12-3-update-1 ? I am using CUDA 12.1, cuBLAS 12.1.3.1.