NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Repeated outputs for long input tasks on Llama 3 70B compared to vLLM and HF's transformers #1788

Closed: DreamGenX closed this issue 3 months ago

DreamGenX commented 5 months ago

Reproduction

I built TensorRT-LLM engine in several different ways, outlined below, and compared the output quality on domain specific task that involves long inputs (typically >>2000 input tokens and >500 output tokens).

The outputs from TensorRT-LLM (obtained through running the run.py script, as well as through running the GptManager in all different modes: V1, InflightBatching, InflightFusedBatching) exhibit repetition in the outputs ~20% of the time (sample outputs below).

When running the same with vLLM, using the same sampling params (namely temperature, presencePenalty and frequencyPenalty), the outputs do not exhibit these repetitive patterns.
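
For reference, a minimal sketch of what the vLLM side of such a comparison looks like with the offline LLM API; the model path, tensor-parallel size, and concrete penalty values below are illustrative placeholders, not the exact ones used in this report:

import vllm

# Illustrative values only; the report states that temperature, presence
# penalty, and frequency penalty were matched between the two stacks.
sampling = vllm.SamplingParams(
    temperature=1.0,
    presence_penalty=0.1,
    frequency_penalty=0.1,
    max_tokens=4096,
)

# Path and tp_size mirror the TensorRT-LLM build shown further below.
llm = vllm.LLM(model="/workspace/llama3-70b", tensor_parallel_size=4)

outputs = llm.generate(["<long story prompt, typically >>2000 tokens>"], sampling)
print(outputs[0].outputs[0].text)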

Here are some of the ways I tried to build the TensorRT-LLM engine:

One concrete example:

python convert_checkpoint.py \
--model_dir /workspace/llama3-70b \
--output_dir /workspace/llama3-70b-bf16-tp4 \
--dtype bfloat16 \
--tp_size 4

trtllm-build \
--checkpoint_dir /workspace/llama3-70b-bf16-tp4 \
--output_dir /workspace/llama3-70b-bf16-tp4-engine \
--gpt_attention_plugin bfloat16 \
--gemm_plugin bfloat16 \
--use_custom_all_reduce disable \
--max_num_tokens 16384 \
--max_batch_size 24 \
--max_input_len 8192 \
--max_output_len 4096

I also tried running sequentially without batching, and even building the engine with max_batch_size 1, to rule out batching-related bugs (I saw there were a few before). I also once tried building with max_input_len 7424 and max_output_len 768 to rule out somehow messing up the RoPE (I'm not sure whether max_input_len and max_output_len actually affect that).

Expected behavior

The outputs should not loop this frequently; there is likely some inference inaccuracy or mismatch.

Actual behavior

The input would usually be some part of a story + instruction to continue the story. This is an example of an output.

 She looks up when she hears me set down her drink.

“Martini,” I say with a smile.

She smiles back at me with her eyes this time.

“Thank you,” she says.

I don’t know what it is about her voice that makes me feel like she’s saying something else entirely.

I don’t know what it is about her voice that makes me feel like she’s saying something else entirely.

I don’t know what it is about her voice that makes me feel like she’s saying something else entirely.

I don’t know what it is about her voice that makes me feel like she’s saying something else entirely.

I don’t know what it is about her voice that makes me feel like she’s saying something else entirely.

I don’t know what it is about her voice that makes me feel like she’s saying something else entirely.

The repetition is usually at a sentence level like this, but sometimes also several sentences repeat.

Additional notes

I am wondering if anyone else has experienced similar issues, and whether someone has done a recent analysis comparing TensorRT-LLM to other inference stacks. I saw that most tests are restricted to short inputs and outputs, like MMLU, which might not surface these issues.

nv-guomingz commented 5 months ago

What was the input that produced the output tokens above?

MagicRUBICK commented 5 months ago

Add the argument '--rotary_base 500000.0' at the checkpoint conversion step for Llama 3. MMLU score for llama3-70b: 0.788.

DreamGenX commented 5 months ago

@MagicRUBICK I believe that should be inferred from the HF config when converting the checkpoint. Was that not your experience? Could you share your config.json? Here's mine when I did not use --rotary_base (https://github.com/NVIDIA/TensorRT-LLM/issues/1780#issue-2351680259); you can see it has pretrained_config > rotary_base: 500000.0 anyway.
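
One quick way to sanity-check that, sketched below; the paths reuse the directories from the build commands above, and the exact layout of the generated config.json may differ between TensorRT-LLM versions:

import json
from pathlib import Path

# Paths reuse the directories from the convert/build commands above (adjust as needed).
hf_config = json.loads(Path("/workspace/llama3-70b/config.json").read_text())
engine_config = json.loads(Path("/workspace/llama3-70b-bf16-tp4-engine/config.json").read_text())

# Llama 3 uses a RoPE base of 500000.0. HF stores it as rope_theta; the built
# engine's config should carry it under pretrained_config > rotary_base.
print("HF rope_theta:      ", hf_config.get("rope_theta"))
print("TRT-LLM rotary_base:", engine_config.get("pretrained_config", {}).get("rotary_base"))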

ZihanLiao commented 4 months ago

Facing the same issue.

netanel-haber commented 4 months ago

@ZihanLiao, can you elaborate please?

netanel-haber commented 4 months ago

Hello @DreamGenX - I've been working on replicating your issue - namely, by first running generation on HF, as you mentioned in the title.

Note: I assumed you were using llama3-70b - not the llama3-70b-instruct. Please correct me if I'm wrong.

I wrote a small script that replicates your environment, based on what I inferred from your report (see below). Currently, running close to the default generation config, with a very large max_new_tokens=4096 and a relatively low temperature=0.1, commonly produces looping text similar to your output above (passing a high repetition_penalty does rectify this). So I'm currently under the assumption that this behavior comes from the model, not from trtllm per se.

Maybe you can elaborate on the exact setup in which HF and trtllm differ significantly.

Replication Script

nvidia-docker run -v <Your Path>:/workspace/model:rw -it nvcr.io/nvidia/pytorch:24.06-py3 bash
cd ~
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM/
git checkout db4edea1e1359bcfcac7bbb87c1b639b5611c721 # The commit SHA from the later release you linked to above: https://github.com/NVIDIA/TensorRT-LLM/pull/1763 
cd ..
python -m pip install virtualenv
python -m virtualenv .venv
source .venv/bin/activate
python -m pip install -r TensorRT-LLM/examples/llama/requirements.txt
python script.py

script.py:

import transformers
import torch
from pathlib import Path

REPETITION_PENALTY = 1.0 # This is the default repetition_penalty in transformers, and it means no penalty - see image below
model_path = Path("/workspace/llama3-70b")

pipeline = transformers.pipeline("text-generation", model=model_path, model_kwargs={"torch_dtype": torch.bfloat16}, device_map="auto")

result = pipeline("Hey how are you doing today?", temperature=0.1, max_new_tokens=4096, repetition_penalty=REPETITION_PENALTY)
print(result)

[Image: excerpt from the transformers docs showing the default `repetition_penalty=1.0`, i.e. no penalty]

DreamGenX commented 4 months ago

Hi @netanel-haber, thank you for looking into this. I am using a custom fine-tune of the Llama 3 70B Instruct model. The favorable results were with vLLM, using:

To make the comparison fair, I did not use samplers like min-p, which TensorRT-LLM does not support.

netanel-haber commented 4 months ago

I see - is there public access to said finetune?

DreamGenX commented 4 months ago

@netanel-haber I can share access if you provide your HF username.

netanel-haber commented 4 months ago

Sure:

nhaber@nvidia.com

Nvidia-NetanelHaber

Thank you.

DreamGenX commented 4 months ago

Awesome, shared together with some example inputs (not all will trigger repetition; with TRT, roughly 15-30% should).

netanel-haber commented 4 months ago

Received, thanks!

netanel-haber commented 4 months ago

Hey - sorry for the delay.

I hope I'm not missing something trivial/critical here - I suspect the discrepancy may be due to greedy sampling when running TRTLLM.

TRTLLM

When using run.py, the default is top_k=1, i.e. greedy sampling:

parser = add_common_args(parser) -> parser.add_argument('--top_k', type=int, default=1).

vLLM

vLLM, on the other hand, defaults to top_k=-1, which considers all tokens.

I ran vLLM with your fine-tune, on the sample inputs you provided, with temperature=0.0 for greedy sampling, and the penalties you provided.

The result showed numerous occurrences of looping similar to what you provided above. (I can provide outputs privately or publicly, if you prefer).
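
To make the contrast concrete, a small sketch of the two sampling configurations in vLLM terms; the penalty values are the ones quoted later in this thread, the rest are the stated defaults:

from vllm import SamplingParams

# The effective TRT-LLM run.py default (top_k=1) is greedy decoding; this is
# the vLLM configuration used for the greedy comparison run described above.
greedy = SamplingParams(
    temperature=0.0,          # greedy
    presence_penalty=0.1,     # penalty values as quoted later in this thread
    frequency_penalty=0.1,
    max_tokens=4096,
)

# vLLM's own defaults, by contrast, keep top_k=-1 (no top-k truncation) and
# temperature=1.0, so out of the box the two stacks do not sample alike.
vllm_default = SamplingParams()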


My one hesitation is that you mentioned:

...as well as through running the GptManager in all different modes: V1, InflightBatching, InflightFusedBatching)...

Could you provide more context as to how you ran the GptManager (i.e., provide an actual snippet, if you don't mind)? Since there are many possible entry points for generation, I had trouble establishing for a fact that all of your TensorRT-LLM generations used top_k=1.

Let me know if this makes sense to you!

DreamGenX commented 4 months ago

@netanel-haber No worries.

I did several runs; one was top_k=50, top_p=0.9, and it also had that issue. But it's possible that something has changed in the meantime. Could you please share which commit/version of TRT-LLM you used and how you built your engine?

netanel-haber commented 4 months ago

Hey - I wanted to double-check before getting back to you. Altogether, I ran 5 generations on the entire dataset of 100 sample inputs you provided, all with **max_output_len=4096**, frequency_penalty=0.1, presence_penalty=0.1:

vLLM

  1. top_k=50, top_p=0.9:
    1. temperature=1.0, tp_size=4
    2. temperature=1.0, tp_size=2
    3. (For good measure: temperature=0.1, tp_size=2)
  2. The greedy vllm run [temperature=0.0] mentioned above

TRTLLM

  1. Identical to vLLM.1.1: top_k=50, top_p=0.9, temperature=1.0, tp_size=4

I'll upload all 5 to your private HF repo [Every output is delimited by "&"*36]

Conclusion

Discounting the greedy generation and just looking at the runs with similar params (temp=1.0, top_k=50, top_p=0.9), I currently find it difficult to definitively determine which runtime/artifact generates text that tends to loop "more" or "worse". Both show a not-insignificant amount of looping, especially as the outputs grow longer.

As the favorable framework isn't obvious to me in this case (also given the non-deterministic generation for an arbitrary trio of sensible top_k/top_p/temperature values), I think the only way forward, if you feel dissatisfied with my analysis, would be for you to provide a more rigorous, quantitative comparison - e.g. via standard benchmarks such as MMLU.
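
Short of full benchmarks, a minimal, hypothetical loop-rate metric over the uploaded output files could also serve as a quantitative comparison; it assumes one file per run with outputs delimited by "&"*36, as described above, and flags an output if any sentence repeats three or more times in a row:

import re
import sys

DELIM = "&" * 36  # output delimiter used for the uploaded files

def looping_rate(path: str, min_repeats: int = 3) -> float:
    """Fraction of outputs in `path` that repeat a sentence `min_repeats`+ times in a row."""
    text = open(path, encoding="utf-8").read()
    outputs = [o.strip() for o in text.split(DELIM) if o.strip()]

    def loops(output: str) -> bool:
        sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", output) if s.strip()]
        run = 1
        for prev, cur in zip(sentences, sentences[1:]):
            run = run + 1 if cur == prev else 1
            if run >= min_repeats:
                return True
        return False

    return sum(loops(o) for o in outputs) / max(len(outputs), 1)

if __name__ == "__main__":
    for path in sys.argv[1:]:
        print(f"{path}: {looping_rate(path):.0%} of outputs loop")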


How I built the TRTLLM engine, since you asked:

  1. Checked out db4edea1e1359bcfcac7bbb87c1b639b5611c721 - the later of the two releases you mentioned building with.
  2. Converted and built using your exact snippets above, in the original question.

Thanks for the patience.

DreamGenX commented 4 months ago

@netanel-haber Thank you for sharing your results. I will try to redo the experiments on my side -- since you can't reproduce the discrepancy, it could be that I missed some other variable between the setups.

Thanks again for your time.

Naveassaf commented 3 months ago

Closing due to inactivity. @DreamGenX, feel free to reopen or create a separate issue if the problem persists with the changes @netanel-haber suggested.