What was your input for the above output tokens?
Add an argument to the checkpoint conversion step: '--rotary_base 500000.0' for Llama 3. MMLU score for llama3-70b: 0.788.
@MagicRUBICK I believe that should be inferred from the HF config when converting the checkpoint. Was that not your experience? Could you share your config.json? Here's mine from when I did not use --rotary_base: https://github.com/NVIDIA/TensorRT-LLM/issues/1780#issue-2351680259; you can see it has pretrained_config > rotary_base: 500000.0 anyway.
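For anyone double-checking their own conversion: a quick way to confirm the value that convert_checkpoint.py should pick up is to read rope_theta straight from the HF config (a minimal sketch; the path is a placeholder):

import json
from pathlib import Path

hf_dir = Path("/workspace/llama3-70b")  # placeholder path to the HF checkpoint
config = json.loads((hf_dir / "config.json").read_text())

# Llama 3 ships rope_theta=500000.0 in config.json; after conversion it should
# appear as pretrained_config > rotary_base in the TRT-LLM checkpoint config.
print("rope_theta:", config.get("rope_theta"))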
Facing the same issue.
@ZihanLiao, can you elaborate please?
Hello @DreamGenX - I've been working on replicating your issue - namely, by first running generation on HF, as you mentioned in the title.
Note: I assumed you were using llama3-70b - not the llama3-70b-instruct. Please correct me if I'm wrong.
I wrote a small script that replicates your environment, based on what I inferred from your report, see below.
Currently, I see that running almost the default generation config, with a very large max_new_tokens=4096 and a relatively low temperature=0.1, commonly produces looping text similar to your output above [passing a high repetition_penalty does rectify this] - so I'm currently under the assumption that this behavior is due to the model, not trtllm per se.
Maybe you can elaborate on the exact reproduction steps needed to reach a point where HF and trtllm differ significantly.
nvidia-docker run -v <Your Path>:/workspace/model:rw -it nvcr.io/nvidia/pytorch:24.06-py3 bash
cd ~
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM/
git checkout db4edea1e1359bcfcac7bbb87c1b639b5611c721 # The commit SHA from the later release you linked to above: https://github.com/NVIDIA/TensorRT-LLM/pull/1763
cd ..
python -m pip install virtualenv
python -m virtualenv .venv
source .venv/bin/activate
python -m pip install -r TensorRT-LLM/examples/llama/requirements.txt
python script.py
script.py:
import transformers
import torch
from pathlib import Path

REPETITION_PENALTY = 1.0  # This is the default repetition_penalty in transformers, and it means no penalty
model_path = Path("/workspace/llama3-70b")  # placeholder path to the HF model

pipeline = transformers.pipeline("text-generation", model=str(model_path), model_kwargs={"torch_dtype": torch.bfloat16}, device_map="auto")
result = pipeline("Hey how are you doing today?", temperature=0.1, max_new_tokens=4096, repetition_penalty=REPETITION_PENALTY)
print(result)
Hi @netanel-haber thank you for looking into this. I am using a custom fine tune of the Llama 3 70B instruct model. The favorable results were with vLLM, using:
To make the comparison fair, I did not use samplers like min-p, which TensorRT-LLM does not support.
I see - is there public access to said finetune?
@netanel-haber I can share access if you provide your HF username.
Sure:
nhaber@nvidia.com
Nvidia-NetanelHaber
Thank you.
Awesome, shared together with some example inputs (not all will trigger repetition, but with TRT roughly 15-30% should).
Received, thanks!
Hey - sorry for the delay.
I hope I'm not missing something trivial/critical here - I suspect the discrepancy may be due to greedy sampling when running TRTLLM.
When using run.py, the default is top_k=1, i.e. greedy sampling:
parser = add_common_args(parser) -> parser.add_argument('--top_k', type=int, default=1)
vLLM, on the other hand, defaults to top_k=-1, which considers all tokens.
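For completeness, this is roughly how that greedy default can be overridden when invoking run.py, so that the TRT-LLM run samples the way the vLLM run does (a sketch only; the engine/tokenizer paths are placeholders, and the sampling flags are the ones exposed via add_common_args in this revision):

import subprocess

# Placeholder paths; the sampling flags override run.py's default of top_k=1.
subprocess.run([
    "python3", "TensorRT-LLM/examples/run.py",
    "--engine_dir", "/workspace/engine",
    "--tokenizer_dir", "/workspace/llama3-70b",
    "--max_output_len", "4096",
    "--temperature", "1.0",
    "--top_k", "50",
    "--top_p", "0.9",
    "--input_text", "Hey how are you doing today?",
], check=True)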
I ran vLLM with your fine-tune, on the sample inputs you provided, with temperature=0.0 for greedy sampling, and the penalties you provided.
The result showed numerous occurrences of looping similar to what you provided above. (I can provide outputs privately or publicly, if you prefer).
One hesitation is because you mentioned:
...as well as through running the GptManager in all different modes: V1, InflightBatching, InflightFusedBatching)...
Could you provide more context as to how you ran the GptManager (i.e., provide an actual snippet, if you don't mind)? Since there are many possible entry points for generation, I had trouble establishing for a fact that all of your TensorRT-LLM generations used top_k=1.
Let me know if this makes sense to you!
@netanel-haber No worries.
I did several runs; one was top_k=50, top_p=0.9, and it also had that issue. But it's possible that something has changed in the meantime. Could you please share which commit/version of TRT-LLM you used and how you built your engine?
Hey - I wanted to double-check before getting back to you.
Altogether, I ran 5 generations on the entire dataset of 100 sample inputs you provided, all with max_output_len=4096, frequency_penalty=0.1, presence_penalty=0.1:

With top_k=50, top_p=0.9:
1. temperature=1.0, tp_size=4
2. temperature=1.0, tp_size=2
3. temperature=0.1, tp_size=2

4. temperature=0.0 (the greedy generation mentioned above)
5. top_k=50, top_p=0.9, temperature=1.0, tp_size=4

I'll upload all 5 to your private HF repo (every output is delimited by "&"*36).
Discounting the greedy generation, and just looking at the runs with similar params (temp=1.0, top_k=50, top_p=0.9), I currently find it difficult to definitively determine which runtime/artifact generates text that tends to loop "more" or "worse". Both have a not-insignificant amount of looping, as outputs tended to grow longer.
As the favorable framework isn't blatantly obvious to me in this case (also given the non-deterministic generation for an arbitrary trio of sensible top_k/top_p/temperature values), I think the only way forward, if you feel dissatisfied with my analysis, would be for you to provide a more rigorous, quantitative comparison - via standard benchmark tools such as MMLU scores, etc.
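Short of full benchmarks, a rough repetition heuristic (the n-gram size and threshold below are arbitrary assumptions, not anything used in this thread) could at least turn "loops more" into a comparable number per output set:

from collections import Counter

def looks_loopy(text: str, n: int = 8, threshold: int = 3) -> bool:
    # Flag a completion if any word n-gram repeats at least `threshold` times.
    words = text.split()
    ngrams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return any(count >= threshold for count in ngrams.values())

def loop_rate(outputs: list[str]) -> float:
    # Fraction of completions flagged as looping, e.g. loop_rate(trtllm_outputs)
    # vs loop_rate(vllm_outputs) over the same 100 sample inputs.
    return sum(looks_loopy(o) for o in outputs) / max(len(outputs), 1)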
How I built the TRTLLM engine, since you asked:
Thanks for your patience.
@netanel-haber Thank you for sharing your results. I will try to redo the experiments on my side -- since you can't reproduce the discrepancy, it could be that I missed some other variable between the setups.
Thanks again for your time.
Closing due to inactivity. @DreamGenX, feel free to reopen/create a separate issue if the problem persists with the changes @netanel-haber suggested.
Reproduction
I built the TensorRT-LLM engine in several different ways, outlined below, and compared the output quality on a domain-specific task that involves long inputs (typically >>2000 input tokens and >500 output tokens).
The outputs from TensorRT-LLM (obtained through running the run.py script, as well as through running the GptManager in all different modes: V1, InflightBatching, InflightFusedBatching) exhibit repetition ~20% of the time (sample outputs below). When running the same with vLLM, using the same sampling params (namely temperature, presencePenalty and frequencyPenalty), the outputs do not exhibit these repetitive patterns.
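For reference, a vLLM call with the sampling params named above might look roughly like the following (a sketch with placeholder model path and values, not the exact script used):

from vllm import LLM, SamplingParams

# Placeholder path and values; the point is matching the sampling params by name.
llm = LLM(model="/workspace/llama3-70b-finetune", tensor_parallel_size=4)
params = SamplingParams(
    temperature=1.0,
    presence_penalty=0.1,   # presencePenalty on the TRT-LLM side
    frequency_penalty=0.1,  # frequencyPenalty on the TRT-LLM side
    max_tokens=4096,
)
outputs = llm.generate(["<story excerpt + instruction to continue>"], params)
print(outputs[0].outputs[0].text)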
Here are some of the ways I tried to build the TensorRT-LLM engine:
- context_fmha enable/disable, and also context_fmha_fp32_acc enable/disable
- use_custom_all_reduce enable/disable
- gemm_plugin auto/disable
- presencePenalty and frequencyPenalty (unset, 0.05, 0.1, 0.3), but most tests were with 0.1 for both

One concrete example:
I also tried running sequentially without batching, and even building the engine with max_batch_size 1 to eliminate the possibility of batching-related bugs (I saw there were a few before). I also once tried building with max_input_len 7424 and max_output_len 768 to eliminate the possibility of somehow messing up the RoPE (not sure if max_input_len and max_output_len actually affect that or not).

Expected behavior
The outputs should not loop this frequently; there is likely some inference inaccuracy/mismatch.
Actual behavior
The input would usually be some part of a story + instruction to continue the story. This is an example of an output.
The repetition is usually at a sentence level like this, but sometimes also several sentences repeat.
Additional notes
I am wondering if anyone else has experienced similar issues, and whether someone has done a recent analysis comparing TensorRT-LLM to other inference stacks. I saw that most tests are restricted to short inputs and outputs like MMLU, which might not exhibit these issues.