NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

return-generation-logits bug when fp8 enabled #2088

Open binhtranmcs opened 1 month ago

binhtranmcs commented 1 month ago

I am running a llama3 model on an RTX 4090 with fp8 quantization. In the result, outputTokenIds seems to be correct, but the generationLogits are all wrong. I also tested the same model without quantization and the returned logits were all correct, so I suspect something goes wrong when returning the logits with fp8 enabled.

How I tested: I deployed the model with tritonserver using tensorrtllm_backend, and modified the BLS backend slightly to also return the softmax of the generationLogits along with the generated tokens. I made a call using client.txt and got the result in log.txt.

Command to run the client:

python3 client.py -p "hello how are you" \
  --model-name tensorrt_llm_bls \
  --request-id testid \
  --verbose \
  -o 10 \
  --return-generation-logits
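For reference, this is roughly the check I run on the returned generationLogits. The shapes and variable names here are my assumptions, not the exact BLS code: I treat generationLogits as [beam_width, output_len, vocab_size] and outputTokenIds as [beam_width, output_len].

import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the vocab dimension.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def check_logits(generation_logits, output_token_ids):
    probs = softmax(generation_logits)            # [beam, len, vocab]
    greedy_ids = np.argmax(probs, axis=-1)        # [beam, len]
    # With greedy sampling, the argmax of each step's logits should match
    # the token that was actually emitted.
    agreement = (greedy_ids == output_token_ids).mean()
    print(f"argmax/token agreement: {agreement:.2%}")
    # Probability assigned to each emitted token.
    token_probs = np.take_along_axis(
        probs, output_token_ids[..., None], axis=-1).squeeze(-1)
    print("probability of each emitted token:", token_probs)
    return probs

Without quantization the argmax matches the emitted tokens; with the fp8 engine it does not.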

Please have a look. Thanks in advance!

github-actions[bot] commented 1 week ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.

binhtranmcs commented 6 days ago

Any update on this? Another data point: when running with examples/run.py, the output generation logits are correct. Command to run:

python3 examples/run.py \
  --engine_dir ENGINE_DIR \
  --tokenizer_dir TOKENIZER_DIR \
  --max_output_len 10 \
  --output_logits_npy \
  --input_text "hello how are you"
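For comparison, here is a rough sketch of how I line the two results up. The file names are placeholders: run.py dumps its logits to an .npy file, and I extract the Triton-side logits from log.txt into another .npy beforehand. I also assume both arrays are reshaped to the same [beam_width, output_len, vocab_size] layout.

import numpy as np

run_py_logits = np.load("run_py_generation_logits.npy")   # placeholder path
triton_logits = np.load("triton_generation_logits.npy")   # placeholder path

# With greedy decoding, both argmax sequences should reproduce the output tokens.
print("run.py argmax tokens:", np.argmax(run_py_logits, axis=-1))
print("triton argmax tokens:", np.argmax(triton_logits, axis=-1))

# Elementwise difference between the two sets of logits.
diff = np.abs(run_py_logits.astype(np.float32) - triton_logits.astype(np.float32))
print("max abs difference:", diff.max())

The run.py logits give the expected tokens, while the fp8 Triton/BLS logits do not.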