NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

return-generation-logits bug when fp8 enabled #2088

Open binhtranmcs opened 1 month ago

binhtranmcs commented 1 month ago

I am running a llama3 model on an RTX 4090 with fp8 quantization. In the result, outputTokenIds seems to be correct, but the generationLogits are all wrong. I also tested the same model without quantization and the returned logits were all correct, so I suspect something goes wrong when returning the logits with fp8 enabled.

How I tested: I deployed the model with tritonserver using tensorrtllm_backend, and modified the BLS backend slightly to also return the softmax of the generationLogits along with the generated tokens. I made a call using client.txt and got the result in log.txt.

Command to run the client:

python3 client.py -p "hello how are you" \
  --model-name tensorrt_llm_bls \
  --request-id testid \
  --verbose \
  -o 10 \
  --return-generation-logits
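For reference, this is roughly the check I run on the returned generationLogits. The shapes and variable names here are my assumptions, not the exact BLS code: I treat generationLogits as [beam_width, output_len, vocab_size] and outputTokenIds as [beam_width, output_len].

import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the vocab dimension.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def check_logits(generation_logits, output_token_ids):
    probs = softmax(generation_logits)            # [beam, len, vocab]
    greedy_ids = np.argmax(probs, axis=-1)        # [beam, len]
    # With greedy sampling, the argmax of each step's logits should match
    # the token that was actually emitted.
    agreement = (greedy_ids == output_token_ids).mean()
    print(f"argmax/token agreement: {agreement:.2%}")
    # Probability assigned to each emitted token.
    token_probs = np.take_along_axis(
        probs, output_token_ids[..., None], axis=-1).squeeze(-1)
    print("probability of each emitted token:", token_probs)
    return probs

Without quantization the argmax matches the emitted tokens; with the fp8 engine it does not.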

Please have a look. Thanks in advance!

github-actions[bot] commented 1 week ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.

binhtranmcs commented 6 days ago

Any update on this? Another data point: when running with examples/run.py, the output generation logits are correct. Command to run:

python3 examples/run.py \
  --engine_dir ENGINE_DIR \
  --tokenizer_dir TOKENIZER_DIR \
  --max_output_len 10 \
  --output_logits_npy \
  --input_text "hello how are you"
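For comparison, here is a rough sketch of how I line the two results up. The file names are placeholders: run.py dumps its logits to an .npy file, and I extract the Triton-side logits from log.txt into another .npy beforehand. I also assume both arrays are reshaped to the same [beam_width, output_len, vocab_size] layout.

import numpy as np

run_py_logits = np.load("run_py_generation_logits.npy")   # placeholder path
triton_logits = np.load("triton_generation_logits.npy")   # placeholder path

# With greedy decoding, both argmax sequences should reproduce the output tokens.
print("run.py argmax tokens:", np.argmax(run_py_logits, axis=-1))
print("triton argmax tokens:", np.argmax(triton_logits, axis=-1))

# Elementwise difference between the two sets of logits.
diff = np.abs(run_py_logits.astype(np.float32) - triton_logits.astype(np.float32))
print("max abs difference:", diff.max())

The run.py logits give the expected tokens, while the fp8 Triton/BLS logits do not.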