binhtranmcs opened this issue 1 month ago
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.
Any update on this? Another insight is that when running with `examples/run.py`, the output generation logits are correct.
Command to run:

```shell
python3 examples/run.py \
    --engine_dir ENGINE_DIR \
    --tokenizer_dir TOKENIZER_DIR \
    --max_output_len 10 \
    --output_logits_npy \
    --input_text "hello how are you"
```
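Since the logits from this run can be saved as `.npy` files, the quantized and non-quantized outputs can be diffed directly. A minimal sketch of how I compare them, assuming both an FP8 and a non-quantized run produced a logits file (the file names below are placeholders, not the actual output paths):

```python
# Sketch: compare the generation logits saved by run.py for the FP8 engine
# against a non-quantized reference run. File names are placeholders.
import numpy as np

fp8 = np.load("logits_fp8.npy").astype(np.float32)
ref = np.load("logits_fp16.npy").astype(np.float32)

diff = np.abs(fp8 - ref)
print("max abs diff: ", diff.max())
print("mean abs diff:", diff.mean())

# If only the returned logits are corrupted (tokens are fine), the argmax
# of both runs should still largely agree with the generated outputTokenIds.
print("argmax agreement:", (fp8.argmax(-1) == ref.argmax(-1)).mean())
```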
I am running a Llama 3 model on an RTX 4090 with FP8 quantization. In the result, `outputTokenIds` seems to be correct, but the `generationLogits` are all wrong. I also tested the same model without quantization and the returned logits were all correct, so I guess there is something wrong when returning the logits with FP8 enabled.

How I tested: I deployed the model using tritonserver with tensorrtllm_backend. I changed the BLS backend a bit to get the softmax of the `generationLogits` as well as the generated tokens. I made a call using client.txt and got the result in log.txt.
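For reference, the softmax I take is the standard numerically stable one. Here is a minimal standalone sketch; the variable names and tensor shape are illustrative, not the actual backend code:

```python
import numpy as np

def softmax(logits: np.ndarray, axis: int = -1) -> np.ndarray:
    # Subtract the per-row max before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

# Stand-in for the generationLogits tensor; the shape
# (beam_width, output_len, vocab_size) is an assumption for illustration.
generation_logits = np.random.randn(1, 10, 32000).astype(np.float32)

probs = softmax(generation_logits)
print(probs.sum(axis=-1))     # each position should sum to ~1.0
print(probs.argmax(axis=-1))  # top token ids, to compare against outputTokenIds
```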
Command to run the client:

```shell
python3 client.py -p "hello how are you" --model-name tensorrt_llm_bls --request-id testid --verbose -o 10 --return-generation-logits
```
Please have a look. Thanks in advance!