From experience: the basic int4 weight-only mode (not GPTQ or AWQ) should not be used. Even the 70B model generated nonsense with it. Try the int8 mode instead; maybe it still fits. LLM runtimes in general have limited support for the older Volta and Turing generations. Maybe give llama.cpp a try. It wasn't particularly fast compared to TensorRT-LLM, but it's easy to use and its 5-bit quantization mode is quite accurate.
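If it helps, this is roughly how an int8 weight-only build could be requested with the TensorRT-LLM llama example instead of int4. The script path and the --use_weight_only / --weight_only_precision flags are assumptions based on the examples/llama/build.py shipped around this time, and the model/engine paths are hypothetical, so please check them against your TensorRT-LLM version.

```python
# Hedged sketch: invoke the TensorRT-LLM llama example build with int8
# weight-only quantization instead of int4. The script path and the
# --use_weight_only / --weight_only_precision flags are assumptions about the
# examples/llama/build.py of this release; verify against your version.
import subprocess

build_cmd = [
    "python", "examples/llama/build.py",         # assumed path inside the TensorRT-LLM repo
    "--model_dir", "/models/llama-2-7b-hf",      # hypothetical HF checkpoint location
    "--dtype", "float16",
    "--use_weight_only",                         # enable weight-only quantization
    "--weight_only_precision", "int8",           # int8 instead of int4
    "--output_dir", "/engines/llama-2-7b-int8",  # hypothetical engine output dir
]
subprocess.run(build_cmd, check=True)
```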
Hi folks, recently I carried out a test that I'd like to share with all of you.
Hypothesis: Llama2 int4 weight-only quantization should work across all architectures (SM70, SM75, SM80, SM86, SM89, SM90).
Results:
T4 (SM75) int4 TRT-LLM backend produces incorrect output.
T4 (SM75) fp16 TRT-LLM backend produces correct output.
V100 (SM70) int4 TRT-LLM backend produces correct output.
V100 (SM70) fp16 TRT-LLM backend produces correct output.
A10G (SM86) int4 TRT-LLM backend produces correct output.
A10G (SM86) fp16 TRT-LLM backend produces correct output.
A100 (SM80) int4 TRT-LLM backend produces correct output.
A100 (SM80) fp16 TRT-LLM backend produces correct output.
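For anyone reproducing this, one simple way to label a run as correct or incorrect is to compare each quantized build's generation for a fixed prompt against the fp16 reference. The snippet below is only a sketch of that comparison; the reference string, the outputs dictionary, and the similarity threshold are all placeholders for whatever your runs actually produce.

```python
# Minimal sanity-check sketch: compare each quantized build's generation
# against the fp16 reference for the same fixed prompt. The reference text,
# the outputs dict, and the 0.8 threshold are hypothetical placeholders.
from difflib import SequenceMatcher

fp16_reference = "The capital of France is Paris. It is known for ..."
outputs = {
    "T4 int4": "tra tra tra tra ...",  # placeholder for a garbled generation
    "V100 int4": "The capital of France is Paris. It is known for ...",
}

for name, text in outputs.items():
    similarity = SequenceMatcher(None, fp16_reference, text).ratio()
    verdict = "correct" if similarity > 0.8 else "incorrect"  # crude threshold
    print(f"{name}: similarity={similarity:.2f} -> {verdict}")
```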
I hope this report will be helpful.
@matichon-vultureprime Thanks for reporting this. We haven't thoroughly validated TensorRT-LLM on Turing hardware.
Let me bring this to our product team's attention first.
Hello folks,
I am looking to build Llama 7B with int4 weight-only quantization and serve it via Triton. I attempted to build it and verify whether the int4 output is correct.
However, when I built it with use_inflight_batching and paged_kv_cache and served it via Triton, I got a different output from what I previously had. I have provided my build steps and output below (a sketch for comparing the two deployments follows the list).
1.
2.
3.
When I convert to inflight batching and paged KV cache:
1.
2. Start script.
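One way to pin down the divergence is to send the same prompt to both Triton deployments and diff the responses. The sketch below uses tritonclient against the ensemble model; the model name and the tensor names ("text_input", "max_tokens", "text_output"), as well as the two endpoint addresses, are assumptions about a typical tensorrtllm_backend config and should be checked against your config.pbtxt.

```python
# Hedged sketch: query a Triton deployment of the TensorRT-LLM backend and
# return the generated text, so the inflight-batching / paged-KV-cache build
# can be diffed against the original build. Model name and tensor names
# ("ensemble", "text_input", "max_tokens", "text_output") are assumptions
# about the tensorrtllm_backend ensemble config; adjust to your config.pbtxt.
import numpy as np
import tritonclient.http as httpclient

def generate(url: str, prompt: str, max_tokens: int = 64) -> str:
    client = httpclient.InferenceServerClient(url=url)

    text_in = httpclient.InferInput("text_input", [1, 1], "BYTES")
    text_in.set_data_from_numpy(np.array([[prompt.encode()]], dtype=object))

    tokens_in = httpclient.InferInput("max_tokens", [1, 1], "INT32")
    tokens_in.set_data_from_numpy(np.array([[max_tokens]], dtype=np.int32))

    result = client.infer(
        "ensemble",
        [text_in, tokens_in],
        outputs=[httpclient.InferRequestedOutput("text_output")],
    )
    return result.as_numpy("text_output").flatten()[0].decode()

# Same prompt against both servers; any divergence points at the
# inflight-batching / paged KV cache build rather than the prompt.
prompt = "What is the capital of France?"
print("baseline :", generate("localhost:8000", prompt))
print("inflight :", generate("localhost:8001", prompt))  # hypothetical second endpoint
```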