bleedingfight opened this issue 2 days ago
Hi @bleedingfight, do you mean that HF Transformers FP16 > HF Transformers int4 > TensorRT-LLM int4 for the LLaMA3 70B model? Could you please share how you ran HF Transformers int4 on the LLaMA3 70B model?
@QiJune My model is a multimodal model, which is slightly different from a pure LLM: the input to the LLM is not input_ids but a relatively long inputs_embeds, and after the first token is generated, the remaining steps use input_ids as usual. Because the 70B fp16 model cannot be loaded on a single card, I tested on my evaluation set; the results are approximately fp16 ≈ 1.1 s, int8 ≈ 1.4 s, int4 ≈ 1.4 s (all on 8× A100; the output is just "yes" or "no"). I didn't run a very detailed test, but the initial result is that TensorRT-LLM is very slow with both AWQ and GPTQ: on my evaluation set, producing one result takes about 8 seconds, while vLLM with AutoAWQ takes only about 1 second. The results under Transformers are similar; although the machine differs (A100), the speed is about 1 second.
The three setups:
- transformers: load the model with --load_4bit.
- vllm: HF model -> AutoAWQ model (4-bit), on 2× L40S.
- tensorrt-llm (1× L40S): HF model -> quantize with int8_sq, w4a8_awq, or int4_awq (the output is wrong, see #1860), and HF model -> auto-gptq (4-bit) -> TensorRT engine (the output is better than AWQ but still not good).
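Roughly, the transformers 4-bit baseline looks like this (a minimal sketch; the model path, embedding length, and generation settings are placeholders, not my exact code):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Load the LLM in 4-bit via bitsandbytes (what --load_4bit maps to).
# The model path is a placeholder.
model_path = "meta-llama/Meta-Llama-3-70B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
    device_map="auto",
)

# The multimodal prefill feeds inputs_embeds (ViT features projected by
# my_mmproject plus text embeddings) instead of input_ids; random values
# here just stand in for those features.
inputs_embeds = torch.randn(
    1, 1024, model.config.hidden_size, dtype=torch.float16, device=model.device
)

start = time.time()
output_ids = model.generate(inputs_embeds=inputs_embeds, max_new_tokens=4)
print(f"transformers 4-bit cost time: {time.time() - start:.3f}s")
```

The vLLM number comes from the same model quantized with AutoAWQ and loaded via `LLM(model=..., quantization="awq", tensor_parallel_size=2)`.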
System Info
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
My model is a multimodal model: ViT + my_mmproject + LLaMA3 (LLaVA-1.5 architecture).
cost_time measures only the TensorRT-LLM generate method: cost time: 8.144946575164795 s; vLLM: ~2 s.
convert script:
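A hypothetical reconstruction of the convert step (the usual TensorRT-LLM GPTQ flow, scripted from Python; all paths are placeholders and the flags follow the examples/llama/convert_checkpoint.py GPTQ recipe and trtllm-build, so treat them as assumptions):

```python
import subprocess

# Step 1: convert the HF model + GPTQ weights into a TensorRT-LLM checkpoint.
subprocess.run(
    [
        "python", "examples/llama/convert_checkpoint.py",
        "--model_dir", "./llama3-70b-hf",
        "--output_dir", "./tllm_ckpt_gptq",
        "--dtype", "float16",
        "--quant_ckpt_path", "./llama3-70b-gptq/gptq_model-4bit-128g.safetensors",
        "--use_weight_only",
        "--weight_only_precision", "int4_gptq",
        "--per_group",
    ],
    check=True,
)

# Step 2: build the engine from the converted checkpoint.
subprocess.run(
    [
        "trtllm-build",
        "--checkpoint_dir", "./tllm_ckpt_gptq",
        "--output_dir", "./llama3-70b-engine",
        "--gemm_plugin", "float16",
    ],
    check=True,
)
```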
run.py snippet:
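A minimal stand-in for the run.py snippet, using tensorrt_llm.runtime.ModelRunner and timing generate() the same way cost_time is measured above; the real run feeds the long multimodal prefix, while this sketch uses placeholder token ids:

```python
import time
import torch
from tensorrt_llm.runtime import ModelRunner

# Load the built engine; the directory is a placeholder.
runner = ModelRunner.from_dir(engine_dir="./llama3-70b-engine")

# Placeholder token ids; my real input is a long multimodal prefix.
batch_input_ids = [torch.randint(0, 128000, (1024,), dtype=torch.int32)]

start = time.time()
outputs = runner.generate(
    batch_input_ids,
    max_new_tokens=4,
    end_id=128001,  # placeholder eos id
    pad_id=128001,
)
torch.cuda.synchronize()
print(f"cost time:{time.time() - start}")
```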
llama3 to gptq:
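A minimal sketch of the auto-gptq quantization step, following AutoGPTQ's basic usage; the paths and the one-sentence calibration set are placeholders:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_path = "./llama3-70b-hf"        # placeholder path
quantized_path = "./llama3-70b-gptq"  # placeholder path

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)

# A real run needs a proper calibration set; one sentence is only for illustration.
examples = [tokenizer("GPTQ calibration text goes here.")]

quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit weights
    group_size=128,
    desc_act=False,
)

model = AutoGPTQForCausalLM.from_pretrained(model_path, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_path, use_safetensors=True)
```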
Expected behavior

Better than (or at least comparable to) transformers bitsandbytes (4-bit).

actual behavior

Much slower.
additional notes
I don't know whether GPTQ should, algorithmically, be faster than transformers. In my actual tests, transformers fp16 is faster than its int4, and TensorRT-LLM is even slower than the transformers 4-bit path. Is this normal?