Anindyadeep opened this issue 8 months ago
You need to compile a new engine if you want to run inference under another precision.
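For example, a float16 engine and a float32 engine are two independent builds, each with its own output directory. A minimal sketch, assuming the `build.py` interface from the llama example (exact flags vary across releases, and all paths here are placeholders):

```bash
# Hypothetical sketch: one build per precision, each producing its own engine.
# Flags follow examples/llama/build.py and may differ between releases.

# float16 engine
python build.py \
    --model_dir ./llama-hf \
    --dtype float16 \
    --output_dir ./engines/llama-fp16

# float32 engine: a separate build, not a runtime typecast of the fp16 engine
python build.py \
    --model_dir ./llama-hf \
    --dtype float32 \
    --output_dir ./engines/llama-fp32
```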
I see, got it. Thanks
Oh, also a quick question: does TensorRT-LLM run int8 and int4 quantization? (I saw in the code that it does this with AWQ under the hood; correct me if I am wrong.)
Can you point me to documentation on how to build with quantization? And does this mean we need a separate build per precision?
You can find the scripts for building quantized Llama models at https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama (other models have similar documentation).
If you want to run int4-AWQ and int8-weight-only, you need to build two separate engines.
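A sketch of what those two builds might look like, again assuming the llama example's `build.py` flags (the int4-AWQ path expects a quantized checkpoint produced beforehand by the example's quantization tooling; paths are placeholders):

```bash
# Hypothetical sketch; flag names follow examples/llama/build.py and vary by release.

# Engine A: int8 weight-only quantization
python build.py \
    --model_dir ./llama-hf \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int8 \
    --output_dir ./engines/llama-int8

# Engine B: int4 AWQ (needs a pre-quantized checkpoint; some releases also
# require extra flags such as --per_group for group-wise quantization)
python build.py \
    --model_dir ./llama-hf \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int4_awq \
    --quant_ckpt_path ./llama-awq-ckpt \
    --output_dir ./engines/llama-int4-awq
```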
I have this sample script:

Now I build the engine file for `float32` precision like this:

Now, with this same precision, does the engine typecast to other precisions somewhere at runtime, or do I need to compile a separate engine for each precision?
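For context, one way to see which precision an already-built engine uses is to inspect the config.json that the build writes into the engine output directory. A hedged sketch (placeholder path; the exact field names vary across TensorRT-LLM releases):

```bash
# Hypothetical check: the build writes a config.json into the engine output
# directory that records the build-time precision. Field names vary across
# releases, so search for anything dtype- or precision-like.
python -m json.tool ./engines/llama-fp32/config.json | grep -iE '"(dtype|precision)"'
```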