NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Can I load a model with a different precision than the one it was built with? #903

Open Anindyadeep opened 8 months ago

Anindyadeep commented 8 months ago

I have this sample script:

import json
from pathlib import Path

from tensorrt_llm.runtime import ModelConfig

# engine_path points at the directory produced by build.py
engine_dir_path = Path(engine_path)
config_path = engine_dir_path / 'config.json'

# Read the config.json that build.py writes next to the engine file.
with open(config_path) as f:
    config = json.load(f)

use_gpt_attention_plugin = config["plugin_config"]["gpt_attention_plugin"]
remove_input_padding = config["plugin_config"]["remove_input_padding"]
tp_size = config["builder_config"]["tensor_parallel"]
pp_size = config["builder_config"]["pipeline_parallel"]
world_size = tp_size * pp_size

# Per-rank sizes: attention heads and the hidden size are split across tensor-parallel ranks.
num_heads = config["builder_config"]["num_heads"] // tp_size
hidden_size = config["builder_config"]["hidden_size"] // tp_size
vocab_size = config["builder_config"]["vocab_size"]
num_layers = config["builder_config"]["num_layers"]
num_kv_heads = config["builder_config"].get("num_kv_heads", num_heads)
paged_kv_cache = config["plugin_config"]["paged_kv_cache"]

# Round KV heads up so every tensor-parallel rank gets at least one.
num_kv_heads = (num_kv_heads + tp_size - 1) // tp_size

model_config = ModelConfig(
    num_heads=num_heads,
    num_kv_heads=num_kv_heads,
    hidden_size=hidden_size,
    vocab_size=vocab_size,
    num_layers=num_layers,
    gpt_attention_plugin=use_gpt_attention_plugin,
    paged_kv_cache=paged_kv_cache,
    remove_input_padding=remove_input_padding,
)
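
For reference, the precision used at build time also appears in the same config.json. I am assuming the field name below from the engines I have built myself; it may differ between versions:

# Build-time precision recorded by build.py (field name assumed; check your own config.json)
build_precision = config["builder_config"].get("precision")  # e.g. "float32"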

Now I build the engine file for float32 precision like this:

python3 ./examples/llama/build.py --model_dir /models/llama-2-7b-hf --dtype float32  --max_batch_size 1 --max_input_len 3000 --max_output_len 1024 --output_dir /tensorrt_nvidia_build

Now, given that this engine was built with one precision, does TensorRT-LLM typecast it to a different precision somewhere at load time, or do I need to compile a separate engine for each precision?

byshiue commented 8 months ago

You need to compile a new engine if you want to run inference under another precision.
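
For example, a float16 build of the same model would just repeat the build command with a different --dtype (the flags mirror the command above; the output directory name is only illustrative):

python3 ./examples/llama/build.py --model_dir /models/llama-2-7b-hf --dtype float16 --max_batch_size 1 --max_input_len 3000 --max_output_len 1024 --output_dir /tensorrt_nvidia_build_fp16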

Anindyadeep commented 8 months ago

I see, got it. Thanks

Anindyadeep commented 8 months ago

Oh, also a quick question: does TensorRT-LLM support int8 and int4 quantization? (I saw in the code that it seems to use AWQ under the hood, correct me if I am wrong.)

Could you point me to documentation on how to build with quantization? And does this mean we need a separate build per precision?

byshiue commented 8 months ago

You can find the scripts for building quantized LLaMA models in https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama (other models have similar documentation).

If you want to run int4-AWQ and int8-weight-only, you need to build two separate engines.
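
As a rough sketch, the two builds would look like the following. The quantization flags are taken from the examples/llama README around that release and may have changed since, and the int4-AWQ build additionally assumes an AWQ-quantized checkpoint was produced beforehand with the quantization example; paths and output directory names are placeholders:

# int8 weight-only build (flag names per the llama example README of that release)
python3 ./examples/llama/build.py --model_dir /models/llama-2-7b-hf --dtype float16 --use_weight_only --weight_only_precision int8 --max_batch_size 1 --max_input_len 3000 --max_output_len 1024 --output_dir /tensorrt_nvidia_build_int8

# int4-AWQ build, assuming a quantized checkpoint was generated first (placeholder path)
python3 ./examples/llama/build.py --model_dir /models/llama-2-7b-hf --dtype float16 --use_weight_only --weight_only_precision int4_awq --per_group --quant_ckpt_path /models/llama-2-7b-awq.npz --max_batch_size 1 --max_input_len 3000 --max_output_len 1024 --output_dir /tensorrt_nvidia_build_int4_awq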