NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Llama 7B SmoothQuant engine fails to build #555

Open ttim opened 10 months ago

ttim commented 10 months ago

On the 0.6.0 and 0.6.1 tags, building the Llama 7B SmoothQuant engine fails with:

Traceback (most recent call last):
  File "/code/tensorrt_llm/examples/llama/build.py", line 782, in <module>
    build(0, args)
  File "/code/tensorrt_llm/examples/llama/build.py", line 726, in build
    engine = build_rank_engine(builder, builder_config, engine_name,
  File "/code/tensorrt_llm/examples/llama/build.py", line 653, in build_rank_engine
    tensorrt_llm_llama(*inputs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    return self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 369, in forward
    hidden_states = super().forward(input_ids, position_ids, use_cache,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 251, in forward
    hidden_states = layer(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    return self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 117, in forward
    attention_output = self.attention(hidden_states,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    return self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/layers.py", line 1217, in forward
    buffer = constant(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 965, in constant
    weights = trt.Weights(np_dtype_to_trt(ndarray.dtype), ndarray.ctypes.data,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_utils.py", line 144, in np_dtype_to_trt
    assert ret is not None, f'Unsupported dtype: {dtype}'
AssertionError: Unsupported dtype: bool
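The assertion is raised by np_dtype_to_trt in tensorrt_llm/_utils.py, which looks up a numpy dtype in a mapping table of supported TensorRT dtypes; the SmoothQuant attention path creates a constant from a boolean numpy array, the table has no entry for bool, the lookup returns None, and the assert fires. A minimal standalone sketch of that failure mode (a simplified stand-in for illustration, not the actual TensorRT-LLM code):

```python
import numpy as np

# Simplified stand-in for the dtype lookup table in
# tensorrt_llm._utils.np_dtype_to_trt. TensorRT dtypes are represented
# as strings here; the point is that there is no entry for np.bool_.
_NP_TO_TRT = {
    np.dtype(np.float32): "trt.float32",
    np.dtype(np.float16): "trt.float16",
    np.dtype(np.int32): "trt.int32",
    np.dtype(np.int8): "trt.int8",
}

def np_dtype_to_trt_sketch(dtype):
    # Mirrors the failing code path: a None lookup trips the assert.
    ret = _NP_TO_TRT.get(np.dtype(dtype))
    assert ret is not None, f"Unsupported dtype: {np.dtype(dtype)}"
    return ret

# A boolean constant (e.g. an attention mask) hits the missing entry
# and raises the same AssertionError shown in the traceback above.
try:
    np_dtype_to_trt_sketch(np.bool_)
except AssertionError as e:
    print(e)  # Unsupported dtype: bool
```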

Steps to reproduce:

I used the command from the docs at https://github.com/NVIDIA/TensorRT-LLM/tree/v0.6.1/examples/llama

The same command works fine on the 0.5.0 tag.

renwuli commented 10 months ago

same issue here

Harahan commented 10 months ago

I modified the code in np_dtype_to_trt to handle the bool dtype, but then the model produced random characters as output...

Tracin commented 10 months ago

@ttim @renwuli @Harahan Hi, please add the --use_gpt_attention_plugin option; building without it is strongly discouraged. Also, --remove_input_padding and --enable_context_fmha can improve performance and memory usage.
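Concretely, the suggestion amounts to adding those flags to the build invocation. A sketch only: the model/output paths and the base SmoothQuant flags below are placeholder assumptions based on the v0.6 examples/llama docs, not a verified command; the three flags named in the comment above are the relevant additions.

```shell
# Hypothetical v0.6.x build; --model_dir and --output_dir are placeholders.
python examples/llama/build.py \
    --model_dir ./llama-7b-hf \
    --dtype float16 \
    --use_smooth_quant \
    --use_gpt_attention_plugin float16 \
    --remove_input_padding \
    --enable_context_fmha \
    --output_dir ./llama-7b-engine
```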