NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Llama 7B SmoothQuant engine fails to build #555

Open ttim opened 10 months ago

ttim commented 10 months ago

On the 0.6.0 and 0.6.1 tags, building the Llama 7B SmoothQuant engine fails with:

Traceback (most recent call last):
  File "/code/tensorrt_llm/examples/llama/build.py", line 782, in <module>
    build(0, args)
  File "/code/tensorrt_llm/examples/llama/build.py", line 726, in build
    engine = build_rank_engine(builder, builder_config, engine_name,
  File "/code/tensorrt_llm/examples/llama/build.py", line 653, in build_rank_engine
    tensorrt_llm_llama(*inputs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    return self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 369, in forward
    hidden_states = super().forward(input_ids, position_ids, use_cache,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 251, in forward
    hidden_states = layer(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    return self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 117, in forward
    attention_output = self.attention(hidden_states,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    return self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/layers.py", line 1217, in forward
    buffer = constant(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 965, in constant
    weights = trt.Weights(np_dtype_to_trt(ndarray.dtype), ndarray.ctypes.data,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_utils.py", line 144, in np_dtype_to_trt
    assert ret is not None, f'Unsupported dtype: {dtype}'
AssertionError: Unsupported dtype: bool
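The assertion is raised by np_dtype_to_trt in tensorrt_llm/_utils.py, which looks up a numpy dtype in a mapping table of supported TensorRT dtypes; the SmoothQuant attention path creates a constant from a boolean numpy array, the table has no entry for bool, the lookup returns None, and the assert fires. A minimal standalone sketch of that failure mode (a simplified stand-in for illustration, not the actual TensorRT-LLM code):

```python
import numpy as np

# Simplified stand-in for the dtype lookup table in
# tensorrt_llm._utils.np_dtype_to_trt. TensorRT dtypes are represented
# as strings here; the point is that there is no entry for np.bool_.
_NP_TO_TRT = {
    np.dtype(np.float32): "trt.float32",
    np.dtype(np.float16): "trt.float16",
    np.dtype(np.int32): "trt.int32",
    np.dtype(np.int8): "trt.int8",
}

def np_dtype_to_trt_sketch(dtype):
    # Mirrors the failing code path: a None lookup trips the assert.
    ret = _NP_TO_TRT.get(np.dtype(dtype))
    assert ret is not None, f"Unsupported dtype: {np.dtype(dtype)}"
    return ret

# A boolean constant (e.g. an attention mask) hits the missing entry
# and raises the same AssertionError shown in the traceback above.
try:
    np_dtype_to_trt_sketch(np.bool_)
except AssertionError as e:
    print(e)  # Unsupported dtype: bool
```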

Steps to reproduce:

I used the command from the docs at https://github.com/NVIDIA/TensorRT-LLM/tree/v0.6.1/examples/llama

The same command works fine on the 0.5.0 tag.

renwuli commented 10 months ago

same issue here

Harahan commented 10 months ago

I modified the code in np_dtype_to_trt to handle the bool dtype, but then the model produced random characters as output...

Tracin commented 10 months ago

@ttim @renwuli @Harahan Hi, please add the --use_gpt_attention_plugin option; building without it is strongly discouraged. Also, --remove_input_padding and --enable_context_fmha can improve performance and memory usage.
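Concretely, the suggestion amounts to adding those flags to the build invocation. A sketch only: the model/output paths and the base SmoothQuant flags below are placeholder assumptions based on the v0.6 examples/llama docs, not a verified command; the three flags named in the comment above are the relevant additions.

```shell
# Hypothetical v0.6.x build; --model_dir and --output_dir are placeholders.
python examples/llama/build.py \
    --model_dir ./llama-7b-hf \
    --dtype float16 \
    --use_smooth_quant \
    --use_gpt_attention_plugin float16 \
    --remove_input_padding \
    --enable_context_fmha \
    --output_dir ./llama-7b-engine
```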