ttim opened 10 months ago
same issue here
I modified the code in np_dtype_to_trt. However, I got random letters as output...
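For context, a minimal sketch of the kind of numpy-to-TensorRT dtype mapping that np_dtype_to_trt performs. This is an assumption-laden illustration, not the actual TensorRT-LLM source: the real function returns trt.DataType members, which are replaced by strings here so the sketch runs without TensorRT installed, and the int8 entry is the sort of addition a SmoothQuant checkpoint might require.

```python
import numpy as np

# Hypothetical stand-in for TensorRT-LLM's np_dtype_to_trt. The real
# function returns trt.DataType enum values; strings are used here so the
# sketch is self-contained without a TensorRT install.
_NP_TO_TRT = {
    np.dtype(np.float32): "DataType.FLOAT",
    np.dtype(np.float16): "DataType.HALF",
    np.dtype(np.int32):   "DataType.INT32",
    # Entry a SmoothQuant (int8) checkpoint would plausibly need:
    np.dtype(np.int8):    "DataType.INT8",
}

def np_dtype_to_trt_sketch(dtype):
    """Map a numpy dtype to a (stand-in) TensorRT dtype name."""
    try:
        return _NP_TO_TRT[np.dtype(dtype)]
    except KeyError:
        raise TypeError(f"Unsupported numpy dtype: {dtype}")
```

Note that silently mis-mapping a dtype here (rather than raising) is exactly the kind of change that can produce garbage tokens at runtime, since the engine would reinterpret the weight bytes.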
@ttim @renwuli @Harahan Hi, please add the --use_gpt_attention_plugin option; building without it is strongly discouraged. Also, --remove_input_padding and --enable_context_fmha can improve performance and memory usage.
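Putting the suggestion together with the reproduction command from this thread, the build invocation might look like the following. This is a hedged sketch: the paths and size limits are copied from the commands in this thread, and the assumption that --use_gpt_attention_plugin takes a dtype argument should be checked against the docs for your tag.

```shell
# Same build.py invocation as in the reproduction steps, with the
# recommended attention-plugin flags added (flag spellings per the
# v0.6.x llama example docs; verify against your checkout).
python examples/llama/build.py \
  --ft_model_dir=repos/smooth_llama_2_7B/sq0.5/1-gpu/ \
  --use_smooth_quant \
  --dtype float16 \
  --use_gpt_attention_plugin float16 \
  --remove_input_padding \
  --enable_context_fmha \
  --output_dir engines/llama-2-7b/sq0.5 \
  --max_input_len 100 --max_output_len 200 --max_batch_size 512
```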
On the 0.6.0 or 0.6.1 tags, building the Llama 7B engine fails. Commands to reproduce:
python3 examples/llama/hf_llama_convert.py -i repos/llama-2-7b -o repos/smooth_llama_2_7B/sq0.5/ -sq 0.5 --tensor-parallelism 1 --storage-type fp16
python examples/llama/build.py --ft_model_dir=repos/smooth_llama_2_7B/sq0.5/1-gpu/ --use_smooth_quant --dtype float16 --output_dir engines/llama-2-7b/sq0.5 --max_input_len 100 --max_output_len 200 --max_batch_size 512
The commands I used are taken from the docs at https://github.com/NVIDIA/TensorRT-LLM/tree/v0.6.1/examples/llama
It works fine on the 0.5.0 tag.