Open JoJoLev opened 1 week ago
Are you running the TensorRT-LLM Docker container? Try running pip install --upgrade transformers in the container before converting or building the Llama 3.2 engine.
TRT-LLM's highest transformers version is 4.42.4, but the config.json for Llama 3.2 expects transformers 4.45.
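To check for this kind of mismatch before converting, here is a minimal sketch in Python. It assumes the model's config.json carries a transformers_version field (Hugging Face normally writes one on save, but that is an assumption about your particular checkpoint). The helper names version_tuple, needs_upgrade and check_model_dir are made up for illustration and are not part of TensorRT-LLM or transformers.

```python
import json
import re
from importlib import metadata
from pathlib import Path


def version_tuple(version: str) -> tuple:
    """Return the leading numeric release, e.g. '4.45.0.dev0' -> (4, 45, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    parts = tuple(int(part) for part in match.group(0).split("."))
    # Pad to three components so '4.45' and '4.45.0' compare as equal.
    return parts + (0,) * max(0, 3 - len(parts))


def needs_upgrade(installed: str, required: str) -> bool:
    """True when the installed release is older than the one the model was saved with."""
    return version_tuple(installed) < version_tuple(required)


def check_model_dir(model_dir: str) -> bool:
    """Compare the installed transformers with the version recorded in config.json."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    required = config.get("transformers_version")  # assumption: field is present
    if required is None:
        return False  # nothing recorded, so nothing to compare against
    return needs_upgrade(metadata.version("transformers"), required)
```

With the versions mentioned in this thread, needs_upgrade("4.42.4", "4.45.0") returns True, which is the case where upgrading transformers inside the container is worth trying.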
I tried to build quantizations for the Llama 3.2 models. The following works for meta-llama/Llama-3.2-3B-Instruct:
time \
../scripts/quantize.py --model_dir "${MODEL_PATH}" \
--output_dir "${QUANT_PATH}-int8-kv-int4-awq" \
--dtype float16 \
--qformat w4a8_awq \
--awq_block_size 128 \
--kv_cache_dtype int8 \
--calib_size 32 \
--tp_size 4
time \
trtllm-build --checkpoint_dir "${QUANT_PATH}-int8-kv-int4-awq" \
--output_dir "${ENGINE_PATH}-int8-kv-int4-awq" \
--gemm_plugin auto \
--weight_streaming \
--max_batch_size 8
time \
mpirun -n 4 --allow-run-as-root ../scripts/summarize.py --test_trt_llm \
--hf_model_dir "${MODEL_PATH}" \
--tokenizer_dir "${MODEL_PATH}" \
--data_type fp16 \
--engine_dir "${ENGINE_PATH}-int8-kv-int4-awq" \
--test_hf
This uses tensorrt-llm==0.13.0 with transformers==4.45.2 and is based on:
https://github.com/NVIDIA/TensorRT-LLM/tree/v0.13.0/examples/llama#llama-v3-updates
Running the same quantization on meta-llama/Llama-3.2-11B-Vision-Instruct yields:
# trtllm-build --checkpoint_dir quantizations/meta-llama--llama-3.2-11b-vision-instruct-int8-kv-int4-awq --output_dir engines/meta-llama--llama-3.2-11b-vision-instruct-int8-kv-int4-awq --gemm_plugin auto --gpt_attention_plugin bfloat16 --weight_streaming --max_batch_size 8
[TensorRT-LLM] TensorRT-LLM version: 0.13.0
[10/12/2024-16:47:32] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set gpt_attention_plugin to bfloat16.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set gemm_plugin to auto.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set nccl_plugin to auto.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set lookup_plugin to None.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set lora_plugin to None.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set moe_plugin to auto.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set low_latency_gemm_plugin to None.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set context_fmha to True.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set remove_input_padding to True.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set reduce_fusion to False.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set enable_xqa to True.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set tokens_per_block to 64.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set multiple_profiles to False.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set paged_state to True.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set streamingllm to False.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set use_fused_mlp to True.
[10/12/2024-16:47:32] [TRT-LLM] [W] Implicitly setting LLaMAConfig.producer = {'name': 'modelopt', 'version': '0.15.1'}
[10/12/2024-16:47:32] [TRT-LLM] [W] Implicitly setting LLaMAConfig.bias = False
[10/12/2024-16:47:32] [TRT-LLM] [W] Implicitly setting LLaMAConfig.rotary_pct = 1.0
[10/12/2024-16:47:32] [TRT-LLM] [W] Implicitly setting LLaMAConfig.rank = 0
[10/12/2024-16:47:32] [TRT-LLM] [W] Implicitly setting LLaMAConfig.decoder = llama
[10/12/2024-16:47:32] [TRT-LLM] [W] Implicitly setting LLaMAConfig.rmsnorm = True
[10/12/2024-16:47:32] [TRT-LLM] [W] Implicitly setting LLaMAConfig.lm_head_bias = False
Traceback (most recent call last):
File "/usr/local/bin/trtllm-build", line 8, in <module>
sys.exit(main())
File "/home/runner/.local/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 575, in main
parallel_build(model_config, ckpt_dir, build_config, args.output_dir,
File "/home/runner/.local/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 429, in parallel_build
passed = build_and_save(rank, rank % workers, ckpt_dir,
File "/home/runner/.local/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 396, in build_and_save
engine = build_model(build_config,
File "/home/runner/.local/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 365, in build_model
model = model_cls.from_checkpoint(ckpt_dir, config=rank_config)
File "/home/runner/.local/lib/python3.10/site-packages/tensorrt_llm/models/modeling_utils.py", line 483, in from_checkpoint
preprocess_weights(weights, config, from_pruned=is_checkpoint_pruned)
File "/home/runner/.local/lib/python3.10/site-packages/tensorrt_llm/models/modeling_utils.py", line 1228, in preprocess_weights
activation_scaling_factor = weights.pop(
KeyError: 'transformer.layers.13.attention.dense.activation_scaling_factor'
Are "Vision" models supported, or is this an issue with Llama 3.2 support in general? I got the same results with TensorRT-LLM 0.14.0.dev2024100800.
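The KeyError above says one layer of the quantized checkpoint has no activation_scaling_factor tensor where the builder expects one. A quick way to see whether that affects a single layer or a pattern of layers is to scan the tensor names. The sketch below is only a diagnostic aid: it works on a plain list of names (which you could obtain, for example, from the keys() of a file opened with safetensors' safe_open), and the function name layers_missing and the sample names are hypothetical.

```python
import re
from collections import defaultdict

# Matches names such as 'transformer.layers.13.attention.dense.weight'.
LAYER_KEY = re.compile(r"^transformer\.layers\.(\d+)\.(.+)$")


def layers_missing(keys, suffix):
    """Return the sorted layer indices that have tensors but lack `suffix`."""
    per_layer = defaultdict(set)
    for key in keys:
        match = LAYER_KEY.match(key)
        if match:
            per_layer[int(match.group(1))].add(match.group(2))
    return sorted(i for i, names in per_layer.items() if suffix not in names)


# Hypothetical tensor names, only to show the call.
sample_keys = [
    "transformer.layers.12.attention.dense.weight",
    "transformer.layers.12.attention.dense.activation_scaling_factor",
    "transformer.layers.13.attention.dense.weight",
    "transformer.vocab_embedding.weight",
]
print(layers_missing(sample_keys, "attention.dense.activation_scaling_factor"))  # prints [13]
```

If the missing layers turn out to be regularly spaced, that would suggest those layers are structurally different from the rest, rather than the quantization having failed at random.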
The plain checkpoint extraction also fails for vision models:
+ ../scripts/convert_checkpoint.py --model_dir /home/runner/.cache/huggingface/hub/models--meta-llama--Llama-3.2-11B-Vision/snapshots/3f2e93603aaa5dd142f27d34b06dfa2b6e97b8be --output_dir checkpoints/meta-llama--llama-3.2-11b-vision --dtype float16
[TensorRT-LLM] TensorRT-LLM version: 0.13.0
0.13.0
Traceback (most recent call last):
File "/workspace/models/../scripts/convert_checkpoint.py", line 505, in <module>
main()
File "/workspace/models/../scripts/convert_checkpoint.py", line 497, in main
convert_and_save_hf(args)
File "/workspace/models/../scripts/convert_checkpoint.py", line 439, in convert_and_save_hf
execute(args.workers, [convert_and_save_rank] * world_size, args)
File "/workspace/models/../scripts/convert_checkpoint.py", line 446, in execute
f(args, rank)
File "/workspace/models/../scripts/convert_checkpoint.py", line 425, in convert_and_save_rank
llama = LLaMAForCausalLM.from_hugging_face(
File "/home/runner/.local/lib/python3.10/site-packages/tensorrt_llm/models/llama/model.py", line 320, in from_hugging_face
config = LLaMAConfig.from_hugging_face(hf_config_or_dir,
File "/home/runner/.local/lib/python3.10/site-packages/tensorrt_llm/models/llama/config.py", line 120, in from_hugging_face
hf_config.num_attention_heads)
File "/home/runner/.local/lib/python3.10/site-packages/transformers/configuration_utils.py", line 202, in __getattribute__
return super().__getattribute__(key)
AttributeError: 'MllamaConfig' object has no attribute 'num_attention_heads'
This is with tensorrt-llm==0.13.0 and transformers==4.45.2.
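The AttributeError above means the converter reads num_attention_heads from the top level of the config, and the vision model's config does not have it there. As far as I understand the Mllama layout, it is a composite config with nested text and vision sections, but treat that as an assumption and check your own config.json. A small sketch to locate where a field actually lives (find_key and the trimmed example dict are hypothetical):

```python
def find_key(config: dict, name: str, path: str = "") -> list:
    """Return the dotted path of every occurrence of `name` in a nested dict."""
    hits = []
    for key, value in config.items():
        here = f"{path}.{key}" if path else key
        if key == name:
            hits.append(here)
        if isinstance(value, dict):
            hits.extend(find_key(value, name, here))
    return hits


# Hypothetical, heavily trimmed stand-in for a multimodal config.json.
example = {
    "model_type": "mllama",
    "text_config": {"num_attention_heads": 32},
    "vision_config": {"attention_heads": 16},
}
print(find_key(example, "num_attention_heads"))  # prints ['text_config.num_attention_heads']
```

On a real model you would pass json.load(open("config.json")) instead of the example dict. If the field only shows up under a nested section, a converter written for the flat Llama config would need changes to handle it.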
Thanks for sharing, Malik, this worked for me! For some reason the checkpoint build for Llama doesn't work, though; I have to go through the quantization path.
@JoJoLev could you please share the code and relevant dependency versions that worked for you? Does the trtllm-build step work for you?
I'm seeing the same error as @malikkirchner when running the convert checkpoint command. It appears that convert_checkpoint.py cannot read the config.json file from meta-llama/Llama-3.2-11B-Vision-Instruct.
Mine works by going down the quantize.py path and building the engine from there for Llama 3.2 3B, with the transformers version mentioned above. I cannot build an engine with the checkpoint approach from examples/llama, which is how I normally go about it.
Thanks,
Jordan Leventis
Converting the checkpoints for Llama 3.2 models fails with the following error after a few iterations, although the same steps worked fine for the llama3-8B model. Can anyone look into it? Does it require changing convert_checkpoint.py?
117it [00:00, 649.28it/s]
Traceback (most recent call last):
File "/tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py", line 487, in
Just to report, I got a llama3.2-3B engine built on TensorRT-LLM 0.13 using the quantize.py setup. I am running it on Triton v24.09. Blazingly fast.
Can you share how you got it up and running on Triton?
Is Llama 3.2 supported by TensorRT-LLM? I tried an engine build but got a RoPE error. Maybe it is related to the context length? Thanks.
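When reporting a RoPE error it helps to include the model's rotary settings next to the traceback. A minimal sketch for pulling them out of config.json, assuming the usual Hugging Face key names (rope_scaling, max_position_embeddings) and, for composite configs, a nested text_config section. The function name rope_summary and the sample values are for illustration only.

```python
def rope_summary(config: dict) -> dict:
    """Collect the rotary-embedding settings that are worth quoting in a bug report."""
    # Assumption: composite (multimodal) configs keep text settings under "text_config".
    text = config.get("text_config", config)
    scaling = text.get("rope_scaling") or {}
    return {
        "max_position_embeddings": text.get("max_position_embeddings"),
        # Older configs use "type", newer ones "rope_type"; accept either.
        "rope_type": scaling.get("rope_type", scaling.get("type")),
        "factor": scaling.get("factor"),
    }


# Illustrative values only, not copied from any particular model.
sample = {
    "max_position_embeddings": 131072,
    "rope_scaling": {"rope_type": "llama3", "factor": 32.0},
}
print(rope_summary(sample))
```

Comparing this output with the max_seq_len or max_input_len you pass to trtllm-build would show whether the context length is a plausible cause.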