NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

"use_embedding_sharing" option not working for llama model. #2226

Closed: jxchenus closed this 3 weeks ago

jxchenus commented 1 month ago

System Info

Who can help?

@kaiyux

Information

Tasks

Reproduction

We have a customized llama model with shared embeddings. We were able to convert the model with TensorRT-LLM v0.9.0 convert-checkpoint.py script and "use_embedding_sharing" option:

```
python /opt/amazon/alexa_triton_inference_engine/NeMoRT-TensorRT-LLM/examples/llama/convert_checkpoint.py \
    --model_dir $CONTAINER_LLM_CKPT_DIR \
    --output_dir $CONTAINER_LLM_CKPT_DIR/ckpt_tp$LLM_TP_DEGREE/ \
    --tp_size $LLM_TP_DEGREE \
    --dtype bfloat16 \
    --use_embedding_sharing \
    --use_parallel_embedding
```

When we use the same command on the same model in TensorRT-LLM main branch, we find the following warning:

"lm_head.weight and transformer.vocab_embedding.weight are not identical..."

logged from the check_share_embedding function. This function requires that lm_head_weight and vocab_embed_weights both be present and identical, but in our case lm_head_weight is None; therefore, check_share_embedding returns False.

Furthermore, if I modify the above function to return True when lm_head_weight is None, a KeyError is thrown from this line. This is because of this line.

Finally, a question about this line: why do the weights need to be cloned?

Expected behavior

convert_checkpoint.py should behave the same as it did in v0.9.0.

Actual behavior

convert_checkpoint.py does not honor embedding sharing even when --use_embedding_sharing option is specified.

Additional notes

Also tried the above command with TRTLLM_DISABLE_UNIFIED_CONVERTER=1 and got a different error:

```
Traceback (most recent call last):
  File "/opt/amazon/alexa_triton_inference_engine/NeMoRT-TensorRT-LLM/examples/llama/convert_checkpoint.py", line 497, in <module>
    main()
  File "/opt/amazon/alexa_triton_inference_engine/NeMoRT-TensorRT-LLM/examples/llama/convert_checkpoint.py", line 489, in main
    convert_and_save_hf(args)
  File "/opt/amazon/alexa_triton_inference_engine/NeMoRT-TensorRT-LLM/examples/llama/convert_checkpoint.py", line 431, in convert_and_save_hf
    execute(args.workers, [convert_and_save_rank] * world_size, args)
  File "/opt/amazon/alexa_triton_inference_engine/NeMoRT-TensorRT-LLM/examples/llama/convert_checkpoint.py", line 438, in execute
    f(args, rank)
  File "/opt/amazon/alexa_triton_inference_engine/NeMoRT-TensorRT-LLM/examples/llama/convert_checkpoint.py", line 417, in convert_and_save_rank
    llama = LLaMAForCausalLM.from_hugging_face(
  File "/opt/amazon/alexa_triton_inference_engine/NeMoRT-TensorRT-LLM/tensorrt_llm/models/llama/model.py", line 373, in from_hugging_face
    check_share_embedding(weights, config)
  File "/opt/amazon/alexa_triton_inference_engine/NeMoRT-TensorRT-LLM/tensorrt_llm/models/modeling_utils.py", line 1290, in check_share_embedding
    if (weights["lm_head.weight"] -
TypeError: unsupported operand type(s) for -: 'NoneType' and 'Tensor'
```

lkm2835 commented 1 month ago

Hi @jxchenus, can you check this PR #2232?

jxchenus commented 1 month ago

Hi @lkm2835 Thank you for looking into this! I looked at your PR. Unfortunately I don't believe it addresses the issue I was seeing. In my case, my model does not contain weights for lm_head.weight, and the following warning is logged in model_weights_loader.py's check_share_embedding function:

"lm_head.weight and transformer.vocab_embedding.weight are not identical..."

So in my test case, modeling_utils.py's check_share_embedding is not in the code path.

Barry-Delaney commented 1 month ago

Hi @jxchenus, thanks for the feedback. Currently, the ModelWeightsLoader cannot handle the case of shared embeddings. This line needs to be updated for your case: it should add another condition for when one of vocab_embedding and lm_head is None.

Let me fix it and update here once the fix gets merged into this repo.

jxchenus commented 1 month ago

Thank you @Barry-Delaney! I verified the patch provided by Vivian Chen @ NVIDIA.

I still have a question about this line. Why do the weights need to be cloned?

Barry-Delaney commented 1 month ago

This is a leftover from previous commits to keep the postprocess() function working in similar cases: if the two names point at the same portion of weights but --use_embedding_sharing is not enabled, the function duplicates it. This will be removed later.

github-actions[bot] commented 3 weeks ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.