NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Support for Mistral Nemo #1985

Open hongjunchoi92 opened 2 months ago

hongjunchoi92 commented 2 months ago

https://mistral.ai/news/mistral-nemo/

Will Mistral Nemo models be supported in TensorRT-LLM in the near future?

fan-niu commented 2 months ago

@byshiue Looking forward to any progress

hongjunchoi92 commented 2 months ago

Hello @byshiue

It seems like the Mistral 7B model is already supported: https://github.com/NVIDIA/TensorRT-LLM/blob/5ddb6bf218ed16a2dcf0058f20c59a247e180fd2/examples/llama/README.md?plain=1#L1072

If the model architecture is the same, does that mean we can also use the existing scripts / code for Mistral-Nemo? Or would architectural differences require new code changes?

I'd be happy to try it out with the existing scripts. Please let us know.
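
For reference, this is what I would try first: reusing the Mistral commands from the llama example and simply pointing them at the Nemo weights. A minimal sketch (paths are placeholders, and it assumes Nemo goes through the same LLaMA/Mistral code path, which is exactly what I'd like confirmed):

```bash
# Convert the HF checkpoint with the existing llama example script.
python3 examples/llama/convert_checkpoint.py \
    --model_dir ./Mistral-Nemo-Instruct-2407 \
    --output_dir ./tllm_checkpoint_nemo_1gpu \
    --dtype float16 \
    --tp_size 1

# Build the engine from the converted checkpoint.
trtllm-build \
    --checkpoint_dir ./tllm_checkpoint_nemo_1gpu \
    --output_dir ./engines/mistral-nemo-12b \
    --gemm_plugin float16
```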

cc: @AdamzNV @ncomly-nvidia as well.

fan-niu commented 2 months ago

@byshiue @AdamzNV @ncomly-nvidia Can you help with this? Yesterday I tried to use the Mistral workflow directly to convert and compile a Mistral Nemo 12B engine, but an error occurred during the conversion phase. I used the SmoothQuant conversion path. The conversion script and error log are below. CC: @hongjunchoi92

Convert script (TensorRT-LLM commit: ab49b937, the commit used for llama3 + rope scaling; tensorrtllm_backend commit: 97feb8f):

```bash
python3 ./tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py \
    --model_dir ${model_path} \
    --output_dir ${convert_model_path} \
    --dtype float16 \
    --smoothquant 0.5 \
    --per_token \
    --per_channel \
    --tp_size 1
```

Error log:

```
[TensorRT-LLM] TensorRT-LLM version: 0.11.0
0.11.0
Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/code/./tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py", line 461, in <module>
    main()
  File "/code/./tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py", line 453, in main
    convert_and_save_hf(args)
  File "/code/./tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py", line 339, in convert_and_save_hf
    LLaMAForCausalLM.quantize(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 411, in quantize
    convert.quantize(hf_model_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 1226, in quantize
    hf_model = AutoModelForCausalLM.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3838, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 4298, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 895, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/modeling.py", line 362, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([1024, 5120]) in "weight" (which has shape torch.Size([1280, 5120])), this look incorrect.
[TensorRT-LLM] TensorRT-LLM version: 0.11.0
```

eleapttn commented 2 months ago

Hello everyone!

Same issue here. Any news on the integration of this model? Could it be related to the transformers version and this PR? https://github.com/huggingface/transformers/pull/32050

The logs are the following (pp_size and tp_size set to 1):

```
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 465, in load
    param.value = weights[name]
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/parameter.py", line 133, in value
    assert v.shape == self.shape, \
AssertionError: The value updated is not the same shape as the original. Updated: (6144, 5120), original: (7680, 5120)
```
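
If I'm reading the shape mismatches right, this could be a head_dim problem rather than anything SmoothQuant-specific: Mistral-Nemo's config sets head_dim=128 explicitly, while code that derives it as hidden_size // num_attention_heads gets 160 and therefore expects larger projection weights. A quick sanity check with the published config values (my assumption; I haven't traced it through the TensorRT-LLM code):

```python
# Mistral-Nemo-12B config values (from its Hugging Face config.json)
hidden_size = 5120
num_attention_heads = 32
num_key_value_heads = 8
head_dim = 128                                  # set explicitly in the config
derived = hidden_size // num_attention_heads    # 160, what code gets if it ignores config.head_dim

# k/v projection rows: matches the 1024 vs 1280 mismatch in the earlier comment
print(num_key_value_heads * head_dim, num_key_value_heads * derived)   # 1024 1280

# fused QKV rows: matches the 6144 vs 7680 mismatch in the traceback above
fused_heads = num_attention_heads + 2 * num_key_value_heads
print(fused_heads * head_dim, fused_heads * derived)                   # 6144 7680
```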

QiJune commented 2 months ago

@nv-guomingz Could you please take a look? Thanks

nv-guomingz commented 2 months ago

Hi @eleapttn, we've fixed this issue internally and the corresponding fix will be pushed to the main branch in the coming weekly update.

eleapttn commented 2 months ago

Hi @QiJune, @nv-guomingz, thanks a lot for your quick reply. I can't wait to test it!

MatthewPeyrard commented 1 month ago

This is working in 0.12. Good job! Does anyone have advice or documentation that could help optimize engine builds for Mistral Nemo? I am currently experimenting with FP8 quants on an H100 and finding them to be about 1/3 the speed of a similar quant of Llama 3.1 8B. I expected Nemo to be a bit slower, but not that much slower.
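
For context, this is roughly the recipe I'm using; paths are placeholders and the flags are what I understand from the quantization example, so treat it as a sketch rather than a reference build:

```bash
# FP8 post-training quantization (runs a short calibration pass).
python3 examples/quantization/quantize.py \
    --model_dir ./Mistral-Nemo-Instruct-2407 \
    --dtype bfloat16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --output_dir ./ckpt_nemo_fp8

# Build the FP8 engine.
trtllm-build \
    --checkpoint_dir ./ckpt_nemo_fp8 \
    --output_dir ./engines/mistral-nemo-fp8 \
    --gemm_plugin auto \
    --max_batch_size 8
```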