hongjunchoi92 opened 2 months ago
@byshiue Looking forward to any progress
Hello @byshiue
It seems like the Mistral 7B model is already supported: https://github.com/NVIDIA/TensorRT-LLM/blob/5ddb6bf218ed16a2dcf0058f20c59a247e180fd2/examples/llama/README.md?plain=1#L1072
If the model architecture is the same, would that mean we can also use the existing scripts/code for Mistral-Nemo? Or would architectural differences require new code changes?
We'd be happy to try it out with the existing scripts. Please let us know.
cc: @AdamzNV @ncomly-nvidia as well.
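For what it's worth, a quick way to check how close the architectures are is to diff the HF configs directly. A minimal sketch, assuming both checkpoints are reachable on the Hub under their public repo IDs (mistralai/Mistral-7B-v0.1 and mistralai/Mistral-Nemo-Base-2407):

from transformers import AutoConfig

# Compare the fields a LLaMA/Mistral converter derives weight shapes from.
# Repo IDs are assumptions; substitute your local checkpoint paths if needed.
mistral = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")
nemo = AutoConfig.from_pretrained("mistralai/Mistral-Nemo-Base-2407")
for field in ("hidden_size", "num_attention_heads", "num_key_value_heads",
              "head_dim", "intermediate_size", "vocab_size"):
    # head_dim may be absent on older transformers versions, hence the default
    print(field, getattr(mistral, field, None), getattr(nemo, field, None))

The notable difference is that Mistral-Nemo sets head_dim explicitly (128) rather than the usual hidden_size // num_attention_heads (which would be 5120 // 32 = 160), so any conversion code that derives the head size from that ratio will miscompute weight shapes.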
@byshiue @AdamzNV @ncomly-nvidia Can you help with this problem? Yesterday I tried to use the Mistral conversion path directly to convert and build a Mistral Nemo 12B engine, but an error occurred during the conversion phase. I used the SmoothQuant conversion method. The conversion script and error log are below. CC: @hongjunchoi92
Convert script:
TensorRT-LLM commit: ab49b937 (using this commit for Llama 3 + RoPE scaling)
tensorrtllm_backend commit: 97feb8f
python3 ./tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py --model_dir ${model_path} --output_dir ${convert_model_path} --dtype float16 --smoothquant 0.5 --per_token --per_channel --tp_size 1
Error log:
[TensorRT-LLM] TensorRT-LLM version: 0.11.0
Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/code/./tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py", line 461, in <module>
    main()
  File "/code/./tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py", line 453, in main
    convert_and_save_hf(args)
  File "/code/./tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py", line 339, in convert_and_save_hf
    LLaMAForCausalLM.quantize(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 411, in quantize
    convert.quantize(hf_model_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 1226, in quantize
    hf_model = AutoModelForCausalLM.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3838, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 4298, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 895, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/modeling.py", line 362, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([1024, 5120]) in "weight" (which has shape torch.Size([1280, 5120])), this look incorrect.
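If it helps with triage: the shapes in that ValueError are consistent with the converter deriving the head size as hidden_size // num_attention_heads instead of reading the explicit head_dim. A back-of-the-envelope check, assuming Mistral-Nemo's published config values (hidden_size=5120, 32 attention heads, 8 KV heads, head_dim=128):

# Assumed values from Mistral-Nemo's config.json
hidden_size, num_heads, num_kv_heads, head_dim = 5120, 32, 8, 128

# k_proj / v_proj weight rows = num_kv_heads * head_dim
print(num_kv_heads * head_dim)                    # 1024 -> what the checkpoint actually holds
print(num_kv_heads * (hidden_size // num_heads))  # 1280 -> what a converter assuming head_dim = 160 expects

which matches the torch.Size([1024, 5120]) vs torch.Size([1280, 5120]) mismatch in the log.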
Hello everyone!
Same issue here. Any news about the integration of this model? Is it related to the transformers version and this PR? https://github.com/huggingface/transformers/pull/32050
The logs are the following (pp_size and tp_size set to 1):
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 465, in load
    param.value = weights[name]
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/parameter.py", line 133, in value
    assert v.shape == self.shape, \
AssertionError: The value updated is not the same shape as the original. Updated: (6144, 5120), original: (7680, 5120)
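These shapes look like the same root cause as the SmoothQuant failure above, just surfacing in the fused QKV weight instead. Under the same assumed config values:

# Fused QKV rows = (num_heads + 2 * num_kv_heads) * head_dim
heads, kv_heads, hidden = 32, 8, 5120
print((heads + 2 * kv_heads) * 128)                # 6144 -> actual rows (explicit head_dim = 128)
print((heads + 2 * kv_heads) * (hidden // heads))  # 7680 -> expected rows with derived head_dim = 160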
@nv-guomingz Could you please take a look? Thanks
Hi @eleapttn, we've fixed this issue internally, and the corresponding fix will be pushed to the main branch in a coming weekly update.
Hi @QiJune, @nv-guomingz, Thanks a lot for your quick reply. I can't wait to test it!
This is working in 0.12. Good job! Does anyone have advice or documentation on optimizing engine builds for Mistral Nemo? I am currently experimenting with FP8 quantization on an H100 and finding it to be about 1/3 the speed of a similar quant of Llama 3.1 8B. I expected Nemo to be a bit slower, but not that much slower.
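Not an authoritative answer, but for anyone else benchmarking: the usual FP8 path in recent releases goes through examples/quantization/quantize.py and then trtllm-build. A sketch of the commands I would start from (paths are placeholders, and the exact flags may vary between versions, so double-check against the examples/quantization README):

# Flags assumed from the standard examples/quantization workflow; verify against your TRT-LLM version
python3 ./tensorrt_llm/examples/quantization/quantize.py --model_dir ${model_path} --dtype float16 --qformat fp8 --kv_cache_dtype fp8 --output_dir ${quantized_path} --calib_size 512
trtllm-build --checkpoint_dir ${quantized_path} --output_dir ${engine_path} --gemm_plugin float16

If the gap versus Llama 3.1 8B persists, it would be worth confirming both engines were built with the same max_batch_size / max_seq_len and plugin settings before attributing the slowdown to the model itself.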
https://mistral.ai/news/mistral-nemo/
Will Mistral Nemo models be supported in TensorRT-LLM in the near future?