NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Error convert_checkpoint in TensorRT-LLM 0.13.0 for Llama3.2 3B #2467

Open yspch2022 opened 5 days ago

yspch2022 commented 5 days ago

Hello, I failed to convert Llama 3.2 3B to a TRT-LLM checkpoint when I ran convert_checkpoint.py (similar to this issue: https://github.com/NVIDIA/TensorRT-LLM/issues/2339). I would like to know whether Llama 3.2 3B model conversion is currently unsupported, and if so, when it will be supported.


Environment: Windows and Ubuntu 22.04, TensorRT-LLM 0.13.0
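
For reference, the conversion command followed the standard examples/llama pattern. This is a sketch with placeholder paths; the exact flags in the failing run may have differed:

```
python convert_checkpoint.py --model_dir ./Llama-3.2-3B --output_dir ./tllm_ckpt_llama32_3b --dtype bfloat16
```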

Error message:

```
[TensorRT-LLM] TensorRT-LLM version: 0.13.0
0.13.0
201it [00:08, 24.13it/s]
Traceback (most recent call last):
  File "D:\test_llama\TensorRT-LLM-0.13.0\examples\llama\convert_checkpoint.py", line 503, in <module>
    main()
  File "D:\test_llama\TensorRT-LLM-0.13.0\examples\llama\convert_checkpoint.py", line 495, in main
    convert_and_save_hf(args)
  File "D:\test_llama\TensorRT-LLM-0.13.0\examples\llama\convert_checkpoint.py", line 437, in convert_and_save_hf
    execute(args.workers, [convert_and_save_rank] * world_size, args)
  File "D:\test_llama\TensorRT-LLM-0.13.0\examples\llama\convert_checkpoint.py", line 444, in execute
    f(args, rank)
  File "D:\test_llama\TensorRT-LLM-0.13.0\examples\llama\convert_checkpoint.py", line 423, in convert_and_save_rank
    llama = LLaMAForCausalLM.from_hugging_face(
  File "D:\envs\test_llama-v4\lib\site-packages\tensorrt_llm\models\llama\model.py", line 358, in from_hugging_face
    loader.generate_tllm_weights(model)
  File "D:\envs\test_llama-v4\lib\site-packages\tensorrt_llm\models\model_weights_loader.py", line 357, in generate_tllm_weights
    self.load(tllm_key,
  File "D:\envs\test_llama-v4\lib\site-packages\tensorrt_llm\models\model_weights_loader.py", line 278, in load
    v = sub_module.postprocess(tllm_key, v, **postprocess_kwargs)
  File "D:\envs\test_llama-v4\lib\site-packages\tensorrt_llm\layers\linear.py", line 391, in postprocess
    weights = weights.to(str_dtype_to_torch(self.dtype))
AttributeError: 'NoneType' object has no attribute 'to'
Exception ignored in: <function PretrainedModel.__del__ at 0x000001C175AF3640>
Traceback (most recent call last):
  File "D:\envs\test_llama-v4\lib\site-packages\tensorrt_llm\models\modeling_utils.py", line 453, in __del__
    self.release()
  File "D:\envs\test_llama-v4\lib\site-packages\tensorrt_llm\models\modeling_utils.py", line 450, in release
    release_gc()
  File "D:\envs\test_llama-v4\lib\site-packages\tensorrt_llm\_utils.py", line 471, in release_gc
    torch.cuda.ipc_collect()
  File "D:\envs\test_llama-v4\lib\site-packages\torch\cuda\__init__.py", line 904, in ipc_collect
    _lazy_init()
  File "D:\envs\test_llama-v4\lib\site-packages\torch\cuda\__init__.py", line 333, in _lazy_init
    raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: 'NoneType' object is not iterable
```

CUDA call was originally invoked at:

File "D:\test_llama\TensorRT-LLM-0.13.0\examples\llama\convert_checkpoint.py", line 8, in from transformers import AutoConfig File "", line 1027, in _find_and_load File "", line 1006, in _find_and_load_unlocked File "", line 688, in _load_unlocked File "", line 883, in exec_module File "", line 241, in _call_with_frames_removed File "D:\envs\test_llama-v4\lib\site-packages\transformers__init.py", line 26, in from . import dependency_versions_check File "", line 1078, in _handle_fromlist File "", line 241, in _call_with_frames_removed File "", line 1027, in _find_and_load File "", line 1006, in _find_and_load_unlocked File "", line 688, in _load_unlocked File "", line 883, in exec_module File "", line 241, in _call_with_frames_removed File "D:\envs\test_llama-v4\lib\site-packages\transformers\dependency_versions_check.py", line 16, in from .utils.versions import require_version, require_version_core File "", line 1027, in _find_and_load File "", line 992, in _find_and_load_unlocked File "", line 241, in _call_with_frames_removed File "", line 1027, in _find_and_load File "", line 1006, in _find_and_load_unlocked File "", line 688, in _load_unlocked File "", line 883, in exec_module File "", line 241, in _call_with_frames_removed File "D:\envs\test_llama-v4\lib\site-packages\transformers\utils__init.py", line 27, in from .chat_template_utils import DocstringParsingException, TypeHintParsingException, get_json_schema File "", line 1027, in _find_and_load File "", line 1006, in _find_and_load_unlocked File "", line 688, in _load_unlocked File "", line 883, in exec_module File "", line 241, in _call_with_frames_removed File "D:\envs\test_llama-v4\lib\site-packages\transformers\utils\chat_template_utils.py", line 39, in from torch import Tensor File "", line 1027, in _find_and_load File "", line 1006, in _find_and_load_unlocked File "", line 688, in _load_unlocked File "", line 883, in exec_module File "", line 241, in _call_with_frames_removed File "D:\envs\test_llama-v4\lib\site-packages\torch\init__.py", line 1694, in _C._initExtension(_manager_path()) File "", line 1027, in _find_and_load File "", line 1006, in _find_and_load_unlocked File "", line 688, in _load_unlocked File "", line 883, in exec_module File "", line 241, in _call_with_frames_removed File "D:\envs\test_llama-v4\lib\site-packages\torch\cuda\init.py", line 1470, in _lazy_call(_register_triton_kernels) File "D:\envs\test_llama-v4\lib\site-packages\torch\cuda\init__.py", line 256, in _lazy_call _queued_calls.append((callable, traceback.format_stack()))

hello-11 commented 4 days ago

@yspch2022 Could you try the latest version of TensorRT-LLM?
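
For reference, upgrading the pip wheel typically looks like the following (a sketch; the NVIDIA package index URL is the one given in the installation docs):

```
pip install --upgrade tensorrt_llm --extra-index-url https://pypi.nvidia.com
```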

jayakommuru commented 4 days ago

@hello-11 I am using the latest Triton container, nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3, which ships TensorRT-LLM 0.14.0, but I am still facing the same issue with 0.14.0.

yspch2022 commented 4 days ago

> @yspch2022 Could you try the latest version of TensorRT-LLM?

@hello-11 Yes. For Llama 3.2 3B I tried 0.14.0 first and then 0.13.0, but neither worked. On 0.14.0, even the Llama 3 8B conversion failed in a similar way - see https://github.com/NVIDIA/TensorRT-LLM/issues/2452.