NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0
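For context, a minimal sketch of the high-level Python API described above, assuming the LLM entry point available in recent tensorrt_llm releases (the model name here is only an example):

from tensorrt_llm import LLM, SamplingParams

# Build (or load a cached) TensorRT engine for the model, then run inference on it.
llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct")
params = SamplingParams(temperature=0.8, top_p=0.95)
for output in llm.generate(["Hello, my name is"], params):
    print(output.outputs[0].text)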

Error: convert_checkpoint in TensorRT-LLM for Llama3.2 3B when tested on multiple versions #2471

Open DeekshithaDPrakash opened 2 days ago

DeekshithaDPrakash commented 2 days ago

System Info

GPU: A100
OS: Ubuntu 22.04.4 LTS

Command:

CONVERT_CHKPT_SCRIPT=/opt/tritonserver/TensorRT_LLM_KARI/TensorRT-LLM/examples/llama/convert_checkpoint.py
python3 ${CONVERT_CHKPT_SCRIPT} --model_dir ${LLAMA_MODEL} --output_dir ${UNIFIED_CKPT_PATH} --dtype float16
[TensorRT-LLM] TensorRT-LLM version: 0.13.0
0.13.0
[11/20/2024-08:06:21] [TRT-LLM] [W] AutoConfig cannot load the huggingface config.
Traceback (most recent call last):
  File "/opt/tritonserver/TensorRT_LLM_KARI/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 503, in <module>
    main()
  File "/opt/tritonserver/TensorRT_LLM_KARI/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 495, in main
    convert_and_save_hf(args)
  File "/opt/tritonserver/TensorRT_LLM_KARI/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 437, in convert_and_save_hf
    execute(args.workers, [convert_and_save_rank] * world_size, args)
  File "/opt/tritonserver/TensorRT_LLM_KARI/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 444, in execute
    f(args, rank)
  File "/opt/tritonserver/TensorRT_LLM_KARI/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 423, in convert_and_save_rank
    llama = LLaMAForCausalLM.from_hugging_face(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 320, in from_hugging_face
    config = LLaMAConfig.from_hugging_face(hf_config_or_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/config.py", line 101, in from_hugging_face
    hf_config = transformers.AutoConfig.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 989, in from_pretrained
    return config_class.from_dict(config_dict, **unused_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 772, in from_dict
    config = cls(**config_dict)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/configuration_llama.py", line 161, in __init__
    self._rope_scaling_validation()
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/configuration_llama.py", line 182, in _rope_scaling_validation
    raise ValueError(
ValueError: `rope_scaling` must be a dictionary with two fields, `type` and `factor`, got {'factor': 32.0, 'high_freq_factor': 4.0, 'low_freq_factor': 1.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'}
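The check that raises here only accepts the two-field {'type', 'factor'} form of rope_scaling, while the Llama 3.2 config.json uses the extended 'llama3' form, so this transformers version cannot parse it. A quick way to confirm what the checkpoint actually contains (a sketch; the path is taken from the reproduction steps below):

import json

# Inspect the rope_scaling entry of the Hugging Face config shipped with the model.
with open("/opt/tritonserver/TensorRT_LLM/llama3_2_model/config.json") as f:
    print(json.load(f).get("rope_scaling"))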

Upgrading transformers to version 4.45.2 gives the following error:

[TensorRT-LLM] TensorRT-LLM version: 0.13.0
0.13.0
201it [00:00, 260.82it/s]
Traceback (most recent call last):
  File "/opt/tritonserver/TensorRT_LLM_KARI/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 503, in <module>
    main()
  File "/opt/tritonserver/TensorRT_LLM_KARI/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 495, in main
    convert_and_save_hf(args)
  File "/opt/tritonserver/TensorRT_LLM_KARI/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 437, in convert_and_save_hf
    execute(args.workers, [convert_and_save_rank] * world_size, args)
  File "/opt/tritonserver/TensorRT_LLM_KARI/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 444, in execute
    f(args, rank)
  File "/opt/tritonserver/TensorRT_LLM_KARI/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 423, in convert_and_save_rank
    llama = LLaMAForCausalLM.from_hugging_face(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 358, in from_hugging_face
    loader.generate_tllm_weights(model)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/model_weights_loader.py", line 357, in generate_tllm_weights
    self.load(tllm_key,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/model_weights_loader.py", line 278, in load
    v = sub_module.postprocess(tllm_key, v, **postprocess_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/layers/linear.py", line 391, in postprocess
    weights = weights.to(str_dtype_to_torch(self.dtype))
AttributeError: 'NoneType' object has no attribute 'to'
Exception ignored in: <function PretrainedModel.__del__ at 0x7fec26679870>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 453, in __del__
    self.release()
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 450, in release
    release_gc()
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_utils.py", line 471, in release_gc
    torch.cuda.ipc_collect()
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 901, in ipc_collect
    _lazy_init()
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 330, in _lazy_init
    raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: 'NoneType' object is not iterable

CUDA call was originally invoked at:

  File "/opt/tritonserver/TensorRT_LLM_KARI/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 8, in <module>
    from transformers import AutoConfig
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/transformers/__init__.py", line 26, in <module>
    from . import dependency_versions_check
  File "<frozen importlib._bootstrap>", line 1078, in _handle_fromlist
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/transformers/dependency_versions_check.py", line 16, in <module>
    from .utils.versions import require_version, require_version_core
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 992, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/__init__.py", line 27, in <module>
    from .chat_template_utils import DocstringParsingException, TypeHintParsingException, get_json_schema
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/chat_template_utils.py", line 39, in <module>
    from torch import Tensor
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/torch/__init__.py", line 1689, in <module>
    _C._initExtension(manager_path())
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 1467, in <module>
    _lazy_call(_register_triton_kernels)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 253, in _lazy_call
    _queued_calls.append((callable, traceback.format_stack()))

The same error occurs with other TensorRT-LLM versions, such as 0.14, on tritonserver:24.08-trtllm-python-py3, tritonserver:24.09-trtllm-python-py3, and tritonserver:24.10-trtllm-python-py3.

I am now assuming that this is a bug, as there are multiple users facing the same issue: #2467, #2339, #2320

Who can help?

No response

Information

Tasks

Reproduction

Steps to reproduce the behavior:

  1. sudo docker run -it --net host --shm-size=4g --name triton_llm_llama32 --ulimit memlock=-1 --ulimit stack=67108864 --gpus '"device=1"' -v /local_mnt_folder/:/opt/tritonserver/TensorRT_LLM nvcr.io/nvidia/tritonserver:24.09-trtllm-python-py3
  2. apt-get update
     apt-get install -y openmpi-bin openmpi-common libopenmpi-dev git git-lfs
  3. cd TensorRT_LLM
  4. git clone -b v0.13.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
     git clone --branch v0.13.0 https://github.com/NVIDIA/TensorRT-LLM.git
  5. CONVERT_CHKPT_SCRIPT=/opt/tritonserver/TensorRT_LLM/TensorRT-LLM/examples/llama/convert_checkpoint.py
     LLAMA_MODEL=/opt/tritonserver/TensorRT_LLM/llama3_2_model
     UNIFIED_CKPT_PATH=/opt/tritonserver/TensorRT_LLM/ckpt/llaam32/3b
     ENGINE_DIR=/opt/tritonserver/TensorRT_LLM/engines/1-gpu
  6. python3 ${CONVERT_CHKPT_SCRIPT} --model_dir ${LLAMA_MODEL} --output_dir ${UNIFIED_CKPT_PATH} --dtype float16

Expected behavior

convert_checkpoint.py runs successfully and creates two files inside the checkpoint folder (a quick check is sketched after this list):

  1. config.json
  2. rank0.safetensors
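A minimal sketch of that check, assuming UNIFIED_CKPT_PATH is exported as in the reproduction steps:

import os

# Expected contents after a successful conversion (names per the list above).
print(sorted(os.listdir(os.environ["UNIFIED_CKPT_PATH"])))   # ['config.json', 'rank0.safetensors']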

Actual behavior

A lower transformers version gives the rope_scaling error, while a higher version (>=4.45.1, as required by Llama 3.2) gives the CUDA error: torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: 'NoneType' object is not iterable

Additional notes

I think there is surely a version mismatch.
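To narrow down the suspected mismatch, a minimal version dump inside the container (a sketch; these version attributes are standard in the installed packages):

import tensorrt_llm, transformers, torch

# Print the versions of the three packages involved in the failing conversion.
print("tensorrt_llm:", tensorrt_llm.__version__)
print("transformers:", transformers.__version__)
print("torch:", torch.__version__, "/ CUDA:", torch.version.cuda)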

jayakommuru commented 2 days ago

+1 Facing the same issue

jayakommuru commented 2 days ago

Hi @byshiue, can you help with this?

byshiue commented 1 day ago

For the stable branch, Llama 3.2 is supported since release 0.15.

If you want to run the test now, you need to deploy using the main branch.
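For reference, the pre-release wheels built from main can usually be installed with the command below (a sketch; the exact flags and index URL may change between releases), and the examples/ scripts should then be checked out at the matching commit:

pip3 install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com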

DeekshithaDPrakash commented 1 day ago

For the stable branch, Llama 3.2 is supported since release 0.15.

If you want to run the test now, you need to deploy using the main branch.

@byshiue Thank you for your response. I truly appreciate your guidance.

I will test the following and update here soon.

  1. https://github.com/triton-inference-server/tensorrtllm_backend.git
  2. tritonserver:24.10-trtllm-python-py3
  3. https://github.com/NVIDIA/TensorRT-LLM.git

DeekshithaDPrakash commented 5 hours ago

I tested the stable branches of both TensorRT-LLM and tensorrtllm_backend with several tensorrt-llm versions:

The error still remains when convert_checkpoint.py is executed.

Command:

python3 ${CONVERT_CHKPT_SCRIPT} --model_dir ${LLAMA_MODEL} --output_dir ${UNIFIED_CKPT_PATH} --dtype float16

I upgraded the tensorrt-llm version inside the Docker container using: pip install -U tensorrt-llm==version_no

  1. tensorrt-llm version 0.14.0

Error:

/usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libpng16.so.16: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
[TensorRT-LLM] TensorRT-LLM version: 0.14.0
Traceback (most recent call last):
  File "/opt/tritonserver/TensorRT_LLM_KARI/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 16, in <module>
    from tensorrt_llm.models.convert_utils import infer_dtype
ImportError: cannot import name 'infer_dtype' from 'tensorrt_llm.models.convert_utils' (/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/convert_utils.py)

  2. tensorrt-llm version 0.15.0.dev2024101500

Error:

Traceback (most recent call last):
  File "/opt/tritonserver/TensorRT_LLM_KARI/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 10, in <module>
    import tensorrt_llm
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/__init__.py", line 32, in <module>
    import tensorrt_llm.functional as functional
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 28, in <module>
    from . import graph_rewriting as gw
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/graph_rewriting.py", line 12, in <module>
    from .network import Network
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/network.py", line 27, in <module>
    from tensorrt_llm.module import Module
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 17, in <module>
    from ._common import default_net
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_common.py", line 37, in <module>
    from ._utils import str_dtype_to_trt
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_utils.py", line 31, in <module>
    from tensorrt_llm.bindings import GptJsonConfig
ImportError: /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c106detail14torchCheckFailEPKcS2_jRKSs

  3. tensorrt-llm version 0.15.0.dev2024102900

Error:

[TensorRT-LLM] TensorRT-LLM version: 0.15.0.dev2024102900
Traceback (most recent call last):
  File "/opt/tritonserver/TensorRT_LLM/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 16, in <module>
    from tensorrt_llm.models.convert_utils import infer_dtype
ImportError: cannot import name 'infer_dtype' from 'tensorrt_llm.models.convert_utils' (/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/convert_utils.py)

A clear solution has not been found yet.
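One thing I still want to rule out is a mismatch between the checked-out examples/ scripts and the wheel that pip installed, since the failing import means the installed package does not provide a symbol the script expects. A minimal check (a sketch; run inside the same container):

from tensorrt_llm.models import convert_utils

# True only if the installed wheel ships the symbol convert_checkpoint.py imports.
print(hasattr(convert_utils, "infer_dtype"))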