NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Qwen2-1.5B-Instruct convert_checkpoint.py failed #2388

Open 1994 opened 2 weeks ago

1994 commented 2 weeks ago

System Info

Who can help?

No response

Information

Tasks

Reproduction

I ran the conversion script inside this Docker image:

nvcr.io/nvidia/tritonserver:24.09-trtllm-python-py3

command:

python3 convert_checkpoint.py --model_dir --output_dir /tllm --dtype float16

model file:

https://huggingface.co/Qwen/Qwen2-1.5B-Instruct

exception stack:

[10/29/2024-17:08:19] [TRT-LLM] [W] Found pynvml==11.5.3 and cuda driver version 470.182.03. Please use pynvml>=11.5.0 and cuda driver>=526 to get accurate memory usage.
[TensorRT-LLM] TensorRT-LLM version: 0.13.0
0.13.0
229it [00:02, 93.96it/s] 
Traceback (most recent call last):
  File "/home/tensorrtllm_backend/tensorrt_llm/examples/qwen/convert_checkpoint.py", line 303, in <module>
    main()
  File "/home/tensorrtllm_backend/tensorrt_llm/examples/qwen/convert_checkpoint.py", line 295, in main
    convert_and_save_hf(args)
  File "/home/tensorrtllm_backend/tensorrt_llm/examples/qwen/convert_checkpoint.py", line 251, in convert_and_save_hf
    execute(args.workers, [convert_and_save_rank] * world_size, args)
  File "/home/tensorrtllm_backend/tensorrt_llm/examples/qwen/convert_checkpoint.py", line 258, in execute
    f(args, rank)
  File "/home/tensorrtllm_backend/tensorrt_llm/examples/qwen/convert_checkpoint.py", line 241, in convert_and_save_rank
    qwen = QWenForCausalLM.from_hugging_face(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/model.py", line 427, in from_hugging_face
    loader.generate_tllm_weights(model)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/model_weights_loader.py", line 357, in generate_tllm_weights
    self.load(tllm_key,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/model_weights_loader.py", line 278, in load
    v = sub_module.postprocess(tllm_key, v, **postprocess_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/layers/linear.py", line 391, in postprocess
    weights = weights.to(str_dtype_to_torch(self.dtype))
AttributeError: 'NoneType' object has no attribute 'to'
Exception ignored in: <function PretrainedModel.__del__ at 0x7f778f992050>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 453, in __del__
    self.release()
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 450, in release
    release_gc()
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_utils.py", line 471, in release_gc
    torch.cuda.ipc_collect()
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 901, in ipc_collect
    _lazy_init()
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 330, in _lazy_init
    raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: 'NoneType' object is not iterable
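
The last frame of the first traceback shows that `weights` was `None` when `postprocess` called `.to(...)`, so the loader apparently found no tensor for some key. One possible cause, stated here as an assumption and not confirmed anywhere in this thread, is a checkpoint that ties its input and output embeddings (`tie_word_embeddings` in `config.json`) and therefore stores no separate `lm_head.weight`. A minimal sketch for checking that with only the standard library follows; `safetensors_names` and `report_missing` are hypothetical helper names, and the expected tensor names are assumed from the usual Hugging Face layout:

```python
import json
import struct
from pathlib import Path


def safetensors_names(path):
    """Read tensor names from a .safetensors header without loading weights."""
    with open(path, "rb") as f:
        # The file starts with an 8-byte little-endian header length,
        # followed by a JSON header describing every tensor.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return {name for name in header if name != "__metadata__"}


def report_missing(model_dir,
                   expected=("model.embed_tokens.weight", "lm_head.weight")):
    """Report the tie_word_embeddings flag and which expected tensors are absent.

    The names in `expected` are an assumption, not taken from TensorRT-LLM.
    """
    model_dir = Path(model_dir)
    config = json.loads((model_dir / "config.json").read_text())

    # Collect tensor names across all shards in the directory.
    names = set()
    for shard in sorted(model_dir.glob("*.safetensors")):
        names |= safetensors_names(shard)

    return {
        "tie_word_embeddings": bool(config.get("tie_word_embeddings", False)),
        "missing": [name for name in expected if name not in names],
    }
```

Calling `report_missing` on the local model directory and printing the result shows whether the flag is set and whether `lm_head.weight` is absent. If both are true, that would be consistent with the loader receiving `None` for that weight, though it does not prove it.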

Expected behavior

The conversion succeeds.

Actual behavior

The conversion fails with the AttributeError shown in the stack trace above.

Additional notes

TensorRT-LLM 0.13.0
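
The first line of the log warns that CUDA driver 470.182.03 is below the 526 minimum that TensorRT-LLM asks for to report memory usage accurately. Whether that is related to the conversion failure is not established here. A small sketch of that version comparison, with hypothetical helper names:

```python
def parse_version(text):
    """Turn a dotted version string such as '470.182.03' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def driver_meets_minimum(driver, minimum="526"):
    """Return True if the driver version is at least the stated minimum."""
    # Tuples compare element by element, so (470, 182, 3) < (526,).
    return parse_version(driver) >= parse_version(minimum)


print(driver_meets_minimum("470.182.03"))  # the driver reported in the log
```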

yatoooon commented 2 weeks ago

Same issue here.

nv-guomingz commented 1 week ago

Hi @1994, you can try applying the hotfix below for the Qwen2 1.5B model on the latest code base.

Image