NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

[0.11.0] Model Conversion OOMed in P4D.24xl #2093


lanking520 commented 1 month ago

System Info

A100 40GB x8, Ubuntu 22.04

Who can help?

No response

Reproduction

python3 llama/convert_checkpoint.py --model_dir meta-llama/Meta-Llama-3-70B-Instruct --dtype float16 --output_dir /tmp/trtllm_llama_ckpt/ --tp_size 8 --pp_size 1 --workers 8

I tried to use 8 workers to convert the model in parallel, but TensorRT-LLM ran out of GPU memory (OOM).
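For a rough sense of scale (my own back-of-the-envelope, not taken from the logs): a full fp16 copy of a ~70B-parameter model is far larger than one 40 GB A100, so any conversion worker that materializes the whole model on a single GPU will OOM, while a single TP=8 shard would fit comfortably.

# Back-of-the-envelope only; assumes ~70B parameters and fp16 (2 bytes/param).
params = 70e9
bytes_per_param = 2  # float16
full_copy_gib = params * bytes_per_param / 2**30
print(f"full fp16 copy : ~{full_copy_gib:.0f} GiB")      # ~130 GiB vs 40 GiB per A100
print(f"one TP=8 shard : ~{full_copy_gib / 8:.0f} GiB")  # ~16 GiB, fits on one GPU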

Expected behavior

The model conversion should complete without running out of GPU memory.

Actual behavior

INFO  LmiUtils convert_py: Loading checkpoint shards: 100%|██████████| 30/30 [00:42<00:00,  1.22s/it]
INFO  LmiUtils convert_py: Loading checkpoint shards: 100%|██████████| 30/30 [00:42<00:00,  1.42s/it]
INFO  LmiUtils convert_py: Traceback (most recent call last):
INFO  LmiUtils convert_py:   File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/build_scripts/llama/convert_checkpoint.py", line 409, in execute
INFO  LmiUtils convert_py:     future.result()
INFO  LmiUtils convert_py:   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
INFO  LmiUtils convert_py:     return self.__get_result()
INFO  LmiUtils convert_py:   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
INFO  LmiUtils convert_py:     raise self._exception
INFO  LmiUtils convert_py:   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
INFO  LmiUtils convert_py:     result = self.fn(*self.args, **self.kwargs)
INFO  LmiUtils convert_py:   File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/build_scripts/llama/convert_checkpoint.py", line 367, in convert_and_save_rank
INFO  LmiUtils convert_py:     llama = LLaMAForCausalLM.from_hugging_face(
INFO  LmiUtils convert_py:   File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 317, in from_hugging_face
INFO  LmiUtils convert_py:     weights = load_weights_from_hf_model(hf_model, config)
INFO  LmiUtils convert_py:   File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 1128, in load_weights_from_hf_model
INFO  LmiUtils convert_py:     convert_layer(l)
INFO  LmiUtils convert_py:   File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 1006, in convert_layer
INFO  LmiUtils convert_py:     mlp_gate_weight = get_weight(model_params, prefix + 'mlp.up_proj',
INFO  LmiUtils convert_py:   File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 431, in get_weight
INFO  LmiUtils convert_py:     config[prefix + '.weight'].data = config[prefix + '.weight'].to(dtype)
INFO  LmiUtils convert_py: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU
INFO  LmiUtils convert_py: [the same traceback is printed again by a second worker]

Additional notes

Changing the number of workers to 1 mitigates the issue, but the conversion is then very slow; a single-worker run is shown below.
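For reference, this is the same reproduction command with only --workers changed to 1:

python3 llama/convert_checkpoint.py --model_dir meta-llama/Meta-Llama-3-70B-Instruct --dtype float16 --output_dir /tmp/trtllm_llama_ckpt/ --tp_size 8 --pp_size 1 --workers 1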

Kefeng-Duan commented 3 weeks ago

@lanking520 Please enable load_by_shard when you set 8 workers; otherwise each worker loads the full weights.
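Assuming the packaged convert_checkpoint.py exposes the same --load_by_shard flag as the upstream TensorRT-LLM llama example (an assumption about this toolkit build, not confirmed in the thread), the suggested invocation would look like:

python3 llama/convert_checkpoint.py --model_dir meta-llama/Meta-Llama-3-70B-Instruct --dtype float16 --output_dir /tmp/trtllm_llama_ckpt/ --tp_size 8 --pp_size 1 --workers 8 --load_by_shard

The intent, per the comment above, is that each worker then reads the HF checkpoint shard by shard instead of holding a full copy of the weights in memory.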