NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Out of Memory error after updating to the latest branch #1060

Open TobyGE opened 5 months ago

TobyGE commented 5 months ago

The example instructions for mpt-7b work fine with the older version (20240123), but after updating to the latest branch and re-running with the new code, I always get an OOM error with multiple GPUs, even on 8x A100 40 GB.

[TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024020600
Traceback (most recent call last):
  File "/TensorRT-LLM/examples/mpt/../run.py", line 504, in <module>
    main(args)
  File "/TensorRT-LLM/examples/mpt/../run.py", line 379, in main
    runner = runner_cls.from_dir(**runner_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 169, in from_dir
    session = GptSession(config=session_config,
RuntimeError: [TensorRT-LLM][ERROR] CUDA runtime error in error: out of memory (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/runtime/ipcUtils.cpp:48)
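
Since the failure is a plain CUDA out-of-memory raised while the runtime sets up its IPC buffers (ipcUtils.cpp in the error above), a quick sanity check is to confirm how much memory is actually free on every GPU the job will use before launching. This is a minimal sketch, not part of the original report, and it assumes nvidia-smi is available on the machine:

# Print per-GPU used/free memory right before starting the multi-GPU run
nvidia-smi --query-gpu=index,name,memory.used,memory.free --format=csv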

byshiue commented 4 months ago

Please follow the issue template to provide your environment and reproduction steps. Thank you for your cooperation.

SDcodehub commented 1 month ago

I'm getting the same error for the Gemma model.

[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052800
Traceback (most recent call last):
  File "/userhome/home/sagdesai/work/gemma-trt/TensorRT-LLM/examples/gemma/../run.py", line 602, in <module>
    main(args)
  File "/userhome/home/sagdesai/work/gemma-trt/TensorRT-LLM/examples/gemma/../run.py", line 451, in main
    runner = runner_cls.from_dir(**runner_kwargs)
  File "/userhome/home/sagdesai/.del-venv/lib/python3.10/site-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 163, in from_dir
    executor = trtllm.Executor(
RuntimeError: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaMallocAsync(ptr, n, mCudaStream->get()): out of memory (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmBuffers.h:118)

SDcodehub commented 1 month ago
  1. Cloned https://github.com/NVIDIA/TensorRT-LLM.git, following https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/gemma#run-inference-under-bfloat16-for-hf-checkpoint
  2. Installed requirements.txt for Gemma
  3. python3 ./convert_checkpoint.py \
    --ckpt-type hf \
    --model-dir  /sadata/models_hf/gemma-2b-it/ \
    --dtype bfloat16 \
    --world-size 1 \
    --output-model-dir /userhome/home/sagdesai/work/gemma-trt
trtllm-build --checkpoint_dir /userhome/home/***/work/gemma-trt \
             --gemm_plugin bfloat16 \
             --gpt_attention_plugin bfloat16 \
             --max_batch_size 8 \
             --max_input_len 3000 \
             --max_output_len 100 \
             --output_dir /userhome/home/***/work/gemma-trt/engin_dir
python3 ../run.py --engine_dir /userhome/home/sagdesai/work/gemma-trt/engin_dir \
                  --max_output_len 30 \
                  --max_attention_window_size 100 \
                  --vocab_file /sadata/models_hf/gemma-2b-it/tokenizer.model

Error:

[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052800
Traceback (most recent call last):
  File "/userhome/home/sagdesai/work/gemma-trt/TensorRT-LLM/examples/gemma/../run.py", line 602, in <module>
    main(args)
  File "/userhome/home/sagdesai/work/gemma-trt/TensorRT-LLM/examples/gemma/../run.py", line 451, in main
    runner = runner_cls.from_dir(**runner_kwargs)
  File "/userhome/home/sagdesai/.del-venv/lib/python3.10/site-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 163, in from_dir
    executor = trtllm.Executor(
RuntimeError: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaMallocAsync(ptr, n, mCudaStream->get()): out of memory (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmBuffers.h:118)
1       0x7f931be4e69e void tensorrt_llm::common::check<cudaError>(cudaError, char const*, char const*, int) + 94
2       0x7f931d88862c tensorrt_llm::runtime::BufferManager::gpu(unsigned long, nvinfer1::DataType) const + 236
3       0x7f931d94a4ec tensorrt_llm::runtime::TllmRuntime::TllmRuntime(void const*, unsigned long, float, nvinfer1::ILogger&) + 540
4       0x7f931db2e426 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(int, std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, bool, tensorrt_llm::executor::SchedulerConfig const&, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1366
5       0x7f931daefea2 tensorrt_llm::batch_manager::TrtGptModelFactory::create(std::vector<unsigned char, std::allocator<unsigned char> > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::executor::SchedulerConfig const&, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1058
6       0x7f931db5a7d7 tensorrt_llm::executor::Executor::Impl::createModel(std::vector<unsigned char, std::allocator<unsigned char> > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 807
7       0x7f931db5b1c1 tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 2113
8       0x7f931db513d2 tensorrt_llm::executor::Executor::Executor(std::filesystem::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 50
9       0x7f939b4e2ee2 /userhome/home/sagdesai/.del-venv/lib/python3.10/site-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xadee2) [0x7f939b4e2ee2]
10      0x7f939b48b26c /userhome/home/sagdesai/.del-venv/lib/python3.10/site-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x5626c) [0x7f939b48b26c]
11      0x55b34dc3310e python3(+0x15a10e) [0x55b34dc3310e]
12      0x55b34dc29a7b _PyObject_MakeTpCall + 603
13      0x55b34dc41c20 python3(+0x168c20) [0x55b34dc41c20]
14      0x55b34dc3e087 python3(+0x165087) [0x55b34dc3e087]
15      0x55b34dc29e2b python3(+0x150e2b) [0x55b34dc29e2b]
16      0x7f939b48a88b /userhome/home/sagdesai/.del-venv/lib/python3.10/site-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x5588b) [0x7f939b48a88b]
17      0x55b34dc29a7b _PyObject_MakeTpCall + 603
18      0x55b34dc22629 _PyEval_EvalFrameDefault + 27257
19      0x55b34dc417f1 python3(+0x1687f1) [0x55b34dc417f1]
20      0x55b34dc42492 PyObject_Call + 290
21      0x55b34dc1e5d7 _PyEval_EvalFrameDefault + 10791
22      0x55b34dc339fc _PyFunction_Vectorcall + 124
23      0x55b34dc1c26d _PyEval_EvalFrameDefault + 1725
24      0x55b34dc189c6 python3(+0x13f9c6) [0x55b34dc189c6]
25      0x55b34dd0e256 PyEval_EvalCode + 134
26      0x55b34dd39108 python3(+0x260108) [0x55b34dd39108]
27      0x55b34dd329cb python3(+0x2599cb) [0x55b34dd329cb]
28      0x55b34dd38e55 python3(+0x25fe55) [0x55b34dd38e55]
29      0x55b34dd38338 _PyRun_SimpleFileObject + 424
30      0x55b34dd37f83 _PyRun_AnyFileObject + 67
31      0x55b34dd2aa5e Py_RunMain + 702
32      0x55b34dd0102d Py_BytesMain + 45
33      0x7f9553db6d90 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f9553db6d90]
34      0x7f9553db6e40 __libc_start_main + 128
35      0x55b34dd00f25 _start + 37
*** The MPI_Comm_free() function was called after MPI_FINALIZE was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[exp-blr-dgxa100-04.expblr.dc:2894239] Local abort after MPI_FINALIZE started completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

Working on a DGX A100; multiple GPUs are free.


python --version
Python 3.10.12
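
One hedged thing that might be worth trying before re-running (a sketch, not a suggestion made in the thread): the engine above was built with --max_batch_size 8 and --max_input_len 3000, and the runtime generally sizes its pre-allocated buffers from those build-time maxima, so rebuilding with smaller limits should shrink the upfront allocation that fails here. The values and the engin_dir_small output path below are arbitrary placeholders; only flags already used in the reproduction steps appear:

trtllm-build --checkpoint_dir /userhome/home/***/work/gemma-trt \
             --gemm_plugin bfloat16 \
             --gpt_attention_plugin bfloat16 \
             --max_batch_size 1 \
             --max_input_len 512 \
             --max_output_len 100 \
             --output_dir /userhome/home/***/work/gemma-trt/engin_dir_small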

byshiue commented 1 month ago

Could you try these two cases (sketched as full commands after the list):

  1. run run.py with CUDA_VISIBLE_DEVICES=0.
  2. run run.py with --use_py_session.
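
For reference, a minimal sketch of both suggested runs, reusing the engine and tokenizer paths from the reproduction steps above (the exact flag combinations are an assumption, not commands verified in this thread):

# Case 1: restrict the process to a single visible GPU
CUDA_VISIBLE_DEVICES=0 python3 ../run.py --engine_dir /userhome/home/sagdesai/work/gemma-trt/engin_dir \
                  --max_output_len 30 \
                  --vocab_file /sadata/models_hf/gemma-2b-it/tokenizer.model

# Case 2: fall back to the Python session instead of the C++ executor
python3 ../run.py --use_py_session \
                  --engine_dir /userhome/home/sagdesai/work/gemma-trt/engin_dir \
                  --max_output_len 30 \
                  --vocab_file /sadata/models_hf/gemma-2b-it/tokenizer.model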