TobyGE opened this issue 5 months ago
Please follow the issue template to provide your environment and reproduction steps. Thank you for your cooperation.
Getting the same error for the Gemma model:
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052800
Traceback (most recent call last):
File "/userhome/home/sagdesai/work/gemma-trt/TensorRT-LLM/examples/gemma/../run.py", line 602, in <module>
main(args)
File "/userhome/home/sagdesai/work/gemma-trt/TensorRT-LLM/examples/gemma/../run.py", line 451, in main
runner = runner_cls.from_dir(**runner_kwargs)
File "/userhome/home/sagdesai/.del-venv/lib/python3.10/site-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 163, in from_dir
executor = trtllm.Executor(
RuntimeError: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaMallocAsync(ptr, n, mCudaStream->get()): out of memory (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmBuffers.h:118)
https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/gemma#run-inference-under-bfloat16-for-hf-checkpoint
Installed requirements.txt for Gemma.
python3 ./convert_checkpoint.py \
--ckpt-type hf \
--model-dir /sadata/models_hf/gemma-2b-it/ \
--dtype bfloat16 \
--world-size 1 \
--output-model-dir /userhome/home/sagdesai/work/gemma-trt
trtllm-build --checkpoint_dir /userhome/home/***/work/gemma-trt \
--gemm_plugin bfloat16 \
--gpt_attention_plugin bfloat16 \
--max_batch_size 8 \
--max_input_len 3000 \
--max_output_len 100 \
--output_dir /userhome/home/***/work/gemma-trt/engin_dir
python3 ../run.py --engine_dir /userhome/home/sagdesai/work/gemma-trt/engin_dir \
--max_output_len 30 \
--max_attention_window_size 100 \
--vocab_file /sadata/models_hf/gemma-2b-it/tokenizer.model
Error:
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052800
Traceback (most recent call last):
File "/userhome/home/sagdesai/work/gemma-trt/TensorRT-LLM/examples/gemma/../run.py", line 602, in <module>
main(args)
File "/userhome/home/sagdesai/work/gemma-trt/TensorRT-LLM/examples/gemma/../run.py", line 451, in main
runner = runner_cls.from_dir(**runner_kwargs)
File "/userhome/home/sagdesai/.del-venv/lib/python3.10/site-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 163, in from_dir
executor = trtllm.Executor(
RuntimeError: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaMallocAsync(ptr, n, mCudaStream->get()): out of memory (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmBuffers.h:118)
1 0x7f931be4e69e void tensorrt_llm::common::check<cudaError>(cudaError, char const*, char const*, int) + 94
2 0x7f931d88862c tensorrt_llm::runtime::BufferManager::gpu(unsigned long, nvinfer1::DataType) const + 236
3 0x7f931d94a4ec tensorrt_llm::runtime::TllmRuntime::TllmRuntime(void const*, unsigned long, float, nvinfer1::ILogger&) + 540
4 0x7f931db2e426 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(int, std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, bool, tensorrt_llm::executor::SchedulerConfig const&, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1366
5 0x7f931daefea2 tensorrt_llm::batch_manager::TrtGptModelFactory::create(std::vector<unsigned char, std::allocator<unsigned char> > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::executor::SchedulerConfig const&, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1058
6 0x7f931db5a7d7 tensorrt_llm::executor::Executor::Impl::createModel(std::vector<unsigned char, std::allocator<unsigned char> > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 807
7 0x7f931db5b1c1 tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 2113
8 0x7f931db513d2 tensorrt_llm::executor::Executor::Executor(std::filesystem::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 50
9 0x7f939b4e2ee2 /userhome/home/sagdesai/.del-venv/lib/python3.10/site-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xadee2) [0x7f939b4e2ee2]
10 0x7f939b48b26c /userhome/home/sagdesai/.del-venv/lib/python3.10/site-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x5626c) [0x7f939b48b26c]
11 0x55b34dc3310e python3(+0x15a10e) [0x55b34dc3310e]
12 0x55b34dc29a7b _PyObject_MakeTpCall + 603
13 0x55b34dc41c20 python3(+0x168c20) [0x55b34dc41c20]
14 0x55b34dc3e087 python3(+0x165087) [0x55b34dc3e087]
15 0x55b34dc29e2b python3(+0x150e2b) [0x55b34dc29e2b]
16 0x7f939b48a88b /userhome/home/sagdesai/.del-venv/lib/python3.10/site-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x5588b) [0x7f939b48a88b]
17 0x55b34dc29a7b _PyObject_MakeTpCall + 603
18 0x55b34dc22629 _PyEval_EvalFrameDefault + 27257
19 0x55b34dc417f1 python3(+0x1687f1) [0x55b34dc417f1]
20 0x55b34dc42492 PyObject_Call + 290
21 0x55b34dc1e5d7 _PyEval_EvalFrameDefault + 10791
22 0x55b34dc339fc _PyFunction_Vectorcall + 124
23 0x55b34dc1c26d _PyEval_EvalFrameDefault + 1725
24 0x55b34dc189c6 python3(+0x13f9c6) [0x55b34dc189c6]
25 0x55b34dd0e256 PyEval_EvalCode + 134
26 0x55b34dd39108 python3(+0x260108) [0x55b34dd39108]
27 0x55b34dd329cb python3(+0x2599cb) [0x55b34dd329cb]
28 0x55b34dd38e55 python3(+0x25fe55) [0x55b34dd38e55]
29 0x55b34dd38338 _PyRun_SimpleFileObject + 424
30 0x55b34dd37f83 _PyRun_AnyFileObject + 67
31 0x55b34dd2aa5e Py_RunMain + 702
32 0x55b34dd0102d Py_BytesMain + 45
33 0x7f9553db6d90 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f9553db6d90]
34 0x7f9553db6e40 __libc_start_main + 128
35 0x55b34dd00f25 _start + 37
*** The MPI_Comm_free() function was called after MPI_FINALIZE was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[exp-blr-dgxa100-04.expblr.dc:2894239] Local abort after MPI_FINALIZE started completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
Working on a DGX A100; multiple GPUs are free.
python --version: Python 3.10.12
Could you try these two cases:
- run.py with CUDA_VISIBLE_DEVICES=0
- run.py with --use_py_session
The example code for mpt-7b works fine with the older version (20240123), but after updating to the latest branch the new code always hits the OOM error with multiple GPUs, even on 8×A100 40G.