NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

L4 Llama2/Mistral gets CUDA error when benchmarking (not when running) #1187

Closed: robmsmt closed this issue 4 months ago

robmsmt commented 8 months ago

System Info

Using defaults from repo:

nvidia-ammo==0.7.2
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.3.101
nvidia-nvtx-cu12==12.1.105
tensorrt==9.2.0.post12.dev5
tensorrt-bindings==9.2.0.post12.dev5
tensorrt-libs==9.2.0.post12.dev5
tensorrt-llm==0.9.0.dev2024020600

Note that the node pool's GPU driver is installed automatically when creating the pool with:

gcloud container node-pools create "node-pool-l4" \
--accelerator type=nvidia-l4,count=1,gpu-driver-version=latest \
--machine-type g2-standard-32 \
--disk-type pd-balanced \
--disk-size 300GB \
...

+---------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory  |
|        ID   ID                                                             Usage       |
|=========================================================================================|
|  No running processes found                                                             |
+---------------------------------------------------------------------------------------+

- OS (Ubuntu 22.04, CentOS 7, Windows 10): ubuntu22.04
- Any other information that may be useful in reproducing the bug

### Who can help?

_No response_

### Information

- [X] The official example scripts
- [ ] My own modified scripts

### Tasks

- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)

### Reproduction

Follow the main installation steps at https://github.com/NVIDIA/TensorRT-LLM/tree/main?tab=readme-ov-file#installation

Running / Setup:
Note: here I use Mistral because the model is easier to get from the HF Hub than Llama, but the results are the same.

Mistral uses the llama example directory:

cd examples/llama
rm -rf ./mistralai/Mistral-7B-v0.1
mkdir -p ./mistralai/Mistral-7B-v0.1 && git clone https://huggingface.co/mistralai/Mistral-7B-v0.1 ./mistralai/Mistral-7B-v0.1

python3 convert_checkpoint.py --model_dir ./mistralai/Mistral-7B-v0.1/ \
    --dtype float16 \
    --output_dir ./mistralai/Mistral-7B-v0.1/trt_ckpt/fp16/1-gpu/

trtllm-build --checkpoint_dir ./mistralai/Mistral-7B-v0.1/trt_ckpt/fp16/1-gpu/ \
    --gemm_plugin float16 \
    --output_dir ./mistralai/Mistral-7B-v0.1/trt_engines/fp16/1-gpu/ \
    --max_input_len 32256


Testing:

WORKS

python3 ../run.py --max_output_len=20 \
    --tokenizer_dir=./mistralai/Mistral-7B-v0.1/ \
    --engine_dir=./mistralai/Mistral-7B-v0.1/trt_engines/fp16/1-gpu/ \
    --max_attention_window_size=4096

WORKS

python3 ../summarize.py --test_trt_llm \
    --hf_model_dir ./mistralai/Mistral-7B-v0.1/ \
    --data_type fp16 \
    --engine_dir ./mistralai/Mistral-7B-v0.1/trt_engines/fp16/1-gpu/


This fails:
`python3 ../../benchmarks/python/benchmark.py -m llama_7b --mode plugin --batch_size "1" --input_output_len "1,1"`

### Expected behavior

Benchmarking Llama2/Mistral should not crash.

### Actual behavior

Both Llama2-7B and Mistral-7B are small enough that, at FP16, there should still be about 10 GB of VRAM left after loading the model. It doesn't make sense that the model can run but cannot be benchmarked.

root@l4-trt-llm-rob-model-669fd852rzqk:/TensorRT-LLM/examples/llama# python3 ../../benchmarks/python/benchmark.py -m llama_7b --mode plugin --batch_size "1" --input_output_len "1,1"
[TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024020600
Traceback (most recent call last):
  File "/TensorRT-LLM/examples/llama/../../benchmarks/python/benchmark.py", line 405, in <module>
    main(args)
  File "/TensorRT-LLM/examples/llama/../../benchmarks/python/benchmark.py", line 299, in main
    benchmarker = GPTBenchmark(args, batch_size_options, in_out_len_options,
  File "/TensorRT-LLM/benchmarks/python/gpt_benchmark.py", line 166, in __init__
    self.decoder = tensorrt_llm.runtime.GenerationSession(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 492, in __init__
    self.runtime = _Runtime(engine_buffer, mapping)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 155, in __init__
    self.prepare(mapping, engine_buffer)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 178, in prepare
    address = CUASSERT(cudart.cudaMalloc(self.engine.device_memory_size))[0]
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 104, in CUASSERT
    raise RuntimeError(
RuntimeError: CUDA ERROR: 2, error code reference: https://nvidia.github.io/cuda-python/module/cudart.html#cuda.cudart.cudaError_t
Exception ignored in: <function _Runtime.__del__ at 0x7a1f697c0af0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 282, in __del__
    cudart.cudaFree(self.address)  # FIXME: cudaFree is None??
AttributeError: '_Runtime' object has no attribute 'address'
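
The failing call in the trace is cudaMalloc(self.engine.device_memory_size), i.e. allocating the activation workspace after the weights are already on the GPU. Below is a minimal sketch for comparing that workspace size against the memory actually free on the L4; it assumes the build above produced a rank0.engine file in the --output_dir (the file name is an assumption) and is not part of the benchmark's own code:

    # Hedged sketch (not TensorRT-LLM code): compare the engine's activation
    # workspace requirement against the GPU memory that is actually free.
    # ENGINE_PATH and the "rank0.engine" file name are assumptions; adjust to
    # whatever trtllm-build wrote into your --output_dir.
    import tensorrt as trt
    from cuda import cudart

    ENGINE_PATH = "./mistralai/Mistral-7B-v0.1/trt_engines/fp16/1-gpu/rank0.engine"

    logger = trt.Logger(trt.Logger.WARNING)
    with open(ENGINE_PATH, "rb") as f:
        # Deserializing loads the weights onto the GPU, so the "free" value below
        # already accounts for the roughly 13 GiB of FP16 weights.
        engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

    err, free, total = cudart.cudaMemGetInfo()
    print(f"activation workspace needed: {engine.device_memory_size / 2**30:.2f} GiB")
    print(f"free / total GPU memory:     {free / 2**30:.2f} / {total / 2**30:.2f} GiB")

If the workspace does not fit into what remains after the weights, cudaMalloc fails with error 2 exactly as above. Note that this inspects the engine built with trtllm-build, while benchmark.py -m llama_7b sets up its own engine configuration, so the workspace it tries to allocate may differ.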



### additional notes

Running Llama2 / Mistral fails on the L4, but the Bloom model does not.
kaiyux commented 8 months ago

Hi @robmsmt, the log you shared contains this line:

RuntimeError: CUDA ERROR: 2, error code reference: https://nvidia.github.io/cuda-python/module/cudart.html#cuda.cudart.cudaError_t

According to the error code reference at https://nvidia.github.io/cuda-python/module/cudart.html#cuda.cudart.cudaError_t linked in the message, error code 2 means:

cudaErrorMemoryAllocation = 2: The API call failed because it was unable to allocate enough memory or other resources to perform the requested operation.

So I think this is an OOM issue: there is not enough memory for the activations. If you want more detail on the memory usage of that engine, please take a look at the build log after setting the log level to info.

Hope that helps, thank you.
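
As an aside, here is a hedged sketch of how one might inspect the build-time limits that drive the activation and KV-cache sizing, assuming trtllm-build wrote a config.json next to the engine (the key names below are an assumption and may differ between TensorRT-LLM versions):

    # Hedged sketch: print the build-time limits recorded next to the engine.
    # Assumes trtllm-build wrote a config.json into the engine directory; the
    # "build_config" layout and key names may vary across versions.
    import json

    with open("./mistralai/Mistral-7B-v0.1/trt_engines/fp16/1-gpu/config.json") as f:
        config = json.load(f)

    build = config.get("build_config", config)  # fall back if the layout differs
    for key in ("max_batch_size", "max_input_len", "max_output_len", "max_beam_width"):
        print(key, "=", build.get(key))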

arunraman commented 8 months ago

@robmsmt Any reason you are passing -m llama_7b in the benchmark.py for mistral engine?

robmsmt commented 8 months ago

> @robmsmt Any reason you are passing -m llama_7b in the benchmark.py for mistral engine?

python3 ../../benchmarks/python/benchmark.py -m mistral_7b --mode plugin --batch_size "1" --input_output_len "1,1"

benchmark.py: error: argument -m/--model: invalid choice: 'mistral_7b' (choose from 'bloom_176b', 'flan_t5_small', 'bloom_560m', 'bert_large', 'flan_t5_xl', 'opt_6.7b', 'gpt_next_2b', 'llama_7b', 'gptj_6b', 'opt_66b', 'mamba_370m', 'mamba_2.8b', 'flan_t5_large', 'baichuan2_13b_chat', 'chatglm_6b', 'falcon_rw_1b', 'internlm_chat_7b', 'gpt_350m_moe', 'mbart_large_50_many_to_one_mmt', 'flan_t5_base', 'mixtral_8x7b', 'chatglm2_6b', 'opt_350m', 'falcon_40b', 'gptneox_20b', 't5_3b', 'bert_base', 'falcon_180b', 'llama_13b', 'baichuan_7b', 'baichuan_13b_chat', 'qwen_7b_chat', 'mamba_790m', 'internlm_chat_20b', 'mamba_130m', 'roberta_base', 'llama_30b', 'opt_2.7b', 'qwen_14b_chat', 'whisper_large_v3', 'gpt_350m', 'bart_large_cnn', 'llama_70b_long_context', 'llama_70b_long_generation', 'gpt_1.5b', 'falcon_7b', 't5_11b', 't5_base', 'flan_t5_xxl', 'llama_70b', 'mamba_1.4b', 'gpt_350m_sq_per_token_channel', 'chatglm3_6b', 't5_large', 't5_small', 'baichuan2_7b_chat', 'gpt_175b', 'gpt_350m_sq_per_tensor', 'llama_70b_sq_per_tensor')

There is no Mistral entry in the benchmark's model list. I just used Mistral as an example, but Llama2 gives the same error (it is just slightly harder to reproduce, since you need Meta's approval for the model on the HF Hub).

robmsmt commented 8 months ago

> Hi @robmsmt, the log you shared contains this line:
>
> RuntimeError: CUDA ERROR: 2, error code reference: https://nvidia.github.io/cuda-python/module/cudart.html#cuda.cudart.cudaError_t
>
> According to the error code reference at https://nvidia.github.io/cuda-python/module/cudart.html#cuda.cudart.cudaError_t linked in the message, error code 2 means:
>
> cudaErrorMemoryAllocation = 2: The API call failed because it was unable to allocate enough memory or other resources to perform the requested operation.
>
> So I think this is an OOM issue: there is not enough memory for the activations. If you want more detail on the memory usage of that engine, please take a look at the build log after setting the log level to info.
>
> Hope that helps, thank you.

Thanks for the reply @kaiyux. For a 7B model (Llama2 or Mistral) at FP16, 24 GB of VRAM should be enough GPU memory; do you agree? (A rough estimate is sketched at the end of this comment.)

Also, why does it run fine with:

python3 ../run.py --max_output_len=20 \
               --tokenizer_dir=./mistralai/Mistral-7B-v0.1/ \
               --engine_dir=./mistralai/Mistral-7B-v0.1/trt_engines/fp16/1-gpu/ \
               --max_attention_window_size=4096
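
For what it's worth, a rough back-of-the-envelope sketch (illustrative numbers only, assuming a Llama-7B-like shape of 32 layers and hidden size 4096 at FP16; not taken from the build log):

    # Rough, illustrative arithmetic only; real engines add activation workspace
    # and allocator overhead on top of this.
    PARAMS = 7e9                     # 7B parameters
    BYTES_PER_PARAM = 2              # FP16
    weights_gib = PARAMS * BYTES_PER_PARAM / 2**30

    # Llama-7B-like shape: 32 layers, K and V of width 4096 each, FP16.
    kv_bytes_per_token = 32 * 2 * 4096 * 2
    seq_len = 4096                   # e.g. the attention window passed to run.py
    kv_gib = kv_bytes_per_token * seq_len / 2**30

    print(f"weights  ~ {weights_gib:.1f} GiB")          # about 13 GiB
    print(f"KV cache ~ {kv_gib:.1f} GiB per sequence")  # about 2 GiB at 4096 tokens

On a 24 GB L4 that leaves several GiB for activations and runtime overhead at run.py's settings, so one possible explanation (not confirmed here) is that the benchmark path reserves a larger activation buffer or KV cache than those settings imply.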
robmsmt commented 4 months ago

This issue no longer exists on 0.10+.