Closed: robmsmt closed this issue 4 months ago.
Hi @robmsmt, the log you shared contains the following line:
RuntimeError: CUDA ERROR: 2, error code reference: https://nvidia.github.io/cuda-python/module/cudart.html#cuda.cudart.cudaError_t
According to the reference linked in that message, error code 2 means:
cudaErrorMemoryAllocation = 2: The API call failed because it was unable to allocate enough memory or other resources to perform the requested operation.
So I think this is an OOM issue: there is not enough memory for activations. If you want more detail on the memory usage of that engine, please take a look at the build log with the log level set to info.
Hope that helps, thank you.
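As an illustration of that suggestion, a minimal sketch (assuming the --log_level option of trtllm-build is available in this TensorRT-LLM version; build.log is a file name I chose, not one from the thread):

# Rebuild the engine with info-level logging and keep the output, so the
# memory the engine reserves for weights and activations can be inspected.
trtllm-build --checkpoint_dir ./mistralai/Mistral-7B-v0.1/trt_ckpt/fp16/1-gpu/ \
    --gemm_plugin float16 \
    --output_dir ./mistralai/Mistral-7B-v0.1/trt_engines/fp16/1-gpu/ \
    --max_input_len 32256 \
    --log_level info 2>&1 | tee build.log
# Then search the saved log for the reported memory figures.
grep -i "memory" build.log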
@robmsmt Any reason you are passing -m llama_7b to benchmark.py for a Mistral engine?
python3 ../../benchmarks/python/benchmark.py -m mistral_7b --mode plugin --batch_size "1" --input_output_len "1,1"
benchmark.py: error: argument -m/--model: invalid choice: 'mistral_7b' (choose from 'bloom_176b', 'flan_t5_small', 'bloom_560m', 'bert_large', 'flan_t5_xl', 'opt_6.7b', 'gpt_next_2b', 'llama_7b', 'gptj_6b', 'opt_66b', 'mamba_370m', 'mamba_2.8b', 'flan_t5_large', 'baichuan2_13b_chat', 'chatglm_6b', 'falcon_rw_1b', 'internlm_chat_7b', 'gpt_350m_moe', 'mbart_large_50_many_to_one_mmt', 'flan_t5_base', 'mixtral_8x7b', 'chatglm2_6b', 'opt_350m', 'falcon_40b', 'gptneox_20b', 't5_3b', 'bert_base', 'falcon_180b', 'llama_13b', 'baichuan_7b', 'baichuan_13b_chat', 'qwen_7b_chat', 'mamba_790m', 'internlm_chat_20b', 'mamba_130m', 'roberta_base', 'llama_30b', 'opt_2.7b', 'qwen_14b_chat', 'whisper_large_v3', 'gpt_350m', 'bart_large_cnn', 'llama_70b_long_context', 'llama_70b_long_generation', 'gpt_1.5b', 'falcon_7b', 't5_11b', 't5_base', 'flan_t5_xxl', 'llama_70b', 'mamba_1.4b', 'gpt_350m_sq_per_token_channel', 'chatglm3_6b', 't5_large', 't5_small', 'baichuan2_7b_chat', 'gpt_175b', 'gpt_350m_sq_per_tensor', 'llama_70b_sq_per_tensor')
There is no Mistral benchmark. I just used Mistral as an example, but llama2 gives the same error (it is slightly harder to reproduce, though, since you need to get approval for the model from Meta on HF).
Thanks for the reply @kaiyux. For a 7B model (llama2 or Mistral) at FP16, 24GB of VRAM should be enough GPU memory. Do you agree?
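For context, a rough back-of-envelope calculation (my own assumption, not a figure from the thread): FP16 stores 2 bytes per parameter, so the weights of a 7B model alone take roughly 13 GiB, leaving only about 10 GB of a 24 GB card for activations, KV cache and the CUDA context.

# Hypothetical back-of-envelope check, not from the thread.
python3 -c "print(f'{7e9 * 2 / 1024**3:.1f} GiB for FP16 weights of a 7B model')"
# ~13.0 GiB; the rest of a 24 GB card must cover activations and KV cache.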
Also, why does it run fine with:
python3 ../run.py --max_output_len=20 \
--tokenizer_dir=./mistralai/Mistral-7B-v0.1/ \
--engine_dir=./mistralai/Mistral-7B-v0.1/trt_engines/fp16/1-gpu/ \
--max_attention_window_size=4096
This issue no longer exists on 0.10+.
System Info
Using defaults from repo:
Note that the nodepool driver is automatically installed with:
+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage       |
|=======================================================================================|
|  No running processes found                                                            |
+---------------------------------------------------------------------------------------+
Mistral uses the llama example directory:
cd examples/llama
rm -rf ./mistralai/Mistral-7B-v0.1
mkdir -p ./mistralai/Mistral-7B-v0.1 && git clone https://huggingface.co/mistralai/Mistral-7B-v0.1 ./mistralai/Mistral-7B-v0.1
python3 convert_checkpoint.py --model_dir ./mistralai/Mistral-7B-v0.1/ \
    --dtype float16 \
    --output_dir ./mistralai/Mistral-7B-v0.1/trt_ckpt/fp16/1-gpu/
trtllm-build --checkpoint_dir ./mistralai/Mistral-7B-v0.1/trt_ckpt/fp16/1-gpu/ \
    --gemm_plugin float16 \
    --output_dir ./mistralai/Mistral-7B-v0.1/trt_engines/fp16/1-gpu/ \
    --max_input_len 32256
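(Editorial aside, not part of the original report: activation memory is sized at build time and grows with the maximum shapes, so a hedged variant with a smaller --max_input_len than 32256 may produce an engine whose device memory fits next to the FP16 weights on a 24 GB card. The fp16-short output directory below is a hypothetical name.)

# Hypothetical rebuild with a smaller max input length to reduce the
# activation memory the engine reserves; not a command from this issue.
trtllm-build --checkpoint_dir ./mistralai/Mistral-7B-v0.1/trt_ckpt/fp16/1-gpu/ \
    --gemm_plugin float16 \
    --output_dir ./mistralai/Mistral-7B-v0.1/trt_engines/fp16-short/1-gpu/ \
    --max_input_len 4096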
WORKS
python3 ../run.py --max_output_len=20 \
    --tokenizer_dir=./mistralai/Mistral-7B-v0.1/ \
    --engine_dir=./mistralai/Mistral-7B-v0.1/trt_engines/fp16/1-gpu/ \
    --max_attention_window_size=4096
WORKS
python3 ../summarize.py --test_trt_llm \
    --hf_model_dir ./mistralai/Mistral-7B-v0.1/ \
    --data_type fp16 \
    --engine_dir ./mistralai/Mistral-7B-v0.1/trt_engines/fp16/1-gpu/
root@l4-trt-llm-rob-model-669fd852rzqk:/TensorRT-LLM/examples/llama# python3 ../../benchmarks/python/benchmark.py -m llama_7b --mode plugin --batch_size "1" --input_output_len "1,1"
[TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024020600
Traceback (most recent call last):
  File "/TensorRT-LLM/examples/llama/../../benchmarks/python/benchmark.py", line 405, in <module>
    main(args)
  File "/TensorRT-LLM/examples/llama/../../benchmarks/python/benchmark.py", line 299, in main
    benchmarker = GPTBenchmark(args, batch_size_options, in_out_len_options,
  File "/TensorRT-LLM/benchmarks/python/gpt_benchmark.py", line 166, in __init__
    self.decoder = tensorrt_llm.runtime.GenerationSession(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 492, in __init__
    self.runtime = _Runtime(engine_buffer, mapping)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 155, in __init__
    self.prepare(mapping, engine_buffer)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 178, in prepare
    address = CUASSERT(cudart.cudaMalloc(self.engine.device_memory_size))[0]
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 104, in CUASSERT
    raise RuntimeError(
RuntimeError: CUDA ERROR: 2, error code reference: https://nvidia.github.io/cuda-python/module/cudart.html#cuda.cudart.cudaError_t
Exception ignored in: <function _Runtime.__del__ at 0x7a1f697c0af0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 282, in __del__
    cudart.cudaFree(self.address)  # FIXME: cudaFree is None??
AttributeError: '_Runtime' object has no attribute 'address'
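The failing call is cudart.cudaMalloc(self.engine.device_memory_size), so one quick way to check the OOM hypothesis is to compare the free memory on the card with the activation memory the engine reports at build time. A minimal diagnostic sketch (my own, not from the thread):

# Hypothetical check: report free vs. total VRAM just before the benchmark.
# cudaMalloc(engine.device_memory_size) fails with cudaErrorMemoryAllocation (2)
# when the engine's activation memory exceeds the free memory shown here.
nvidia-smi --query-gpu=memory.free,memory.total --format=csv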