NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Nvidia GeForce RTX 4070 doesnt load llama 7b #910

Open dhruvildarji opened 10 months ago

dhruvildarji commented 10 months ago

System Info

I downloaded the Hugging Face LLaMA 7B model: https://huggingface.co/meta-llama/Llama-2-7b-chat-hf

I am trying to build an engine from it using the following command.

Build the LLaMA 7B model using a single GPU and FP16:

python build.py --model_dir ./tmp/llama/7B/ \
                --dtype float16 \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --output_dir ./tmp/llama/7B/trt_engines/fp16/1-gpu/

It fails with the following errors:

[01/17/2024-22:14:09] [TRT] [W] Tactic Device request: 16512MB Available: 12281MB. Device memory is insufficient to use tactic.
[01/17/2024-22:14:09] [TRT] [W] UNSUPPORTED_STATESkipping tactic 0 due to insufficient memory on requested size of 16512 detected for tactic 0x000000000000001d.
[01/17/2024-22:14:09] [TRT] [W] Tactic Device request: 16512MB Available: 12281MB. Device memory is insufficient to use tactic.
[01/17/2024-22:14:09] [TRT] [W] UNSUPPORTED_STATESkipping tactic 1 due to insufficient memory on requested size of 16512 detected for tactic 0x000000000000001e.
[01/17/2024-22:14:09] [TRT] [W] Tactic Device request: 16512MB Available: 12281MB. Device memory is insufficient to use tactic.
[01/17/2024-22:14:09] [TRT] [W] UNSUPPORTED_STATESkipping tactic 2 due to insufficient memory on requested size of 16512 detected for tactic 0x000000000000001f.
[01/17/2024-22:14:10] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[01/17/2024-22:14:10] [TRT] [I] Detected 73 inputs and 33 output network tensors.
[01/17/2024-22:14:18] [TRT] [E] 2: [virtualMemoryBuffer.cpp::nvinfer1::StdVirtualMemoryBufferImpl::resizePhysical::140] Error Code 2: OutOfMemory (no further information)
[01/17/2024-22:14:18] [TRT] [E] 1: [virtualMemoryBuffer.cpp::nvinfer1::StdVirtualMemoryBufferImpl::resizePhysical::127] Error Code 1: Cuda Driver (invalid argument)
[01/17/2024-22:14:18] [TRT] [W] Requested amount of GPU memory (10747904000 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.

I think this happens because my GPU has only 12 GB of VRAM (the log above shows 12281 MB available). Can you please suggest which model I should use instead? I only need to take text in and produce text out.

Or is there another way to load LLaMA 7B on my RTX 4070?
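
For a rough sense of the numbers, here is a back-of-envelope estimate of weight memory alone for a 7B-parameter model (activations, the KV cache, and TensorRT's build workspace all add on top of this):

    # Rough weight-memory estimate for a ~7B-parameter model at different precisions.
    # Weights only; runtime overhead is not included.
    PARAMS = 7e9  # approximate parameter count of LLaMA 7B
    BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

    for precision, nbytes in BYTES_PER_PARAM.items():
        gib = PARAMS * nbytes / 2**30
        print(f"{precision}: ~{gib:.1f} GiB for weights alone")

    # fp16: ~13.0 GiB -> exceeds the RTX 4070's 12 GiB before any runtime overhead
    # int8: ~6.5 GiB  -> fits, with some headroom
    # int4: ~3.3 GiB  -> fits comfortably

So an fp16 7B engine cannot fit on a 12 GiB card, which matches the OutOfMemory errors above.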

Who can help?

No response

Information

Tasks

Reproduction

Download the LLaMA 7B HF model: https://huggingface.co/meta-llama/Llama-2-7b-chat-hf

Build the LLaMA 7B model using a single GPU and FP16:

python build.py --model_dir ./tmp/llama/7B/ \
                --dtype float16 \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --output_dir ./tmp/llama/7B/trt_engines/fp16/1-gpu/

Expected behavior

The build should complete and produce a TensorRT engine.

actual behavior

The build runs out of GPU memory and cannot produce an engine for the LLaMA 7B model.

additional notes

NA

Tlntin commented 10 months ago

You can try weight-only int4 quantization, or GPTQ int4. As for other models, Qwen 1.8B may work well; it needs only about 4 GB of GPU memory.
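
For reference, a weight-only int4 build would look roughly like the following. This is a sketch, assuming the --use_weight_only and --weight_only_precision flags exposed by examples/llama/build.py in TensorRT-LLM releases of this period; check python build.py --help in your checkout to confirm the exact options.

# Sketch: same flags as the fp16 build above, plus weight-only int4 quantization.
# --use_weight_only / --weight_only_precision are assumed from the LLaMA example
# of this era; verify with `python build.py --help`.
python build.py --model_dir ./tmp/llama/7B/ \
                --dtype float16 \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --use_weight_only \
                --weight_only_precision int4 \
                --output_dir ./tmp/llama/7B/trt_engines/weight_only_int4/1-gpu/

Weight-only int4 shrinks the weights roughly 4x relative to fp16 (about 3.3 GiB instead of 13 GiB for 7B parameters), at some cost in accuracy; GPTQ int4 additionally requires a pre-quantized checkpoint.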

nv-guomingz commented 1 day ago

Hi @dhruvildarji, do you still have any further issues or questions? If not, we'll close this soon.