NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Nvidia GeForce RTX 4070 doesnt load llama 7b #910

Open dhruvildarji opened 10 months ago

dhruvildarji commented 10 months ago

System Info

I downloaded the Hugging Face LLaMA 7B model: https://huggingface.co/meta-llama/Llama-2-7b-chat-hf

I am trying to build an engine from it using the following command.

Build the LLaMA 7B model using a single GPU and FP16:

python build.py --model_dir ./tmp/llama/7B/ \
                --dtype float16 \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --output_dir ./tmp/llama/7B/trt_engines/fp16/1-gpu/

It fails with the following errors:

[01/17/2024-22:14:09] [TRT] [W] Tactic Device request: 16512MB Available: 12281MB. Device memory is insufficient to use tactic.
[01/17/2024-22:14:09] [TRT] [W] UNSUPPORTED_STATESkipping tactic 0 due to insufficient memory on requested size of 16512 detected for tactic 0x000000000000001d.
[01/17/2024-22:14:09] [TRT] [W] Tactic Device request: 16512MB Available: 12281MB. Device memory is insufficient to use tactic.
[01/17/2024-22:14:09] [TRT] [W] UNSUPPORTED_STATESkipping tactic 1 due to insufficient memory on requested size of 16512 detected for tactic 0x000000000000001e.
[01/17/2024-22:14:09] [TRT] [W] Tactic Device request: 16512MB Available: 12281MB. Device memory is insufficient to use tactic.
[01/17/2024-22:14:09] [TRT] [W] UNSUPPORTED_STATESkipping tactic 2 due to insufficient memory on requested size of 16512 detected for tactic 0x000000000000001f.
[01/17/2024-22:14:10] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[01/17/2024-22:14:10] [TRT] [I] Detected 73 inputs and 33 output network tensors.
[01/17/2024-22:14:18] [TRT] [E] 2: [virtualMemoryBuffer.cpp::nvinfer1::StdVirtualMemoryBufferImpl::resizePhysical::140] Error Code 2: OutOfMemory (no further information)
[01/17/2024-22:14:18] [TRT] [E] 1: [virtualMemoryBuffer.cpp::nvinfer1::StdVirtualMemoryBufferImpl::resizePhysical::127] Error Code 1: Cuda Driver (invalid argument)
[01/17/2024-22:14:18] [TRT] [W] Requested amount of GPU memory (10747904000 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.

I think this happens because my GPU has only 12 GB of VRAM (the log above shows 12281 MB available). Can you please suggest which model I should use instead? I only need to take text in and produce text out.

Or is there another way to load LLaMA 7B on my RTX 4070?
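
For a rough sense of the numbers, here is a back-of-envelope estimate of weight memory alone for a 7B-parameter model (activations, the KV cache, and TensorRT's build workspace all add on top of this):

    # Rough weight-memory estimate for a ~7B-parameter model at different precisions.
    # Weights only; runtime overhead is not included.
    PARAMS = 7e9  # approximate parameter count of LLaMA 7B
    BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

    for precision, nbytes in BYTES_PER_PARAM.items():
        gib = PARAMS * nbytes / 2**30
        print(f"{precision}: ~{gib:.1f} GiB for weights alone")

    # fp16: ~13.0 GiB -> exceeds the RTX 4070's 12 GiB before any runtime overhead
    # int8: ~6.5 GiB  -> fits, with some headroom
    # int4: ~3.3 GiB  -> fits comfortably

So an fp16 7B engine cannot fit on a 12 GiB card, which matches the OutOfMemory errors above.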

Who can help?

No response

Information

Tasks

Reproduction

Download the LLaMA 7B HF model: https://huggingface.co/meta-llama/Llama-2-7b-chat-hf

Build the LLaMA 7B model using a single GPU and FP16:

python build.py --model_dir ./tmp/llama/7B/ \
                --dtype float16 \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --output_dir ./tmp/llama/7B/trt_engines/fp16/1-gpu/

Expected behavior

The build should complete and produce a TensorRT engine.

actual behavior

The build runs out of GPU memory and cannot produce an engine for the LLaMA 7B model.

additional notes

NA

Tlntin commented 10 months ago

You can try weight-only int4 quantization, or GPTQ int4. As for other models, Qwen 1.8B may work well; it needs only about 4 GB of GPU memory.
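
For reference, a weight-only int4 build would look roughly like the following. This is a sketch, assuming the --use_weight_only and --weight_only_precision flags exposed by examples/llama/build.py in TensorRT-LLM releases of this period; check python build.py --help in your checkout to confirm the exact options.

# Sketch: same flags as the fp16 build above, plus weight-only int4 quantization.
# --use_weight_only / --weight_only_precision are assumed from the LLaMA example
# of this era; verify with `python build.py --help`.
python build.py --model_dir ./tmp/llama/7B/ \
                --dtype float16 \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --use_weight_only \
                --weight_only_precision int4 \
                --output_dir ./tmp/llama/7B/trt_engines/weight_only_int4/1-gpu/

Weight-only int4 shrinks the weights roughly 4x relative to fp16 (about 3.3 GiB instead of 13 GiB for 7B parameters), at some cost in accuracy; GPTQ int4 additionally requires a pre-quantized checkpoint.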

nv-guomingz commented 1 day ago

Hi @dhruvildarji, do you still have any further issues or questions? If not, we'll close this soon.