System Info
I downloaded the Hugging Face LLaMA 7B model (https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) and I am trying to build it for a single GPU with FP16 using build.py from the LLaMA example (the full command is under Reproduction below). It gives the following error:
[01/17/2024-22:14:09] [TRT] [W] Tactic Device request: 16512MB Available: 12281MB. Device memory is insufficient to use tactic.
[01/17/2024-22:14:09] [TRT] [W] UNSUPPORTED_STATESkipping tactic 0 due to insufficient memory on requested size of 16512 detected for tactic 0x000000000000001d.
[01/17/2024-22:14:09] [TRT] [W] Tactic Device request: 16512MB Available: 12281MB. Device memory is insufficient to use tactic.
[01/17/2024-22:14:09] [TRT] [W] UNSUPPORTED_STATESkipping tactic 1 due to insufficient memory on requested size of 16512 detected for tactic 0x000000000000001e.
[01/17/2024-22:14:09] [TRT] [W] Tactic Device request: 16512MB Available: 12281MB. Device memory is insufficient to use tactic.
[01/17/2024-22:14:09] [TRT] [W] UNSUPPORTED_STATESkipping tactic 2 due to insufficient memory on requested size of 16512 detected for tactic 0x000000000000001f.
[01/17/2024-22:14:10] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[01/17/2024-22:14:10] [TRT] [I] Detected 73 inputs and 33 output network tensors.
[01/17/2024-22:14:18] [TRT] [E] 2: [virtualMemoryBuffer.cpp::nvinfer1::StdVirtualMemoryBufferImpl::resizePhysical::140] Error Code 2: OutOfMemory (no further information)
[01/17/2024-22:14:18] [TRT] [E] 1: [virtualMemoryBuffer.cpp::nvinfer1::StdVirtualMemoryBufferImpl::resizePhysical::127] Error Code 1: Cuda Driver (invalid argument)
[01/17/2024-22:14:18] [TRT] [W] Requested amount of GPU memory (10747904000 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
I think this happens because my GPU has only 8 GB of VRAM: the FP16 weights of a 7B-parameter model alone take roughly 7 billion x 2 bytes, about 14 GB, which already exceeds my VRAM, and the log shows tactics requesting around 16 GB. Can you please suggest which model I should use? I only need to take text as input and produce text as output. Is there a smaller model I can load instead, or is there any other way to load LLaMA 7B on my RTX 4070, for example a lower-memory build as sketched below?
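For reference, this is the kind of reduced-memory build I am asking about. The --use_weight_only / --weight_only_precision flags below are taken from the weight-only quantization section of the LLaMA example README as I understand it; I have not confirmed that my installed TensorRT-LLM version supports them or that an INT4 build would actually fit in 8 GB:

# Same build as under Reproduction, plus INT4 weight-only quantization
# (flags from the LLaMA example README, not yet verified on my setup).
python build.py --model_dir ./tmp/llama/7B/ \
                --dtype float16 \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --use_weight_only \
                --weight_only_precision int4 \
                --output_dir ./tmp/llama/7B/trt_engines/weight_only/1-gpu/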
Who can help?
No response
Information
[x] The official example scripts
[ ] My own modified scripts
Tasks
[x] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Download the LLaMA 7B HF model from https://huggingface.co/meta-llama/Llama-2-7b-chat-hf (the exact command I used is shown after these steps).
Build the LLaMA 7B model using a single GPU and FP16:
python build.py --model_dir ./tmp/llama/7B/ \
                --dtype float16 \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --output_dir ./tmp/llama/7B/trt_engines/fp16/1-gpu/
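For reference, the checkpoint can be fetched into the expected path with a plain git-lfs clone (this is how I did it, from memory; access to the gated Llama 2 repo on Hugging Face is required, and any other download method should give the same files):

# Fetch the gated HF repo into the path used by build.py (needs git-lfs and an
# HF account that has accepted the Llama 2 license).
git lfs install
git clone https://huggingface.co/meta-llama/Llama-2-7b-chat-hf ./tmp/llama/7B/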
Expected behavior
It should build successfully and produce a TensorRT engine in the output directory, which I would then run roughly as sketched below.
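(Run command adapted from the LLaMA example README from memory; the exact run.py path and flags may differ between versions.)

# Hypothetical follow-up step, assuming the engine had been built successfully
# (never reached in my case).
python3 ../run.py --max_output_len=50 \
                  --tokenizer_dir ./tmp/llama/7B/ \
                  --engine_dir ./tmp/llama/7B/trt_engines/fp16/1-gpu/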
actual behavior
It cannot load the LLaMA 7B model: the engine build fails with the out-of-memory errors shown above under System Info.
additional notes
NA