CUDA out of memory Error - decrease batch size (?)

Hi! I am trying to run the inference.py script with the "testing" inputs, but I'm having memory issues. I'm running it locally, on a computer with GPU NVIDIA GeForce RTX 3060 (max 8192MiB). I'm using the llama2-7B model. When I run python inference.py -i ./testing/input/ -o ./testing/output/ I get the following messages

[nltk_data] Downloading package punkt to /home/manuela/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/manuela/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:07<00:00,  3.57s/it]
WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu.
WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu.
Loading BioSent2Vec model
model successfully loaded
start phenogpt
CUDA out of memory. Tried to allocate 500.00 MiB. GPU 
Cannot produce results for sample2
CUDA out of memory. Tried to allocate 172.00 MiB. GPU 
Cannot produce results for sample1

Is my GPU memory not enough to run this locally? Is there any way you can help me?

Thanks in advance and congratulations on the amazing paper and results!

WGLab / PhenoGPT

CUDA out of memory Error - decrease batch size (?) #9