Closed: sjay8 closed this 2 months ago
Hello! I'm running NeMo Guardrails on Google Colab using the T4 GPU. However, when I deploy Lynx 70B using this command:

!python -m vllm.entrypoints.openai.api_server --port 5000 --model 'PatronusAI/Patronus-Lynx-70B-Instruct'

I get a CUDA out-of-memory error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU

Does anyone know what I can do?

@varjoshi: can you provide some guidance? Thanks!
You cannot load a 70B model on a T4 with 16 GB of VRAM: in fp16, the weights of a 70B model alone take roughly 140 GB.
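As a rough back-of-envelope sketch (my own illustration, not from this thread): the dominant cost is simply parameter count times bytes per parameter, and the KV cache, activations, and CUDA context all add memory on top of that.

def weights_vram_gb(params_billion: float, bytes_per_param: float = 2) -> float:
    # (billions of parameters) * (bytes per parameter) = GB of weights
    return params_billion * bytes_per_param

print(weights_vram_gb(70))       # 140.0 GB in fp16 -- far beyond a 16 GB T4
print(weights_vram_gb(70, 0.5))  # 35.0 GB even at 4-bit -- still too big for one T4
print(weights_vram_gb(8, 1))     # 8.0 GB for an 8B model at 8-bit -- can fit on a T4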
Some guidance on VRAM size vs. model size (written for Llama 3.1, but it is similar for other models) is here: https://huggingface.co/blog/llama31
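If you need to stay on a single T4, the realistic option is a much smaller checkpoint. Even an 8B model in fp16 needs about 16 GB for the weights alone, so you would want a pre-quantized (e.g. AWQ) build or a smaller model; Patronus also publishes an 8B Lynx variant, so check the Hugging Face hub for its exact id and available quantizations. The sketch below uses a placeholder model id, and the flag values are starting points to tune, not verified settings:

# Placeholder model id -- substitute a real AWQ-quantized ~8B checkpoint.
# --dtype half is needed on a T4 (compute capability 7.5, no bfloat16 support);
# --max-model-len caps the KV cache; --gpu-memory-utilization bounds vLLM's share.
!python -m vllm.entrypoints.openai.api_server \
    --port 5000 \
    --model 'your-org/your-8B-AWQ-model' \
    --quantization awq \
    --dtype half \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.90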