NVIDIA / NeMo-Guardrails

NeMo Guardrails is an open-source toolkit for easily adding programmable guardrails to LLM-based conversational systems.

Using Lynx 70B: CUDA out of memory #658

Closed: sjay8 closed this issue 2 months ago

sjay8 commented 4 months ago

Hello! I'm running NeMo Guardrails on Google Colab with a T4 GPU. However, when I deploy Lynx 70B with this command:

!python -m vllm.entrypoints.openai.api_server --port 5000 --model 'PatronusAI/Patronus-Lynx-70B-Instruct'

I get a CUDA out-of-memory error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU

Does anyone know what I can do?

drazvan commented 3 months ago

@varjoshi: Can you provide some guidance? Thanks!

trebedea commented 2 months ago

You cannot load a 70B model on a T4 with 16 GB of VRAM. Some guidance on VRAM requirements vs. model size (written for Llama 3.1, but similar for other models) is here: https://huggingface.co/blog/llama31
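
For a rough sense of scale, here is a back-of-the-envelope sketch (not an exact figure, and it ignores KV cache and activation memory): a 70B-parameter model already needs on the order of 140 GB just for its FP16/BF16 weights, which is far beyond a single 16 GB T4.

```python
# Back-of-the-envelope VRAM estimate for model weights alone.
# Assumes 2 bytes per parameter (FP16/BF16) and ignores KV cache and
# activations, so real usage is higher. Model sizes are illustrative.
def weight_vram_gb(num_params_billion: float, bytes_per_param: int = 2) -> float:
    return num_params_billion * 1e9 * bytes_per_param / 1024**3

for name, params_b in [("Lynx 70B", 70.0), ("Lynx 8B", 8.0)]:
    print(f"{name}: ~{weight_vram_gb(params_b):.0f} GB of VRAM for weights")

# Lynx 70B: ~130 GB for weights -> needs several large GPUs (tensor parallelism)
# Lynx 8B:  ~15 GB for weights  -> roughly at the limit of a 16 GB T4;
#                                  8-bit or 4-bit quantization gives headroom
```

In practice, on a single T4 a smaller variant (Patronus also publishes an 8B Lynx Instruct model) or a quantized model is a more realistic target; serving the 70B model generally requires multiple larger GPUs with tensor parallelism (vLLM's --tensor-parallel-size option).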