artidoro / qlora

QLoRA: Efficient Finetuning of Quantized LLMs
https://arxiv.org/abs/2305.14314
MIT License

How much GPU memory is needed for inference with guanaco-13b? #158

Open JustinZou1 opened 1 year ago

JustinZou1 commented 1 year ago

I am using an A10, which has 24 GB of GPU memory. When I try to run inference with guanaco-13b, I hit an out-of-memory (OOM) error.
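A rough estimate of the footprint: 13B parameters in bfloat16 take about 13e9 × 2 bytes ≈ 26 GB for the weights alone, already more than the A10's 24 GB before activations and the KV cache are counted, so a full bf16 load of a 13B model is expected to OOM on this card.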


Here is the inference code I use to load the model:

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaTokenizer, StoppingCriteria, StoppingCriteriaList, TextIteratorStreamer

# Imports for the gradio demo.
import datetime
import os
from threading import Event, Thread
from uuid import uuid4

import gradio as gr
import requests

model_name = "/home/ubuntu/ChatGPT/Models/meta/llama-13b-hf"
adapters_name = '/home/ubuntu/ChatGPT/Models/timdettmers/guanaco-13b'

print(f"Starting to load the model {model_name} into memory")

# Load the base model in bfloat16 on GPU 0 (4-bit loading is commented out).
m = AutoModelForCausalLM.from_pretrained(
    model_name,
    # load_in_4bit=True,
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
)
# Attach the Guanaco LoRA adapter and merge it into the base weights.
m = PeftModel.from_pretrained(m, adapters_name)
m = m.merge_and_unload()

tok = LlamaTokenizer.from_pretrained(model_name)
tok.bos_token_id = 1

stop_token_ids = [0]

print(f"Successfully loaded the model {model_name} into memory")
eleluong commented 1 year ago

You can use CT2 (CTranslate2) for inference. Here is example code for that: https://github.com/Actable-AI/llm-utils/tree/main/qlora2ct2
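
For anyone following that link, the overall flow is: merge the adapter into the base model, convert the result with CTranslate2's converter, then generate. A minimal sketch assuming ctranslate2 is installed and the merged model was saved to ./guanaco-13b-merged (the paths here are hypothetical):

# Convert once from the command line (int8 keeps a 13B model well under 24 GB):
#   ct2-transformers-converter --model ./guanaco-13b-merged \
#       --output_dir ./guanaco-13b-ct2 --quantization int8_float16

import ctranslate2
from transformers import LlamaTokenizer

generator = ctranslate2.Generator("./guanaco-13b-ct2", device="cuda")
tok = LlamaTokenizer.from_pretrained("/home/ubuntu/ChatGPT/Models/meta/llama-13b-hf")

prompt = "### Human: How much memory does guanaco-13b need?### Assistant:"
# CTranslate2 takes token strings rather than token ids.
tokens = tok.convert_ids_to_tokens(tok.encode(prompt))

results = generator.generate_batch([tokens], max_length=256, sampling_topk=1)
print(tok.decode(results[0].sequences_ids[0]))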