I am using an A10, which has 24 GB of GPU memory. When I try to run inference with guanaco-13b, I get an out-of-memory (OOM) error.
Here is the code I use to load the model:
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaTokenizer, StoppingCriteria, StoppingCriteriaList, TextIteratorStreamer
# Setup the gradio Demo.
import datetime
import os
from threading import Event, Thread
from uuid import uuid4
import gradio as gr
import requests
model_name = "/home/ubuntu/ChatGPT/Models/meta/llama-13b-hf"
adapters_name = '/home/ubuntu/ChatGPT/Models/timdettmers/guanaco-13b'
print(f"Starting to load the model {model_name} into memory")
m = AutoModelForCausalLM.from_pretrained(
    model_name,
    #load_in_4bit=True,
    torch_dtype=torch.bfloat16,
    device_map={"": 0}
)
m = PeftModel.from_pretrained(m, adapters_name)
m = m.merge_and_unload()
tok = LlamaTokenizer.from_pretrained(model_name)
tok.bos_token_id = 1
stop_token_ids = [0]
print(f"Successfully loaded the model {model_name} into memory")
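For reference, here is a rough back-of-the-envelope estimate of the weight memory for this setup. It is only a sketch: it assumes a nominal 13 billion parameters (the exact count for llama-13b is slightly different), counts weights only, and ignores activations, the KV cache, the CUDA context, and any quantization overhead. The helper name `weight_memory_gib` is made up for illustration.

```python
def weight_memory_gib(n_params: float, bytes_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bytes_per_param / 1024**3


N_PARAMS = 13e9          # assumption: nominal parameter count of a 13B model
GPU_MEMORY_GIB = 24      # A10, as stated above

bf16_gib = weight_memory_gib(N_PARAMS, 2)    # torch.bfloat16: 2 bytes per parameter
int4_gib = weight_memory_gib(N_PARAMS, 0.5)  # 4-bit: about half a byte per parameter

print(f"bf16 weights:  ~{bf16_gib:.1f} GiB (GPU has {GPU_MEMORY_GIB} GiB)")
print(f"4-bit weights: ~{int4_gib:.1f} GiB (GPU has {GPU_MEMORY_GIB} GiB)")
```

Under these assumptions the bfloat16 weights alone are already slightly larger than the 24 GB on the card, while a 4-bit load would need roughly a quarter of that, which is consistent with the OOM appearing when `load_in_4bit=True` is commented out.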