artidoro / qlora

QLoRA: Efficient Finetuning of Quantized LLMs
https://arxiv.org/abs/2305.14314
MIT License

How much GPU memory is needed for inference with guanaco-13b? #158

Open JustinZou1 opened 1 year ago

JustinZou1 commented 1 year ago

I am using an A10, which has 24 GB of GPU memory. When I try to run inference with guanaco-13b, I hit an out-of-memory (OOM) error.
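A rough estimate of the footprint: 13B parameters in bfloat16 take about 13e9 × 2 bytes ≈ 26 GB for the weights alone, already more than the A10's 24 GB before activations and the KV cache are counted, so a full bf16 load of a 13B model is expected to OOM on this card.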


Here is the inference code I use to load the model:

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaTokenizer, StoppingCriteria, StoppingCriteriaList, TextIteratorStreamer

# Imports for the gradio demo.
import datetime
import os
from threading import Event, Thread
from uuid import uuid4

import gradio as gr
import requests

model_name = "/home/ubuntu/ChatGPT/Models/meta/llama-13b-hf"
adapters_name = '/home/ubuntu/ChatGPT/Models/timdettmers/guanaco-13b'

print(f"Starting to load the model {model_name} into memory")

# Load the base model in bfloat16 on GPU 0 (4-bit loading is commented out).
m = AutoModelForCausalLM.from_pretrained(
    model_name,
    # load_in_4bit=True,
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
)
# Attach the Guanaco LoRA adapter and merge it into the base weights.
m = PeftModel.from_pretrained(m, adapters_name)
m = m.merge_and_unload()

tok = LlamaTokenizer.from_pretrained(model_name)
tok.bos_token_id = 1

stop_token_ids = [0]

print(f"Successfully loaded the model {model_name} into memory")
eleluong commented 1 year ago

You can use CT2 (CTranslate2) for inference. Here is example code for that: https://github.com/Actable-AI/llm-utils/tree/main/qlora2ct2
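
For anyone following that link, the overall flow is: merge the adapter into the base model, convert the result with CTranslate2's converter, then generate. A minimal sketch assuming ctranslate2 is installed and the merged model was saved to ./guanaco-13b-merged (the paths here are hypothetical):

# Convert once from the command line (int8 keeps a 13B model well under 24 GB):
#   ct2-transformers-converter --model ./guanaco-13b-merged \
#       --output_dir ./guanaco-13b-ct2 --quantization int8_float16

import ctranslate2
from transformers import LlamaTokenizer

generator = ctranslate2.Generator("./guanaco-13b-ct2", device="cuda")
tok = LlamaTokenizer.from_pretrained("/home/ubuntu/ChatGPT/Models/meta/llama-13b-hf")

prompt = "### Human: How much memory does guanaco-13b need?### Assistant:"
# CTranslate2 takes token strings rather than token ids.
tokens = tok.convert_ids_to_tokens(tok.encode(prompt))

results = generator.generate_batch([tokens], max_length=256, sampling_topk=1)
print(tok.decode(results[0].sequences_ids[0]))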