QuangBK / localLLM_guidance

Local LLM ReAct Agent with Guidance

OOM - llama2 model - guidance.llm = guidance.llms.transformers.LLaMA("remyxai/ffmperative-7b", device_map="auto") #6

Open johndpope opened 1 year ago

johndpope commented 1 year ago

related - https://github.com/microsoft/guidance/issues/328

import gradio as gr
import guidance
import torch
from server.model import load_model_main
from server.tools import load_tools
from server.agent import CustomAgentGuidance

import os
os.environ["SERPER_API_KEY"] = 'REDACTED-BUT-DID-INCLUDE-IT'

MODEL_PATH = '/home/quang/working/LLMs/oobabooga_linux/text-generation-webui/models/TheBloke_wizard-mega-13B-GPTQ'
CHECKPOINT_PATH = '/home/quang/working/LLMs/oobabooga_linux/text-generation-webui/models/TheBloke_wizard-mega-13B-GPTQ/wizard-mega-13B-GPTQ-4bit-128g.no-act.order.safetensors'
DEVICE = torch.device('cuda:0')

examples = [
    ["How much is the salary of number 8 of Manchester United?"],
    ["What is the population of Congo?"],
    ["Where was the first president of South Korean born?"],
    ["What is the population of the country that won World Cup 2022?"]    
]

def greet(name):
    final_answer = custom_agent(name)
    return final_answer, final_answer['fn']

# model, tokenizer = load_model_main(MODEL_PATH, CHECKPOINT_PATH, DEVICE)
# llama = guidance.llms.Transformers(model=model, tokenizer=tokenizer, device=0)
# guidance.llm = llama 
## OVERRIDING HERE ------------------------------------->>>>>>>>>>>>>>>>>>
guidance.llm = guidance.llms.transformers.LLaMA("remyxai/ffmperative-7b", device_map="auto")

dict_tools = load_tools()

custom_agent = CustomAgentGuidance(guidance, dict_tools)

list_outputs = [gr.Textbox(lines=5, label="Reasoning"), gr.Textbox(label="Final Answer")]
demo = gr.Interface(fn=greet,
                    inputs=gr.Textbox(lines=1, label="Input Text", placeholder="Enter a question here..."),
                    outputs=list_outputs,
                    title="Demo ReAct agent with Guidance",
                    description="The source code can be found at: https://github.com/QuangBK/localLLM_guidance/",
                    examples=examples)
demo.launch(server_name="0.0.0.0", server_port=7860)

File "/home/oem/miniconda3/envs/torch2/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 216, in forward
    down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 23.69 GiB total capacity; 21.41 GiB already allocated; 4.94 MiB free; 21.61 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Error in program:  CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 23.69 GiB total capacity; 21.41 GiB already allocated; 4.94 MiB free; 21.61 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/home/oem/miniconda3/envs/torch2/lib/python3.10/site-packages/gradio/routes.py", line 442, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/oem/miniconda3/envs/torch2/lib/python3.10/site-packages/gradio/blocks.py", line 1392, in process_api
    result = await self.call_function(
  File "/home/oem/miniconda3/envs/torch2/lib/python3.10/site-packages/gradio/blocks.py", line 1097, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/oem/miniconda3/envs/torch2/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/oem/miniconda3/envs/torch2/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/home/oem/miniconda3/envs/torch2/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/home/oem/miniconda3/envs/torch2/lib/python3.10/site-packages/gradio/utils.py", line 703, in wrapper
    response = f(*args, **kwargs)
  File "/media/2TB/localLLM_guidance/app.py", line 23, in greet
    final_answer = custom_agent(name)
  File "/media/2TB/localLLM_guidance/server/agent.py", line 77, in __call__
    if result_mid['answer'] == 'Final Answer':
  File "/home/oem/miniconda3/envs/torch2/lib/python3.10/site-packages/guidance/_program.py", line 470, in __getitem__
    return self._variables[key]
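For reference, the allocator hint printed in the error can be tried by setting PYTORCH_CUDA_ALLOC_CONF before torch allocates any GPU memory, though with roughly 21.4 GiB already taken by the weights the root cause here is model size rather than fragmentation. A minimal sketch; the 128 MB split size is an illustrative value, not taken from this report:

import os

# Must be set before torch initializes the CUDA allocator (i.e., before the model loads).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # illustrative value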
QuangBK commented 1 year ago

Hi, the OOM error most likely means you don't have enough VRAM. I checked the code; it looks like you are loading the HF model, which may not be a quantized version. It consumes more VRAM than the GPTQ 4-bit version.
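A minimal sketch of loading the model in 4-bit through bitsandbytes and handing it to guidance, assuming bitsandbytes is installed and that guidance.llms.Transformers accepts a pre-loaded model and tokenizer, as in the commented-out lines above:

import torch
import guidance
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "remyxai/ffmperative-7b"

# Quantize the weights to 4-bit at load time to cut VRAM use roughly in half vs fp16.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Wrap the pre-loaded model/tokenizer, as in the commented-out Transformers call above.
guidance.llm = guidance.llms.Transformers(model=model, tokenizer=tokenizer)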

johndpope commented 1 year ago

I have 24 GB - it's only the remyxai/ffmperative-7b model.
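If the checkpoint is loaded at the default fp32 precision, a 7B model needs roughly 28 GB for the weights alone, so even a 24 GB card can run out once activations are added. A sketch of loading in half precision instead, assuming the LLaMA wrapper forwards extra keyword arguments to transformers' from_pretrained (as the device_map argument above suggests):

import torch
import guidance

# Half precision roughly halves the weight footprint (~14 GB for a 7B model vs ~28 GB in fp32).
# Assumes extra kwargs are forwarded to transformers' from_pretrained.
guidance.llm = guidance.llms.transformers.LLaMA(
    "remyxai/ffmperative-7b",
    device_map="auto",
    torch_dtype=torch.float16,
)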