explodinggradients / ragas

Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
https://docs.ragas.io
Apache License 2.0

Extremely high GPU memory consumption for evaluation with custom LLM (over 60 GB) #645

Open Johncrtz opened 7 months ago

Johncrtz commented 7 months ago

Hello, I created a testset and ran it through my RAG pipeline to get documents and an answer for each question. I now have 50 tuples of [question, ground_truth, documents, answer] that I want to compute context_recall for.
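
For reference, a minimal sketch of how such a dataset might be assembled for ragas (column names depend on the ragas version, with older releases expecting ground_truths instead of ground_truth; the placeholder lists below stand in for my actual pipeline output):

from datasets import Dataset

# Placeholder lists; in practice these come out of the RAG pipeline.
questions = ["..."] * 50
ground_truths = ["..."] * 50
contexts = [["..."]] * 50   # one list of retrieved documents per question
answers = ["..."] * 50

dataset = Dataset.from_dict({
    "question": questions,
    "ground_truth": ground_truths,
    "contexts": contexts,
    "answer": answers,
})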

Code for my custom LLM:

from torch import cuda, bfloat16 
from torch import nn
import transformers
from transformers import BitsAndBytesConfig
from transformers import LlamaTokenizer
import os
from langchain.llms import HuggingFacePipeline 

model_id = 'meta-llama/Llama-2-7b-chat-hf' #'HuggingFaceH4/zephyr-7b-alpha' 
hf_auth = '...' 
os.environ['OPENAI_API_KEY'] = "..."

bnb_config = transformers.BitsAndBytesConfig( load_in_4bit=True, 
                                             bnb_4bit_quant_type='nf4',
                                             bnb_4bit_use_double_quant=True, 
                                             bnb_4bit_compute_dtype=bfloat16 ) # begin initializing HF items, need auth token for these 

model_config = transformers.AutoConfig.from_pretrained( model_id, token=hf_auth )

model = transformers.AutoModelForCausalLM.from_pretrained(model_id, 
                                                          trust_remote_code=True,
                                                          config=model_config,
                                                          quantization_config=bnb_config,
                                                          device_map="auto",
                                                          token=hf_auth ) 

tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf",token=hf_auth)

generate_text = transformers.pipeline( model=model, tokenizer=tokenizer,
                                      return_full_text=True, # langchain expects the full text
                                      task='text-generation', # we pass model parameters here too 
                                      temperature=0.01, # 'randomness' of outputs, 0.0 is the min and 1.0 the max
                                      max_new_tokens=512,
                                      repetition_penalty=1.1, # without this output begins repeating
                                      use_cache=True,
                                      #num_return_sequences=1,
                                      #eos_token_id=tokenizer.eos_token_id, 
                                      #pad_token_id=tokenizer.eos_token_id, 
                                     )

llm = HuggingFacePipeline(pipeline=generate_text)

After that I run the evaluation:

from ragas import evaluate
from ragas.metrics import (
    context_recall,
)

result = evaluate(
    dataset,
    metrics=[
        context_recall,
    ],
    llm=llm,
)

result

It runs for a while and then ends up with the following error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 44.00 MiB. GPU 1 has a total capacity of 31.74 GiB of which 29.44 MiB is free. Including non-PyTorch memory, this process has 31.70 GiB memory in use. Of the allocated memory 27.44 GiB is allocated by PyTorch, and 3.65 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

I tried to customize the PyTorch memory allocator config to make it more efficient, however this did not change anything:

import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "backend:native"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "garbage_collection_threshold:0.8"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "roundup_power2_divisions:[256:1,512:2,1024:4,>:8]"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:64"

Here you can see my memory allocation:

Device 0: Tesla V100-SXM2-32GB
  Total memory: 31.74 GB
  Allocated memory: 22.81 GB
  Cached memory: 31.10 GB
Device 1: Tesla V100-SXM2-32GB
  Total memory: 31.74 GB
  Allocated memory: 27.59 GB
  Cached memory: 31.11 GB
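
For completeness, a per-device readout like the one above can be produced along these lines (illustrative, not necessarily the exact code I used):

import torch

for i in range(torch.cuda.device_count()):
    total = torch.cuda.get_device_properties(i).total_memory
    allocated = torch.cuda.memory_allocated(i)
    cached = torch.cuda.memory_reserved(i)  # memory held by the caching allocator
    print(f"Device {i}: {torch.cuda.get_device_name(i)}")
    print(f"  Total memory: {total / 1024**3:.2f} GB")
    print(f"  Allocated memory: {allocated / 1024**3:.2f} GB")
    print(f"  Cached memory: {cached / 1024**3:.2f} GB")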

Is there any way to run the evaluation with a custom LLM without consuming ungodly amounts of memory? IMO 50 questions is not that many and I just expected it to work. Does someone know how to handle this?

SuperYG1991 commented 6 months ago

No wonder my notebook kernel always dies when I run the ragas evaluation with llamaCPP.

husongjiang commented 4 months ago

I am only using a single data sample and testing a single metric, and it still sometimes runs out of memory. It looks like inference requests are issued concurrently during the computation. I tried changing the code but did not succeed.
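
If concurrent metric computation is indeed the problem, limiting the number of parallel workers might help; depending on the installed ragas version, something like the following should be possible (RunConfig and max_workers are assumptions about the release in use):

from ragas import evaluate
from ragas.run_config import RunConfig
from ragas.metrics import context_recall

# Limit concurrent LLM calls so only one generation runs at a time.
result = evaluate(
    dataset,
    metrics=[context_recall],
    llm=llm,
    run_config=RunConfig(max_workers=1),
)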

LorenzoGalizia commented 4 months ago

Hello, I am encountering the same problem running the evaluation on RAGAS with Llama3 as an open-source evaluator model.

Did you find any solution?