explodinggradients / ragas

Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
https://docs.ragas.io
Apache License 2.0

ValueError: a cannot be empty unless no samples are taken #1087

Open Rugved2204 opened 1 month ago

Rugved2204 commented 1 month ago

[ ] I have checked the documentation and related resources and couldn't resolve my bug.

Describe the bug
ValueError: a cannot be empty unless no samples are taken

Ragas version: 0.1.10
Python version: 3.10.12

Code to Reproduce

# Imports assumed from context (PubMedLoader / HuggingFaceEmbeddings come from LangChain;
# load_model below is a local helper that loads the Mistral model)
from langchain_community.document_loaders import PubMedLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

loader = PubMedLoader("liver", load_max_docs=10)
documents = loader.load()

embedding_model_name = 'BAAI/bge-small-en-v1.5'
embeddings = HuggingFaceEmbeddings(
    model_name=embedding_model_name,
    model_kwargs={'device': 'cuda:0'}
)

model_name = 'mistralai/Mistral-7B-Instruct-v0.2'

response_generation_llm = load_model(model_name, 0.1, 1024)

generator = TestsetGenerator.from_langchain(
    response_generation_llm,
    response_generation_llm,
    embeddings
)

testset = generator.generate_with_langchain_docs(documents, test_size=2, distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25})

Error trace

File "/home/ubuntu/scp-analyzer/rag-old/test_ragas.py", line 31, in <module>
    testset = generator.generate_with_langchain_docs(documents, test_size=2, distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25})
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ragas/testset/generator.py", line 210, in generate_with_langchain_docs
    return self.generate(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ragas/testset/generator.py", line 279, in generate
    for n in self.docstore.get_random_nodes(k=test_size)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ragas/testset/docstore.py", line 328, in get_random_nodes
    nodes = rng.choice(np.array(self.nodes), size=k, p=prob).tolist()
  File "numpy/random/_generator.pyx", line 803, in numpy.random._generator.Generator.choice
ValueError: a cannot be empty unless no samples are taken

Expected behavior
It should have generated the synthetic testset. The nodes are embedded successfully, but the error above is raised during generation.


dosubot[bot] commented 1 month ago

To resolve the ValueError: a cannot be empty unless no samples are taken error when generating a synthetic testset using the Ragas library, ensure that the test_size parameter is greater than zero and that the docstore contains enough documents to sample from. The error typically occurs when the list of documents to sample from is empty.

Here is the relevant part of the code where the error might be occurring:

current_nodes = [
    CurrentNodes(root_node=n, nodes=[n])
    for n in self.docstore.get_random_nodes(k=test_size)
]

Make sure that self.docstore.get_random_nodes(k=test_size) returns a non-empty list. If test_size is greater than the number of available documents, this could lead to the error.
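As a quick sanity check, you can confirm that the docstore actually holds nodes to sample from before generation runs. This is a minimal sketch, assuming documents have already been added and that the generator exposes its docstore via generator.docstore (as later snippets in this thread do):

# Pre-flight check (sketch): confirm the docstore holds at least `test_size` nodes.
test_size = 2
num_nodes = len(generator.docstore.nodes)
print(f"Docstore currently holds {num_nodes} nodes")
if num_nodes < test_size:
    raise ValueError(
        f"Docstore has only {num_nodes} nodes; cannot sample test_size={test_size}."
    )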

Here is a potential fix to ensure that test_size is valid and that there are enough documents in the docstore:

def generate(
    self,
    test_size: int,
    distributions: t.Optional[Distributions] = None,
    with_debugging_logs=False,
    is_async: bool = True,
    raise_exceptions: bool = True,
    run_config: t.Optional[RunConfig] = None,
):
    distributions = distributions or DEFAULT_DISTRIBUTION
    # validate distributions
    if not check_if_sum_is_close(list(distributions.values()), 1.0, 3):
        raise ValueError(
            f"distributions passed do not sum to 1.0 [got {sum(list(distributions.values()))}]. Please check the "
            f"distributions."
        )

    # configure run_config for docstore
    if run_config is None:
        run_config = RunConfig(max_retries=15, max_wait=90)
    self.docstore.set_run_config(run_config)

    # init filters and evolutions
    for evolution in distributions:
        self.init_evolution(evolution)
        evolution.init(is_async=is_async, run_config=run_config)

    if with_debugging_logs:
        from ragas.utils import patch_logger

        patch_logger("ragas.testset.evolutions", logging.DEBUG)
        patch_logger("ragas.testset.extractor", logging.DEBUG)
        patch_logger("ragas.testset.filters", logging.DEBUG)
        patch_logger("ragas.testset.docstore", logging.DEBUG)
        patch_logger("ragas.llms.prompt", logging.DEBUG)

    exec = Executor(
        desc="Generating",
        keep_progress_bar=True,
        raise_exceptions=raise_exceptions,
        run_config=run_config,
    )

    available_nodes = self.docstore.get_random_nodes(k=test_size)
    if not available_nodes:
        raise ValueError("No documents available in the docstore to sample from.")

    current_nodes = [
        CurrentNodes(root_node=n, nodes=[n])
        for n in available_nodes
    ]
    total_evolutions = 0
    for evolution, probability in distributions.items():
        for i in sample(range(test_size), round(probability * test_size)):
            exec.submit(
                evolution.evolve,
                current_nodes[i],
                name=f"{evolution.__class__.__name__}-{i}",
            )
            total_evolutions += 1
    if total_evolutions <= test_size:
        filler_evolutions = choices(
            list(distributions), k=test_size - total_evolutions
        )
        for evolution in filler_evolutions:
            exec.submit(
                evolution.evolve,
                current_nodes[total_evolutions],
                name=f"{evolution.__class__.__name__}-{total_evolutions}",
            )
            total_evolutions += 1

    try:
        test_data_rows = exec.results()
        if not test_data_rows:
            raise ExceptionInRunner()

    except ValueError as e:
        raise e
    # make sure to ignore any NaNs that might have been returned
    # due to failed evolutions. MaxRetriesExceeded is a common reason
    test_data_rows = [r for r in test_data_rows if not is_nan(r)]
    test_dataset = TestDataset(test_data=test_data_rows)
    evol_lang = [get_feature_language(e) for e in distributions]
    evol_lang = [e for e in evol_lang if e is not None]
    track(
        TestsetGenerationEvent(
            event_type="testset_generation",
            evolution_names=[e.__class__.__name__.lower() for e in distributions],
            evolution_percentages=[distributions[e] for e in distributions],
            num_rows=len(test_dataset.test_data),
            language=evol_lang[0] if len(evol_lang) > 0 else "",
        )
    )

    return test_dataset

This code ensures that there are enough documents in the docstore before attempting to generate the testset [1].

To continue talking to Dosu, mention @dosu.

Rugved2204 commented 1 month ago

@dosu

Traceback (most recent call last):
  File "/home/ubuntu/scp-analyzer/rag-old/test_ragas.py", line 31, in <module>
    testset = generator.generate_with_langchain_docs(documents, test_size=2, distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25})
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ragas/testset/generator.py", line 210, in generate_with_langchain_docs
    return self.generate(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ragas/testset/generator.py", line 277, in generate
    available_nodes = self.docstore.get_random_nodes(k=test_size)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ragas/testset/docstore.py", line 328, in get_random_nodes
    nodes = rng.choice(np.array(self.nodes), size=k, p=prob).tolist()
  File "numpy/random/_generator.pyx", line 803, in numpy.random._generator.Generator.choice
ValueError: a cannot be empty unless no samples are taken

The error still persists even after changing the code.

dosubot[bot] commented 1 month ago

The error persists because the self.docstore.get_random_nodes(k=test_size) method is still returning an empty list. This indicates that there are no nodes available in the docstore to sample from. Here are a few steps to diagnose and resolve the issue:

  1. Check Document Loading: Ensure that the PubMedLoader is correctly loading documents and that these documents are being added to the docstore.

  2. Verify Node Addition: Confirm that the nodes are being correctly added to the docstore. You can add some debug prints to check the number of nodes in the docstore after loading documents.

  3. Inspect get_random_nodes Method: Review the get_random_nodes method to ensure it is functioning correctly and that the probability distribution is valid.

Here is a modified version of your script with additional debug prints to help diagnose the issue:

loader = PubMedLoader("liver", load_max_docs=10)
documents = loader.load()

print(f"Loaded {len(documents)} documents")

embedding_model_name = 'BAAI/bge-small-en-v1.5'
embeddings = HuggingFaceEmbeddings(
    model_name=embedding_model_name,
    model_kwargs={'device': 'cuda:0'}
)

model_name = 'mistralai/Mistral-7B-Instruct-v0.2'
response_generation_llm = load_model(model_name, 0.1, 1024)

generator = TestsetGenerator.from_langchain(
    response_generation_llm,
    response_generation_llm,
    embeddings
)

# Add documents to the docstore
generator.docstore.add_documents(documents)
print(f"Docstore contains {len(generator.docstore.nodes)} nodes")

# Attempt to generate the testset
try:
    testset = generator.generate_with_langchain_docs(documents, test_size=2, distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25})
    print("Testset generated successfully")
except ValueError as e:
    print(f"Error: {e}")

This script will print the number of documents loaded and the number of nodes in the docstore. If the number of nodes is zero, it indicates that the documents are not being correctly added to the docstore.
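If the node count is zero even though documents were loaded, it is worth inspecting what the loader actually returned: documents with empty or very short page_content produce no chunks when the docstore splits them, and therefore no nodes. A hedged diagnostic sketch, relying only on the standard LangChain Document attributes:

# Diagnostic sketch: inspect what PubMedLoader returned.
# Empty (or near-empty) page_content yields no chunks/nodes after splitting.
for i, doc in enumerate(documents):
    content = doc.page_content or ""
    print(f"doc {i}: {len(content)} chars, metadata keys: {list(doc.metadata.keys())}")
    if not content.strip():
        print(f"  -> doc {i} has empty page_content and will not produce any nodes")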

Additionally, ensure that the get_random_nodes method in the docstore class is correctly implemented and that the probability distribution is valid. Here is a snippet of the get_random_nodes method for reference:

def get_random_nodes(self, k=1, alpha=0.1) -> t.List[Node]:
    def adjustment_factor(wins, alpha):
        return np.exp(-alpha * wins)

    scores = [adjustment_factor(node.wins, alpha) for node in self.nodes]
    similarity_scores = [node.doc_similarity for node in self.nodes]
    prob = np.array(scores) * np.array(similarity_scores)
    prob = prob / np.sum(prob)

    if len(self.nodes) == 0 or np.sum(prob) == 0:
        return []

    nodes = rng.choice(np.array(self.nodes), size=k, p=prob).tolist()

    for node in nodes:
        idx = self.nodes.index(node)
        self.nodes[idx].wins += 1

    return nodes

Ensure that the self.nodes list is not empty and that the probability distribution prob is valid. If self.nodes is empty or np.sum(prob) is zero, the method will return an empty list, leading to the error [1].

To continue talking to Dosu, mention @dosu.

Rugved2204 commented 1 month ago

@dosu Yes, the documents are not getting added to the docstore, even though I am getting "Loaded 10 documents". Any specific reason for that?

dosubot[bot] commented 1 month ago

The documents are not being added to the docstore because the add_nodes method might not be called correctly, or the nodes are not properly initialized before being added. Additionally, ensure that the embeddings and extractor are set before adding nodes.

Here's an example of how to correctly add nodes to the InMemoryDocumentStore:

from ragas.testset.docstore import InMemoryDocumentStore
from ragas.testset.extractor import KeyphraseExtractor
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain.text_splitter import TokenTextSplitter
from langchain_community.document_loaders import PubMedLoader
from langchain_community.embeddings import HuggingFaceEmbeddings

# load_model is your own helper for loading the Mistral model (not part of ragas)

# Initialize components
embedding_model_name = 'BAAI/bge-small-en-v1.5'
embeddings = HuggingFaceEmbeddings(
    model_name=embedding_model_name,
    model_kwargs={'device': 'cuda:0'}
)

model_name = 'mistralai/Mistral-7B-Instruct-v0.2'
response_generation_llm = load_model(model_name, 0.1, 1024)

splitter = TokenTextSplitter(chunk_size=100, chunk_overlap=0)
keyphrase_extractor = KeyphraseExtractor(llm=response_generation_llm)

# Initialize the DocumentStore
docstore = InMemoryDocumentStore(
    splitter=splitter,
    embeddings=embeddings,
    extractor=keyphrase_extractor,
    llm=response_generation_llm
)

# Load documents
loader = PubMedLoader("liver", load_max_docs=10)
documents = loader.load()
print(f"Loaded {len(documents)} documents")

# Add documents to the docstore
docstore.add_documents(documents)
print(f"Docstore contains {len(docstore.nodes)} nodes")

# Attempt to generate the testset
generator = TestsetGenerator.from_langchain(
    response_generation_llm,
    response_generation_llm,
    embeddings,
    docstore=docstore
)

try:
    testset = generator.generate_with_langchain_docs(documents, test_size=2, distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25})
    print("Testset generated successfully")
except ValueError as e:
    print(f"Error: {e}")

Make sure that the add_nodes method is called with properly initialized nodes and that the embeddings and extractor are set beforehand; this ensures the documents are correctly added to the docstore [1][2].
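As a quick follow-up check after add_documents, you can print how many nodes were produced and preview their content. This is a small sketch that relies only on docstore.nodes and the LangChain-style page_content attribute:

# Verification sketch: confirm the docstore produced nodes and peek at their content.
print(f"Docstore contains {len(docstore.nodes)} nodes")
for node in docstore.nodes[:3]:
    preview = node.page_content[:80].replace("\n", " ")
    print(f"node preview: {preview!r}")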

To continue talking to Dosu, mention @dosu.

Rugved2204 commented 1 month ago

I am using my own LLM and embeddings from Hugging Face, so can you help me make them subclasses of BaseRagasLLM and BaseRagasEmbeddings?

dosubot[bot] commented 1 month ago

To integrate your custom LLM and embeddings from Hugging Face by subclassing BaseRagasLLM and BaseRagasEmbeddings, you can follow these examples:

Subclassing BaseRagasLLM

import typing as t

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from ragas.llms.base import BaseRagasLLM  # adjust the import if it differs in your ragas version

class CustomHuggingfaceLLM(BaseRagasLLM):
    def __init__(self, model_name: str, cache_folder: t.Optional[str] = None, model_kwargs: t.Optional[t.Dict[str, t.Any]] = None):
        self.model_name = model_name
        self.cache_folder = cache_folder
        self.model_kwargs = model_kwargs if model_kwargs else {}
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name, cache_dir=self.cache_folder)
        self.model = AutoModelForCausalLM.from_pretrained(self.model_name, cache_dir=self.cache_folder, **self.model_kwargs)

    def generate(self, prompt: str, max_length: int = 50) -> str:
        inputs = self.tokenizer(prompt, return_tensors="pt")
        outputs = self.model.generate(inputs["input_ids"], max_length=max_length)
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)

Subclassing BaseRagasEmbeddings

import typing as t
from typing import List

import torch
from transformers import AutoModel, AutoTokenizer

from ragas.embeddings.base import BaseRagasEmbeddings  # adjust the import if it differs in your ragas version

class CustomHuggingfaceEmbeddings(BaseRagasEmbeddings):
    def __init__(self, model_name: str, cache_folder: t.Optional[str] = None, model_kwargs: t.Optional[t.Dict[str, t.Any]] = None):
        self.model_name = model_name
        self.cache_folder = cache_folder
        self.model_kwargs = model_kwargs if model_kwargs else {}
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name, cache_dir=self.cache_folder)
        self.model = AutoModel.from_pretrained(self.model_name, cache_dir=self.cache_folder, **self.model_kwargs)

    def embed_query(self, text: str) -> List[float]:
        return self.embed_documents([text])[0]

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        inputs = self.tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            outputs = self.model(**inputs)
        embeddings = outputs.last_hidden_state.mean(dim=1).tolist()
        return embeddings

These examples show how to subclass BaseRagasLLM and BaseRagasEmbeddings to integrate custom models from Hugging Face. Adjust the imports and methods as necessary based on your specific requirements and the actual implementation of BaseRagasLLM and BaseRagasEmbeddings [1][2].
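Note that the sketches above only mirror the constructor pattern; depending on your ragas version, BaseRagasLLM also declares abstract text-generation methods (generate_text/agenerate_text in 0.1.x) that must be implemented before the class can be instantiated, so check the base class definition. If your model is already usable through LangChain, wrapping the LangChain objects is often simpler than subclassing. A sketch of that route, assuming a HuggingFacePipeline-based LLM (the model id and generation settings below are illustrative, not prescriptive):

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import HuggingFacePipeline

from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

# Illustrative local model; substitute your own model id and generation kwargs.
hf_llm = HuggingFacePipeline.from_model_id(
    model_id="mistralai/Mistral-7B-Instruct-v0.2",
    task="text-generation",
    pipeline_kwargs={"max_new_tokens": 1024, "temperature": 0.1},
)
hf_embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")

ragas_llm = LangchainLLMWrapper(hf_llm)
ragas_embeddings = LangchainEmbeddingsWrapper(hf_embeddings)

In ragas 0.1.x, TestsetGenerator.from_langchain applies these wrappers internally, so passing the LangChain LLM and embeddings objects to it directly is usually sufficient.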

To continue talking to Dosu, mention @dosu.

jjmachan commented 1 month ago

@Rugved2204 were you able to solve this?