explodinggradients / ragas

Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
https://docs.ragas.io
Apache License 2.0
6.63k stars, 649 forks

Synthetic test data generation failed for huggingface models #1338

Open wanjeakshay opened 1 day ago

wanjeakshay commented 1 day ago

[ ] I have checked the documentation and related resources and couldn't resolve my bug.

Describe the bug

I am trying to generate test data for Ragas evaluation. I use open-source Hugging Face models for the generator LLM, the critic LLM, and the embeddings. Generation fails with an error raised from the node-sampling call made in ragas/testset/generator.py. The output is as follows:

WARNING:ragas.testset.docstore:Filename and doc_id are the same for all nodes.

    ValueError                                Traceback (most recent call last)
    in <cell line: 7>()
          5
          6 # Generate the test set
    ----> 7 testset = generator.generate_with_langchain_docs(
          8     documents=documents,
          9     test_size=5, # Example test size

    3 frames
    /usr/local/lib/python3.10/dist-packages/ragas/testset/generator.py in generate_with_langchain_docs(self, documents, test_size, distributions, with_debugging_logs, is_async, raise_exceptions, run_config)
        208         )
        209
    --> 210         return self.generate(
        211             test_size=test_size,
        212             distributions=distributions,

    /usr/local/lib/python3.10/dist-packages/ragas/_analytics.py in wrapper(*args, **kwargs)
        127     def wrapper(*args: P.args, **kwargs: P.kwargs) -> t.Any:
        128         track(IsCompleteEvent(event_type=func.__name__, is_completed=False))
    --> 129         result = func(*args, **kwargs)
        130         track(IsCompleteEvent(event_type=func.__name__, is_completed=True))
        131

    /usr/local/lib/python3.10/dist-packages/ragas/testset/generator.py in generate(self, test_size, distributions, with_debugging_logs, is_async, raise_exceptions, run_config)
        278         current_nodes = [
        279             CurrentNodes(root_node=n, nodes=[n])
    --> 280             for n in self.docstore.get_random_nodes(k=test_size)
        281         ]
        282         total_evolutions = 0

    /usr/local/lib/python3.10/dist-packages/ragas/testset/docstore.py in get_random_nodes(self, k, alpha)
        325         prob = prob / np.sum(prob)
        326
    --> 327         nodes = self.run_config.rng.choice(
        328             np.array(self.nodes), size=k, p=prob
        329         ).tolist()

    numpy/random/_generator.pyx in numpy.random._generator.Generator.choice()

    ValueError: a cannot be empty unless no samples are taken

Ragas version: 0.1.20
Python version: 3.10.0

Code to Reproduce

    from langchain_huggingface import HuggingFaceEmbeddings, HuggingFaceEndpoint
    from ragas.testset.generator import TestsetGenerator
    from ragas.testset.evolutions import simple, reasoning, multi_context

    # hf_api_key and documents are defined elsewhere in the notebook (not shown).
    embeddings = HuggingFaceEmbeddings(model_name="sangmini/msmarco-cotmae-MiniLM-L12_en-ko-ja")

    repo_id = "mistralai/Mistral-Nemo-Instruct-2407"

    llm_model = HuggingFaceEndpoint(
        repo_id=repo_id,
        max_length=128,
        temperature=0.5,
        huggingfacehub_api_token=hf_api_key,
    )

    # llm_chain, llm_model
    generator = TestsetGenerator.from_langchain(
        generator_llm=llm_model,
        critic_llm=llm_model,
        embeddings=embeddings,
        chunk_size=500,
    )

    # Generate the test set
    testset = generator.generate_with_langchain_docs(
        documents=documents,
        test_size=5,  # Example test size
        distributions={simple: 0.5, reasoning: 0.3, multi_context: 0.2},
        with_debugging_logs=False,
        is_async=True,
        raise_exceptions=False,  # Set to False to avoid ExceptionInRunner error
    )

Error trace

Expected behavior

The test data should be generated.


dosubot[bot] commented 1 day ago

I found several similar issues that might help you resolve the ValueError: "a cannot be empty unless no samples are taken" error:

  1. Ensure Documents are Loaded: This error typically occurs when the documents list is empty. Check that it is non-empty before calling the generate_with_langchain_docs method, for example:

    # Ensure documents are loaded
    if not documents:
        raise ValueError("The documents list cannot be empty.")

    # Generate the test set
    testset = generator.generate_with_langchain_docs(
        documents=documents,
        test_size=5,  # Example test size
        with_debugging_logs=False,
        is_async=True,
        raise_exceptions=False,  # Set to False to avoid ExceptionInRunner error
        run_config=run_config,  # Optional; assumes a RunConfig was created earlier
    )

    Additionally, ensure that your documents variable is populated with valid data. Because generate_with_langchain_docs expects LangChain documents, load them with a LangChain loader, for example:

    from langchain_community.document_loaders import DirectoryLoader

    # Load documents
    dir_path = "./path-to-your-documents"
    loader = DirectoryLoader(dir_path)
    documents = loader.load()

    # Check if documents are loaded
    if not documents:
        raise ValueError("No documents found. Ensure the directory contains valid files.")

    This should help you resolve the issue and generate the synthetic test data successfully [1].

  2. Ensure Enough Nodes in Docstore: Ensure that the docstore contains enough nodes to match the test_size. The snippet below mirrors the sampling step inside ragas/testset/generator.py (hence the use of self) with an added length check:

    current_nodes = [
        CurrentNodes(root_node=n, nodes=[n])
        for n in self.docstore.get_random_nodes(k=test_size)
    ]

    if len(current_nodes) < test_size:
        raise ValueError("Not enough nodes in the docstore to generate the test set.")

    By implementing these checks, you can ensure that you do not attempt to generate a test set with an empty list of documents or insufficient nodes in the docstore [2].

  3. Check for Short Documents: A similar issue reported that very short documents can cause this error. Ensure that your documents are long enough for the QA generation process. You might need to pass larger chunks or modify the test distribution.

    # Example modification
    testset = generator.generate_with_langchain_docs(
        documents=documents,
        test_size=5,
        distributions={'simple': 1.0},
        with_debugging_logs=False,
        is_async=True,
        raise_exceptions=False,
    )

    This should help avoid the "ValueError: a cannot be empty unless no samples are taken" error [3].

By following these steps, you should be able to resolve the issue and successfully generate the synthetic test data.
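For reference, the ValueError in the traceback can be reproduced with NumPy alone. The sketch below is a simplified stand-in for the sampling call shown in docstore.py: the helper name sample_nodes is hypothetical, and the p=prob weights from the traceback are left out. It suggests that the failure is triggered by an empty node list rather than by having fewer nodes than test_size, since Generator.choice samples with replacement by default.

```python
import numpy as np


def sample_nodes(nodes, k, seed=0):
    """Hypothetical helper mimicking rng.choice(np.array(nodes), size=k).tolist()."""
    rng = np.random.default_rng(seed)
    return rng.choice(np.array(nodes), size=k).tolist()


# Fewer nodes than requested still works: choice() samples with replacement by default.
picked = sample_nodes(["node-a", "node-b"], k=5)

# An empty node list raises the ValueError reported in the issue.
try:
    sample_nodes([], k=5)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(len(picked), error_message)
```

If this matches what happens inside Ragas, the useful question is why the docstore ends up with zero nodes, not whether test_size is too large.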


wanjeakshay commented 1 day ago

@peterkaminski I am running this in Google Colab. Do you know what the issue with node creation is and how I can resolve it?
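One way to narrow this down before calling the generator is to inspect the loaded documents. The following is only a sketch under stated assumptions: Doc is a hypothetical stand-in for a LangChain Document (only page_content and metadata are modeled), summarize_documents is an illustrative helper that is not part of Ragas, and the 50-word threshold is arbitrary.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (only the fields used here)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_documents(documents, min_words=50):
    """Count documents that are empty or shorter than min_words (whitespace-split)."""
    empty = [d for d in documents if not d.page_content.strip()]
    short = [
        d for d in documents
        if d.page_content.strip() and len(d.page_content.split()) < min_words
    ]
    return {"total": len(documents), "empty": len(empty), "short": len(short)}


# Illustrative data: one empty document, one very short one, one long one.
docs = [Doc(""), Doc("too short"), Doc("word " * 200)]
report = summarize_documents(docs)
print(report)
```

If total is 0, or every document is empty or very short, the loader or the input files are the first place to look. Otherwise the nodes are probably being lost later, during node creation.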