explodinggradients / ragas

Supercharge Your LLM Application Evaluations 🚀
https://docs.ragas.io
Apache License 2.0

Runner thread raised exception and testset function is not accessible #1333

Closed Sam-364 closed 1 month ago

Sam-364 commented 1 month ago

I have checked both the documentation and ragas-langchain documentation and couldn't resolve my issue.

The bug

Whenever I try to generate a testset with the generator module's generate_with_langchain_docs function, I get a thread-handling error and execution stops abruptly. I tried downgrading to ragas==0.1.7, the last version in which I did not see this error, but that did not help. generate_with_llamaindex_docs fails with the same error. I also tried combining the two frameworks (loading the documents with LangChain's document loader and generating with generate_with_llamaindex_docs), but the issue persisted. I have followed the documentation closely and still could not fix the bug. Passing the suggested "raise_exceptions=False" has no effect either. I am using local Llama models served through Ollama as generator_llm and critic_llm. I also tried changing individual arguments, without success.

Ragas version: 0.1.20
Python version: 3.10.12

Here is my detailed code:

import torch
from ragas.testset.generator import TestsetGenerator 
from ragas.testset.evolutions import simple, reasoning, multi_context  
from langchain_ollama import ChatOllama  
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM  
from langchain.embeddings import HuggingFaceEmbeddings  
from langchain_core.documents import Document as LCDocument  
import pymupdf  

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": device}

generator_llm = ChatOllama(model="llama3", temperature=0.2)
critic_llm = ChatOllama(model="llama3.1", temperature=0.2)

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)

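def load_pdf_as_document(file_path):
    """Read every page of a PDF into a single LangChain Document."""
    # The context manager closes the PDF once the text has been extracted.
    with pymupdf.open(file_path) as doc:
        text = "".join(page.get_text() for page in doc)
    return LCDocument(page_content=text)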

pdf_path = "/content/1.pdf"
document = load_pdf_as_document(pdf_path)
documents = [document]

print(documents)

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings,
)

testset = generator.generate_with_langchain_docs(
    documents=documents,
    test_size=10,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
    with_debugging_logs=True,
    is_async=True,
    raise_exceptions=False,
)

testset.to_pandas()

test_df = testset.to_pandas()
test_df.head()

Error trace

While executing the following code block:

testset = generator.generate_with_langchain_docs(
    documents=documents,
    test_size=10,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
    with_debugging_logs=True,
    is_async=True,
    raise_exceptions=False,
)

I am getting the following error:

ExceptionInRunner                         Traceback (most recent call last)
<ipython-input-20-501c4800af7d> in <cell line: 1>()
----> 1 testset = generator.generate_with_langchain_docs(
      2         documents=documents,
      3         test_size=10,
      4         distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
      5         with_debugging_logs=True,

2 frames
/usr/local/lib/python3.10/dist-packages/ragas/testset/docstore.py in add_nodes(self, nodes, show_progress)
    251         results = executor.results()
    252         if not results:
--> 253             raise ExceptionInRunner()
    254 
    255         for i, n in enumerate(nodes):

ExceptionInRunner: The runner thread which was running the jobs raised an exeception. Read the traceback above to debug it. You can also pass `raise_exceptions=False` incase you want to show only a warning message instead.
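
The traceback above ends in a generic ExceptionInRunner, raised because `executor.results()` came back empty, so the underlying failure is not visible in it. The following is a minimal standard-library sketch of that general pattern, not ragas's actual implementation: the names `RunnerFailed`, `failing_job`, `run_hiding_cause`, and `run_showing_cause` are hypothetical, and the `ConnectionError` is only an assumed example of a root cause.

```python
import asyncio


class RunnerFailed(Exception):
    """Hypothetical stand-in for a generic 'the runner failed' error."""


async def failing_job(i: int) -> int:
    # Assumed root cause for illustration: the model server is unreachable.
    raise ConnectionError(f"job {i}: could not reach the model server")


def _gather(n: int) -> list:
    """Run n jobs and return results and exceptions side by side."""
    async def _main():
        jobs = (failing_job(i) for i in range(n))
        return await asyncio.gather(*jobs, return_exceptions=True)
    return asyncio.run(_main())


def run_hiding_cause(n: int) -> list:
    """Drop failed jobs; raise a generic error if nothing succeeded."""
    results = [o for o in _gather(n) if not isinstance(o, Exception)]
    if not results:
        raise RunnerFailed("runner produced no results")
    return results


def run_showing_cause(n: int) -> list:
    """Return the underlying exceptions so they can be inspected."""
    return [o for o in _gather(n) if isinstance(o, Exception)]
```

Calling `run_hiding_cause(2)` raises only the generic error, while `run_showing_cause(2)` returns the two `ConnectionError` instances, which is the information needed to debug a failure of this kind.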

Expected behavior

The script should create a synthetic Q&A dataset from my documents, to be used for evaluating a RAG pipeline, but execution halts abruptly because of this bug. All modules except the testset module are working.

Hoping for a quick fix, since I have seen many similar issues reported for the generator module.

dosubot[bot] commented 1 month ago

Hello, @Sam-364! I'm here to assist you with resolving bugs, answering questions, and becoming a contributor. Let's work together to tackle this issue!

There are a few similar issues that might help you resolve this problem:

  1. Generate test data for 1 pdf (open):

    • This issue discusses generating test data from a PDF file and provides a detailed code example. It might be useful to compare your code with the provided example to ensure all components are correctly initialized.
  2. Error in Testset Generation - ExceptionInRunner: The runner thread which was running the jobs raised an exception (closed):

    • This issue was addressed with a fix in version 0.1.10. Ensure you are using the latest version of Ragas, as this might resolve the issue.
  3. New executor throws RuntimeError: ... got Future <..> attached to a different loop (closed):

    • This issue was resolved by using nest_asyncio to handle event loops in Jupyter notebooks. You can try adding the following code to your script:
      import nest_asyncio
      nest_asyncio.apply()
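
For context on the third item: in a notebook an event loop is typically already running, and standard asyncio refuses to start a second one from inside it. A minimal sketch using only the standard library (the names `work`, `notebook_cell`, and `demo` are illustrative, and the sketch makes no claim about how ragas schedules its jobs):

```python
import asyncio


async def work() -> str:
    return "done"


async def notebook_cell() -> str:
    """Simulates code that executes while an event loop is already running."""
    coro = work()
    try:
        # Starting a second event loop from inside a running one is rejected.
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"


def demo() -> str:
    # The outer asyncio.run plays the role of the notebook's own loop.
    return asyncio.run(notebook_cell())
```

`demo()` returns a string beginning with `RuntimeError`. `nest_asyncio.apply()` patches asyncio so that such nested calls are allowed, which is why it is commonly suggested for notebook environments.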

The ExceptionInRunner error occurs because the generate_with_langchain_docs function calls the generate method with the raise_exceptions parameter set to True. This means any exception during the generation process will be raised [1].

Here is a modified version of your code with the nest_asyncio fix:

import torch
from ragas.testset.generator import TestsetGenerator 
from ragas.testset.evolutions import simple, reasoning, multi_context  
from langchain_ollama import ChatOllama  
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM  
from langchain.embeddings import HuggingFaceEmbeddings  
from langchain_core.documents import Document as LCDocument  
import pymupdf  
import nest_asyncio

nest_asyncio.apply()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": device}

generator_llm = ChatOllama(model="llama3", temperature=0.2)
critic_llm = ChatOllama(model="llama3.1", temperature=0.2)

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)

def load_pdf_as_document(file_path):
    doc = pymupdf.open(file_path)
    text = ""
    for page in doc:
        text += page.get_text()
    return LCDocument(page_content=text)

pdf_path = "/content/1.pdf"
document = load_pdf_as_document(pdf_path)
documents = [document]

print(documents)

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings,
)

testset = generator.generate_with_langchain_docs(
    documents=documents,
    test_size=10,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
    with_debugging_logs=True,
    is_async=True,
    raise_exceptions=False,
)

testset.to_pandas()

test_df = testset.to_pandas()
test_df.head()

Try running this updated script and see if it resolves the issue. If the problem persists, consider checking the detailed error trace for any specific issues related to your environment or dependencies.
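
One way to narrow down environment or dependency problems is to call each component on its own before generating the testset, so that the real exception surfaces directly. A minimal sketch, assuming nothing about ragas internals; `smoke_test` is a hypothetical helper, while `invoke` and `embed_query` in the commented usage are the standard LangChain chat-model and embeddings methods:

```python
from typing import Any, Callable, Dict


def smoke_test(checks: Dict[str, Callable[[], Any]]) -> Dict[str, str]:
    """Run each check on its own and report 'ok' or the exception it raised."""
    report = {}
    for name, check in checks.items():
        try:
            check()
            report[name] = "ok"
        except Exception as exc:  # record the failure and keep going
            report[name] = f"{type(exc).__name__}: {exc}"
    return report


# Usage with the objects from the script above (not executed here):
# report = smoke_test({
#     "generator_llm": lambda: generator_llm.invoke("Say hello."),
#     "critic_llm": lambda: critic_llm.invoke("Say hello."),
#     "embeddings": lambda: embeddings.embed_query("hello"),
# })
# print(report)
```

Any entry that is not "ok" names the component that fails and the exception it raises, independently of the testset generator.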


Sam-364 commented 1 month ago

I had previously used nest_asyncio.apply(), but later realized that the "is_async=True" parameter of the testset call already does the same thing, so applying it again has no additional effect. P.S. The suggested change will not work either, because the testset module is already running asynchronously.

chenchenhaha commented 1 month ago

After downgrading the LangChain packages, the same issue was solved for me. My package versions are now:

langchain 0.2.16
langchain-community 0.2.0
langchain-core 0.2.41
langchain-openai 0.1.20
langchain-text-splitters 0.2.4
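
To compare an environment against the versions listed in this comment, a small standard-library check can help. This is a sketch only: `REPORTED_WORKING`, `installed_version`, and `version_mismatches` are hypothetical names, and the pinned versions are simply the ones reported above, not an officially supported set.

```python
from importlib import metadata
from typing import Callable, Dict, Optional

# Versions reported as working in this thread.
REPORTED_WORKING = {
    "langchain": "0.2.16",
    "langchain-community": "0.2.0",
    "langchain-core": "0.2.41",
    "langchain-openai": "0.1.20",
    "langchain-text-splitters": "0.2.4",
}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_mismatches(
    expected: Dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> Dict[str, Optional[str]]:
    """Map each package whose installed version differs to what was found."""
    found = {pkg: lookup(pkg) for pkg in expected}
    return {pkg: ver for pkg, ver in found.items() if ver != expected[pkg]}
```

`version_mismatches(REPORTED_WORKING)` returns an empty dict when the environment matches the list; otherwise it shows, per package, the version that is actually installed (or `None` if the package is absent).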

Sam-364 commented 1 month ago

Yes, I forgot to comment: I also downgraded the package versions and it worked for me. The only remaining issue is that generating the dataset now takes an extremely long time. P.S. I hope to discuss that issue later on!

For now, I'm closing the issue.