explodinggradients / ragas

Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
https://docs.ragas.io
Apache License 2.0

Do we need to chunk documents before testset generation? #1098

Open hanfei1986 opened 1 month ago

hanfei1986 commented 1 month ago

An embedding model is passed to TestsetGenerator:

from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
generator = TestsetGenerator.from_langchain(generator_llm, critic_llm, embedding_model)
dataset = generator.generate_with_langchain_docs(documents, test_size=100, distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25})

Does this mean we have to chunk the documents so that they are shorter than the max_seq_length of the embedding model?

dosubot[bot] commented 1 month ago

@hanfei1986 Hello there! I'm here to help you with any bugs, questions, or contributions while we wait for a human maintainer. Let's squash those bugs together! 🤖

Yes, you need to chunk documents before testset generation when using an embedding model with TestsetGenerator. This ensures that the documents do not exceed the max_seq_length of the embedding model. The TokenTextSplitter is typically used for this purpose:

from langchain.text_splitter import TokenTextSplitter

splitter = TokenTextSplitter(chunk_size=1000, chunk_overlap=100)

This splitter will chunk the documents into smaller pieces before they are processed and stored in the InMemoryDocumentStore [1][2][3].
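For reference, this is roughly how that splitter would be applied before generation (a sketch only, assuming documents is the list of LangChain documents from the question; note the maintainer's correction below):

from langchain.text_splitter import TokenTextSplitter

# split each document into ~1000-token chunks with 100 tokens of overlap
splitter = TokenTextSplitter(chunk_size=1000, chunk_overlap=100)
chunked_documents = splitter.split_documents(documents)
dataset = generator.generate_with_langchain_docs(chunked_documents, test_size=100, distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25})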


jjmachan commented 3 weeks ago

@hanfei1986 @dosu was wrong here; don't chunk the documents. We do the chunking ourselves, in a way that makes sense for the testset generation module.

dosubot[bot] commented 3 weeks ago

You do not need to chunk the documents before testset generation. The TestsetGenerator module handles the chunking of documents internally in a way that makes sense for the test set generation process. This means you can pass your documents directly to the generate_with_langchain_docs method without worrying about their length relative to the max_seq_length of the embedding model.
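In other words, the loading step can feed generate_with_langchain_docs directly, for example (a sketch, reusing the generator and distributions from the question; the DirectoryLoader path is just a placeholder):

from langchain_community.document_loaders import DirectoryLoader

# load the raw, unchunked documents; ragas handles chunking internally
loader = DirectoryLoader("path_to_your_docs")  # placeholder path
documents = loader.load()
dataset = generator.generate_with_langchain_docs(documents, test_size=100, distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25})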
