explodinggradients / ragas

Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
https://docs.ragas.io
Apache License 2.0

Do we need to chunk documents before testset generation? #1098

Open hanfei1986 opened 1 month ago

hanfei1986 commented 1 month ago

An embedding model is passed to TestsetGenerator:

from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
generator = TestsetGenerator.from_langchain(generator_llm, critic_llm, embedding_model)
dataset = generator.generate_with_langchain_docs(documents, test_size=100, distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25})

Does this mean we have to chunk the documents so that they are shorter than the max_seq_length of the embedding model?

dosubot[bot] commented 1 month ago

@hanfei1986 Hello there! I'm here to help you with any bugs, questions, or contributions while we wait for a human maintainer. Let's squash those bugs together! 🤖

Yes, you need to chunk documents before testset generation when using an embedding model with TestsetGenerator. This ensures that the documents do not exceed the max_seq_length of the embedding model. The TokenTextSplitter is typically used for this purpose:

from langchain.text_splitter import TokenTextSplitter

splitter = TokenTextSplitter(chunk_size=1000, chunk_overlap=100)

This splitter will chunk the documents into smaller pieces before they are processed and stored in the InMemoryDocumentStore [1][2][3].
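For reference, this is roughly how that splitter would be applied before generation (a sketch only, assuming documents is the list of LangChain documents from the question; note the maintainer's correction below):

from langchain.text_splitter import TokenTextSplitter

# split each document into ~1000-token chunks with 100 tokens of overlap
splitter = TokenTextSplitter(chunk_size=1000, chunk_overlap=100)
chunked_documents = splitter.split_documents(documents)
dataset = generator.generate_with_langchain_docs(chunked_documents, test_size=100, distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25})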


jjmachan commented 3 weeks ago

@hanfei1986 @dosu was wrong here; don't chunk the documents. We do the chunking ourselves, in a way that makes sense for the testset generation module.

dosubot[bot] commented 3 weeks ago

You do not need to chunk the documents before testset generation. The TestsetGenerator module handles the chunking of documents internally in a way that makes sense for the test set generation process. This means you can pass your documents directly to the generate_with_langchain_docs method without worrying about their length relative to the max_seq_length of the embedding model.
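In other words, the loading step can feed generate_with_langchain_docs directly, for example (a sketch, reusing the generator and distributions from the question; the DirectoryLoader path is just a placeholder):

from langchain_community.document_loaders import DirectoryLoader

# load the raw, unchunked documents; ragas handles chunking internally
loader = DirectoryLoader("path_to_your_docs")  # placeholder path
documents = loader.load()
dataset = generator.generate_with_langchain_docs(documents, test_size=100, distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25})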
