Open hanfei1986 opened 4 months ago
The embedding model is used for `TestsetGenerator`:

Does this mean we have to chunk the documents so that they are shorter than the `max_seq_length` of the embedding model?
@hanfei1986 Hello there! I'm here to help you with any bugs, questions, or contributions while we wait for a human maintainer. Let's squash those bugs together! 🤖
Yes, you need to chunk documents before testset generation when using the embedding model for `TestsetGenerator`. This ensures that the documents do not exceed the `max_seq_length` of the embedding model. The `TokenTextSplitter` is typically used for this purpose:
```python
from langchain.text_splitter import TokenTextSplitter

splitter = TokenTextSplitter(chunk_size=1000, chunk_overlap=100)
```
This splitter will chunk the documents into smaller pieces before they are processed and stored in the `InMemoryDocumentStore` [1][2][3].
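For context, a minimal sketch of how that splitter would be applied before generation; the `TextLoader` and file path below are illustrative assumptions, not part of the original answer:

```python
from langchain.text_splitter import TokenTextSplitter
from langchain_community.document_loaders import TextLoader

# Hypothetical input file; load your own corpus however you like.
documents = TextLoader("my_document.txt").load()

# Chunk so no piece exceeds the embedding model's max_seq_length.
splitter = TokenTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(documents)
```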
@hanfei1986 @dosu was wrong here: don't chunk the documents. We do some chunking of the documents ourselves, in a way that makes sense for the testset generation module.
You do not need to chunk the documents before testset generation. The `TestsetGenerator` module handles the chunking of documents internally, in a way that makes sense for the test set generation process. This means you can pass your documents directly to the `generate_with_langchain_docs` method without worrying about their length relative to the `max_seq_length` of the embedding model.
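As a minimal sketch of that workflow (the model choices and loader below are assumptions, and the exact constructor and parameter names vary between ragas releases):

```python
from langchain_community.document_loaders import TextLoader
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.testset import TestsetGenerator

# Hypothetical corpus; pass your full, unchunked documents.
documents = TextLoader("my_document.txt").load()

# Any LangChain-compatible LLM and embedding model will do; OpenAI is only an example.
generator_llm = ChatOpenAI(model="gpt-4o-mini")
generator_embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(generator_llm, generator_embeddings)

# Chunking happens internally; no TokenTextSplitter needed.
testset = generator.generate_with_langchain_docs(documents, testset_size=10)
```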