Closed: murdadesmaeeli closed this 3 months ago
@murdadesmaeeli Hello there! I'm here to help you with any bugs, questions, or contributions you have while waiting for a human maintainer. Let's dive in and see how I can assist you! 🤖
The number of retrieved chunks is set to 2 and chunk overlap is set to 0 in testset.generator.from_langchain to control the granularity and overlap of text chunks during the test set generation process. These settings can be modified by adjusting the parameters of the TokenTextSplitter used in the DocumentStore.
To modify these settings, you can change the chunk_size and chunk_overlap parameters when initializing the TokenTextSplitter:
from langchain.text_splitter import TokenTextSplitter
from ragas.testset.docstore import InMemoryDocumentStore
from ragas.testset.extractor import KeyphraseExtractor
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
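# define llm and embeddings
# NOTE: BaseLanguageModel and Embeddings are placeholders that are not imported above;
# substitute a concrete LangChain LLM and Embeddings instance of your choice
langchain_llm = BaseLanguageModel(model="my_model") # any langchain LLM instance
langchain_embeddings = Embeddings(model="my_model") # any langchain Embeddings instance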
# wrap them with wrappers
langchain_llm = LangchainLLMWrapper(langchain_llm)
langchain_embeddings = LangchainEmbeddingsWrapper(langchain_embeddings)
# init the DocumentStore with your own llm and embeddings
splitter = TokenTextSplitter(chunk_size=1000, chunk_overlap=0) # Modify chunk_size and chunk_overlap here
keyphrase_extractor = KeyphraseExtractor(llm=langchain_llm)
docstore = InMemoryDocumentStore(
    splitter=splitter,
    embeddings=langchain_embeddings,
    extractor=keyphrase_extractor,
)
In this example, chunk_size is set to 1000 and chunk_overlap is set to 0. You can adjust these values as needed [1].
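To get a feel for what these two parameters do before changing them, here is a minimal sketch of sliding-window chunking. It is an illustration only, not the TokenTextSplitter implementation: the helper name split_tokens is hypothetical, and it splits on whitespace rather than on real tokenizer tokens.

```python
def split_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of at most chunk_size tokens.

    Consecutive windows share chunk_overlap tokens. Hypothetical helper
    for illustration; it is not part of langchain or ragas.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Assumption: whitespace-separated words stand in for tokenizer tokens.
tokens = "a b c d e f g h i j".split()

no_overlap = split_tokens(tokens, chunk_size=4, chunk_overlap=0)
with_overlap = split_tokens(tokens, chunk_size=4, chunk_overlap=2)

print(len(no_overlap), "chunks without overlap")
print(len(with_overlap), "chunks with an overlap of 2")
```

With an overlap of 0, every token lands in exactly one chunk. Raising the overlap produces more chunks that repeat context at their boundaries, which is the trade-off to weigh when moving away from the default.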
@murdadesmaeeli it was decided based on what works best for arXiv papers. With #1016 we will improve it to be a bit more customisable.
Let me know if you have any more questions.
It seems the issue was answered, closing this now.
Is there a specific reason these decisions were made about the testset setup, and is there a way to influence the number of chunks retrieved and the chunk overlap in the code below: