explodinggradients / ragas

Supercharge Your LLM Application Evaluations 🚀
https://docs.ragas.io
Apache License 2.0

Why is the number of retrieved chunks 2 and the chunk overlap 0 in testset.generator.from_langchain? #1074

Closed: murdadesmaeeli closed this issue 3 months ago

murdadesmaeeli commented 4 months ago

Is there a specific reason these defaults were chosen for the testset setup, and is there a way to influence the number of retrieved chunks and the chunk overlap in the code below:

@classmethod
def from_langchain(
    cls,
    generator_llm: LangchainLLM,
    critic_llm: LangchainLLM,
    embeddings: LangchainEmbeddings,
    docstore: t.Optional[DocumentStore] = None,
    run_config: t.Optional[RunConfig] = None,
    chunk_size: int = 1024,
) -> "TestsetGenerator":
    generator_llm_model = LangchainLLMWrapper(generator_llm)
    critic_llm_model = LangchainLLMWrapper(critic_llm)
    embeddings_model = LangchainEmbeddingsWrapper(embeddings)

    keyphrase_extractor = KeyphraseExtractor(llm=generator_llm_model)
    if docstore is None:
        from langchain.text_splitter import TokenTextSplitter

        # chunk_overlap is hardcoded to 0 here; only chunk_size is exposed
        splitter = TokenTextSplitter(chunk_size=chunk_size, chunk_overlap=0)
        docstore = InMemoryDocumentStore(
            splitter=splitter,
            embeddings=embeddings_model,
            extractor=keyphrase_extractor,
            run_config=run_config,
        )
    return cls(
        generator_llm=generator_llm_model,
        critic_llm=critic_llm_model,
        embeddings=embeddings_model,
        docstore=docstore,
    )
dosubot[bot] commented 4 months ago

@murdadesmaeeli Hello there! I'm here to help you with any bugs, questions, or contributions you have while waiting for a human maintainer. Let's dive in and see how I can assist you! 🤖

In testset.generator.from_langchain, the number of retrieved chunks defaults to 2 and the chunk overlap to 0 to control the granularity and overlap of text chunks during test set generation. These settings can be changed by adjusting the parameters of the TokenTextSplitter used in the DocumentStore.

To modify these settings, you can change the chunk_size and chunk_overlap parameters when initializing the TokenTextSplitter:

from langchain.text_splitter import TokenTextSplitter
from ragas.testset.docstore import InMemoryDocumentStore
from ragas.testset.extractor import KeyphraseExtractor
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

# define llm and embeddings (ChatOpenAI and OpenAIEmbeddings are placeholders;
# any LangChain LLM and Embeddings instances will work)
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

langchain_llm = ChatOpenAI(model="gpt-3.5-turbo")
langchain_embeddings = OpenAIEmbeddings()

# wrap them with the ragas wrappers for use inside the docstore
wrapped_llm = LangchainLLMWrapper(langchain_llm)
wrapped_embeddings = LangchainEmbeddingsWrapper(langchain_embeddings)

# init the DocumentStore with your own splitter settings
splitter = TokenTextSplitter(chunk_size=1000, chunk_overlap=0)  # modify chunk_size and chunk_overlap here
keyphrase_extractor = KeyphraseExtractor(llm=wrapped_llm)
docstore = InMemoryDocumentStore(
    splitter=splitter,
    embeddings=wrapped_embeddings,
    extractor=keyphrase_extractor,
)

In this example, chunk_size is set to 1000 and chunk_overlap is set to 0. You can adjust these values as needed [1].
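To wire this into the generator, the custom docstore can then be passed to TestsetGenerator.from_langchain, which skips building its default TokenTextSplitter(chunk_size=1024, chunk_overlap=0) when a docstore is supplied. A minimal sketch based on the signature quoted above, reusing the placeholder models; note that from_langchain expects the unwrapped LangChain objects, since it wraps them internally:

from ragas.testset.generator import TestsetGenerator

# from_langchain wraps the raw LangChain objects itself, so pass the
# unwrapped instances; supplying docstore bypasses the default splitter
generator = TestsetGenerator.from_langchain(
    generator_llm=langchain_llm,
    critic_llm=langchain_llm,  # a separate critic model can be substituted
    embeddings=langchain_embeddings,
    docstore=docstore,
)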

To continue talking to Dosu, mention @dosu.

jjmachan commented 3 months ago

@murdadesmaeeli It was decided based on what works best for arXiv papers. With #1016 we will make it a bit more customisable.

Let me know if you have any more questions.

github-actions[bot] commented 3 months ago

It seems the issue was answered, closing this now.