explodinggradients / ragas

Supercharge Your LLM Application Evaluations 🚀
https://docs.ragas.io
Apache License 2.0

How does RAGAS generate ground_truth in TestSetGenerator class #1322

Open adityamity opened 1 month ago

adityamity commented 1 month ago

[ ] I checked the documentation and related resources and couldn't find an answer to my question.

Your Question: How does RAGAS generate ground_truth in the TestsetGenerator class?

dosubot[bot] commented 1 month ago

The TestsetGenerator class in RAGAS generates ground_truth as part of the synthetic Question/Context/Ground_Truth samples by utilizing OpenAI models. The process involves loading documents, configuring the generator with language models, and then generating the test set with specified distributions.

Here is the relevant code snippet:

from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# generator with openai models
generator_llm = ChatOpenAI(model="gpt-3.5-turbo-16k")
critic_llm = ChatOpenAI(model="gpt-4")
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

# generate testset
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=10,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)

This code sets up the TestsetGenerator with OpenAI models and generates a test set from the loaded documents, which includes the ground_truth as part of the synthetic samples [1].
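
For reference, a minimal follow-up sketch (assuming the snippet above ran successfully) showing where ground_truth appears in the generated test set:

test_df = testset.to_pandas()
print(test_df.columns.tolist())                      # expect 'question', 'contexts', 'ground_truth', ...
print(test_df[["question", "ground_truth"]].head())  # inspect generated questions and their ground_truth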

To continue talking to Dosu, mention @dosu.

adityamity commented 1 month ago

@dosu I want to understand where exactly in the code the ground_truth is generated as a column in the testset?

dosubot[bot] commented 1 month ago

The ground_truth is generated as a column in the test set within the upload_to_langsmith function. Specifically, it is created when the pandas DataFrame is constructed from the dataset dictionary:

dataset = {"question": questions, "ground_truth": answers}
def upload_to_langsmith(dataset_name: str) -> UUID:
    # Creating a pandas DataFrame from the dataset dictionary
    df = pd.DataFrame(dataset)
    ...

Here, the dataset dictionary contains the questions and answers lists, which are used to create the DataFrame df with columns "question" and "ground_truth" [1].
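
A minimal sketch of that construction, with hypothetical question/answer lists standing in for the real ones:

import pandas as pd

questions = ["How do human activities contribute to climate change?"]                # hypothetical
answers = ["Primarily through greenhouse gas emissions from burning fossil fuels."]  # hypothetical

dataset = {"question": questions, "ground_truth": answers}
df = pd.DataFrame(dataset)
print(df.columns.tolist())  # ['question', 'ground_truth']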

Additionally, the ground_truth is also prepared in the dataset preparation step for Ragas, as shown in the following example:

eval_dataset = load_dataset("explodinggradients/prompt-engineering-guide-papers")
eval_dataset = eval_dataset["test"].to_pandas()
eval_dataset.head()

In this example, the dataset contains the question and ground_truth attributes, which are essential for evaluating the RAG pipeline. The ground_truth answers were created by the Ragas synthetic data generation feature [2].

Moreover, in the Ragas validation process, the handle_deprecated_ground_truths function ensures that the ground_truth column is correctly named and formatted:

def handle_deprecated_ground_truths(ds: Dataset) -> Dataset:
    if "ground_truths" in ds.features and "ground_truth" not in ds.features:
        column_names = "ground_truths"
        if (
            isinstance(ds.features[column_names], Sequence)
            and ds.features[column_names].feature.dtype == "string"
        ):
            logger.warning(
                "passing column names as 'ground_truths' is deprecated and will be removed in the next version, please use 'ground_truth' instead. Note that `ground_truth` should be of type string and not Sequence[string] like `ground_truths`"
            )
            gt = [gt[0] for gt in ds["ground_truths"]]
            ds = ds.add_column(
                "ground_truth",
                gt,
                new_fingerprint=ds._fingerprint
                + "a",  # adding random to fingerprint to avoid caching
            )
    return ds

This function checks if the dataset uses the deprecated ground_truths column and converts it to the ground_truth column, ensuring compatibility and correct formatting [3].
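
A minimal usage sketch of that conversion (assuming the datasets library infers the deprecated column as a Sequence of strings; the sample row is hypothetical):

from datasets import Dataset

ds = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "ground_truths": [["Paris is the capital of France."]],  # deprecated list-of-strings form
})
ds = handle_deprecated_ground_truths(ds)
print(ds["ground_truth"])  # ['Paris is the capital of France.']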

To continue talking to Dosu, mention @dosu.

adityamity commented 1 month ago

@dosu for my documents, how does RAGAS generate ground_truth and on what basis?

dosubot[bot] commented 1 month ago

RAGAS generates ground_truth for your documents by creating synthetic Question/Context/Ground_Truth samples using OpenAI models. The process involves loading a collection of documents, which are then used as a foundation for generating these samples. The TestsetGenerator from RAGAS is used to generate the test set, which includes the ground_truth data.

Here is the relevant code snippet:

from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from llama_index import SimpleDirectoryReader

# load documents
dir_path = "./prompt-engineering-papers"
reader = SimpleDirectoryReader(dir_path, num_files_limit=2)
documents = reader.load_data()

# generator with openai models
generator = TestsetGenerator.with_openai()

# set question type distribution
distribution = {simple: 0.5, reasoning: 0.25, multi_context: 0.25}

# generate testset
testset = generator.generate_with_llamaindex_docs(
    documents, test_size=10, distributions=distribution
)
test_df = testset.to_pandas()

The ground_truth is generated based on the distributions specified for different types of evolutions (simple, reasoning, multi_context) and the documents loaded into the system. RAGAS uses a unique evolution-based synthetic data generation paradigm to ensure the highest quality and diversity of questions generated [1].
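
As a quick check (using the test_df from the snippet above), the evolution_type column shows how the generated rows follow the requested distribution:

print(test_df["evolution_type"].value_counts())  # roughly 50% simple, 25% reasoning, 25% multi_context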

Additionally, the correctness of the generated answers is evaluated using the AnswerCorrectness metric, which measures answer correctness compared to ground truth as a combination of factuality and semantic similarity [2].
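
A hedged sketch of that evaluation step, assuming the ragas 0.1.x evaluate API and a hypothetical single-row dataset with question, answer, and ground_truth columns:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness

eval_ds = Dataset.from_dict({
    "question": ["How do human activities contribute to climate change?"],             # hypothetical
    "answer": ["Mainly through greenhouse gas emissions from burning fossil fuels."],  # hypothetical
    "ground_truth": ["Human activities contribute primarily through greenhouse gas emissions."],
})
result = evaluate(eval_ds, metrics=[answer_correctness])
print(result)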

To continue talking to Dosu, mention @dosu.

adityamity commented 1 month ago

@dosu in the above answer I am not able to understand how ground_truth is generated. Is there any prompt that RAGAS is using to create the ground_truth? If yes, where exactly in the code?

dosubot[bot] commented 1 month ago

Yes, RAGAS uses specific prompts to create the ground_truth. The prompts are defined in the TestsetGenerator class and are used to generate answers based on the provided documents.

Location in the Code:

The TestsetGenerator class is defined in src/ragas/testset/generator.py. The class uses various prompts for generating ground_truth answers, including the generate_answer_prompt.

Here is the relevant code snippet:

@dataclass
class TestsetGenerator:
    generator_llm: BaseRagasLLM
    critic_llm: BaseRagasLLM
    embeddings: BaseRagasEmbeddings
    docstore: DocumentStore

    generate_answer_prompt: Prompt = field(default_factory=lambda: question_answering)

    async def generate_answer(self, question: str, chunks: t.List[LCDocument]) -> t.Any:
        assert self.llm is not None  # guard: the LLM must be initialized before generating answers
        text = "\n\n".join([chunk.page_content for chunk in chunks])
        output = await self.llm.generate(
            self.generate_answer_prompt.format(question=question, text=text)
        )
        return output.generations[0][0].text

Additional Prompts:

The TestsetGenerator class also uses other prompts for different types of questions. These prompts are part of the testset_prompts list used in the class:

testset_prompts = [
    reasoning_question_prompt,
    multi_context_question_prompt,
    conditional_question_prompt,
    compress_question_prompt,
    conversational_question_prompt,
    question_answer_prompt,
    keyphrase_extraction_prompt,
    seed_question_prompt,
    main_topic_extraction_prompt,
    find_relevant_context_prompt,
    question_rewrite_prompt,
    context_scoring_prompt,
    filter_question_prompt,
    evolution_elimination_prompt,
]

These prompts guide the generation of ground_truth answers by providing structured templates for the language models to follow [1][2].
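
To see one of these templates concretely, the prompt objects can be imported and inspected (import path assumed for ragas 0.1.x):

from ragas.testset.prompts import question_answer_prompt

print(question_answer_prompt.instruction)    # the instruction quoted later in this thread
print(len(question_answer_prompt.examples))  # number of few-shot examples guiding the answer/verdict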

To continue talking to Dosu, mention @dosu.

adityamity commented 1 month ago

@dosu which of these prompts is specifically used for ground_truth generation?

dosubot[bot] commented 1 month ago

The prompt specifically used for ground_truth generation in RAGAS is the long_form_answer_prompt [1].

To continue talking to Dosu, mention @dosu.

adityamity commented 1 month ago

@dosu : For my document, I found that for many questions the ground_truth value is coming as "The answer to given question is not present in context". After checking the code, I found that this ground_truth value comes from the below prompt. However, in my case the information is available in the context, but the verdict still gets a -1 value. How can I handle this scenario?

question_answer_prompt = Prompt(
    name="answer_formulate",
    instruction="""Answer the question using the information from the given context. Output verdict as '1' if answer is present '-1' if answer is not present in the context.""",
    output_format_instruction=get_json_format_instructions(AnswerFormat),
    examples=[
        {
            "context": """Climate change is significantly influenced by human activities, notably the emission of greenhouse gases from burning fossil fuels. The increased greenhouse gas concentration in the atmosphere traps more heat, leading to global warming and changes in weather patterns.""",
            "question": "How do human activities contribute to climate change?",
            "answer": AnswerFormat.parse_obj(
                {
                    "answer": "Human activities contribute to climate change primarily through the emission of greenhouse gases from burning fossil fuels. These emissions increase the concentration of greenhouse gases in the atmosphere, which traps more heat and leads to global warming and altered weather patterns.",
                    "verdict": "1",
                }
            ).dict(),
        },
        {
            "context": """The concept of artificial intelligence (AI) has evolved over time, but it fundamentally refers to machines designed to mimic human cognitive functions. AI can learn, reason, perceive, and, in some instances, react like humans, making it pivotal in fields ranging from healthcare to autonomous vehicles.""",
            "question": "What are the key capabilities of artificial intelligence?",
            "answer": AnswerFormat.parse_obj(
                {
                    "answer": "Artificial intelligence is designed to mimic human cognitive functions, with key capabilities including learning, reasoning, perception, and reacting to the environment in a manner similar to humans. These capabilities make AI pivotal in various fields, including healthcare and autonomous driving.",
                    "verdict": "1",
                }
            ).dict(),
        },
        {
            "context": """The novel "Pride and Prejudice" by Jane Austen revolves around the character Elizabeth Bennet and her family. The story is set in the 19th century in rural England and deals with issues of marriage, morality, and misconceptions.""",
            "question": "What year was 'Pride and Prejudice' published?",
            "answer": AnswerFormat.parse_obj(
                {
                    "answer": "The answer to given question is not present in context",
                    "verdict": "-1",
                }
            ).dict(),
        },
    ],
    input_keys=["context", "question"],
    output_key="answer",
    output_type="json",
    language="english",
)

dosubot[bot] commented 1 month ago

To handle the scenario where the ground_truth value is incorrectly coming as "The answer to given question is not present in context" even though the information is available in the context, you need to ensure that the context provided to the model is correctly formatted and comprehensive. Here is a possible approach to address this issue:

  1. Verify Context Completeness: Ensure that the context provided to the model includes all necessary information to answer the question.

  2. Check Prompt Template: Make sure the prompt template used to generate the answer is correctly structured and includes the context.

  3. Evaluate the Model's Output: Use the evaluation framework to check if the model's output aligns with the expected ground_truth.

Here is an example of how you can ensure the context is correctly passed and evaluated:

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import (
    Runnable,
    RunnableLambda,
    RunnableParallel,
    RunnablePassthrough,
)
from langchain_openai import ChatOpenAI

def llm_chain_factory() -> Runnable:
    # just LLM
    template = """Use the following pieces of context to answer the question at the end.
    If you don't know the answer, just say that you don't know, don't try to make up an answer.
    Use three sentences maximum and keep the answer as concise as possible.
    Always say "thanks for asking!" at the end of the answer.

    Context: {context}

    Question: {question}

    Helpful Answer:"""
    llm_prompt = PromptTemplate.from_template(template)

    # LLM
    llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

    # just llm pipeline
    just_llm = (
        {"question": RunnablePassthrough(), "context": RunnablePassthrough()}
        | llm_prompt
        | llm
        | StrOutputParser()
        | RunnableParallel(
            {
                "answer": RunnablePassthrough(),
                "contexts": RunnableLambda(lambda _: [""]),
            }
        )
    )

    return just_llm

@pytest.mark.e2e()
def test_langsmith_evaluate(langsmith_dataset):
    # setup
    just_llm = llm_chain_factory()

    from ragas.integrations.langsmith import evaluate
    from ragas.metrics import answer_correctness, context_recall

    # evaluate just llms
    _ = evaluate(
        dataset_name=langsmith_dataset,
        llm_or_chain_factory=just_llm,
        # experiment_name="just_llm",
        metrics=[answer_correctness, context_recall],
        verbose=True,
    )

In this example, the llm_chain_factory function is modified to ensure that both the question and context are passed to the prompt template. This ensures that the model has access to the necessary context to generate the correct answer.

Additionally, the answer_correctness and context_recall metrics from the ragas framework can be used to evaluate the correctness of the model's output. The context_recall metric measures the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth. This helps in identifying discrepancies and ensuring that the model's output aligns with the expected ground truth [1][2][3].

To continue talking to Dosu, mention @dosu.

EnesKutay commented 1 month ago

Hey, I was also having the same problem and I solved it by replacing the examples in the question_answer_prompt with examples I created from my own data, using the following code:

simple.question_answer_prompt.examples = custom_examples

Here, simple can be any of the evolution types, such as reasoning or multi_context.
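
A hedged sketch of what custom_examples might look like, mirroring the structure of the default question_answer_prompt examples quoted earlier in this thread (the content and the AnswerFormat import path are illustrative assumptions):

from ragas.testset.evolutions import simple
from ragas.testset.prompts import AnswerFormat  # import path assumed for ragas 0.1.x

custom_examples = [
    {
        "context": "The internal handbook states that support tickets must be answered within 24 hours.",  # hypothetical
        "question": "What is the required response time for support tickets?",
        "answer": AnswerFormat.parse_obj(
            {
                "answer": "Support tickets must be answered within 24 hours.",
                "verdict": "1",
            }
        ).dict(),
    },
]

simple.question_answer_prompt.examples = custom_examples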

adityamity commented 1 month ago

@dosu From where does RAGAS get contexts in the testset using TestsetGenerator? Also, as the contexts column contains a list of contexts, what is the length of this list and where is it defined?

dosubot[bot] commented 1 month ago

RAGAS gets contexts in the test set using the TestSetGenerator by loading documents and then generating synthetic Question/Context/Ground_Truth samples. The contexts are derived from the loaded documents and are used to create the test set.

Context Generation:

The contexts are generated based on the documents provided to the TestSetGenerator. The length of the context list and its definition are managed within the generate_with_langchain_docs or generate_with_llamaindex_docs methods.

Code Snippet:

Here is an example of how contexts are generated and included in the test set:

from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from llama_index import SimpleDirectoryReader

# Load documents
dir_path = "./prompt-engineering-papers"
reader = SimpleDirectoryReader(dir_path, num_files_limit=2)
documents = reader.load_data()

# Generator with OpenAI models
generator = TestsetGenerator.with_openai()

# Set question type distribution
distribution = {simple: 0.5, reasoning: 0.25, multi_context: 0.25}

# Generate test set
testset = generator.generate_with_llamaindex_docs(
    documents, test_size=10, distributions=distribution
)
test_df = testset.to_pandas()

Context List Length:

The length of the context list is determined by the specific evolution method used (e.g., simple, reasoning, multi_context). Each method may define how many contexts are included based on the complexity and type of question being generated.
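
As a quick sanity check (using the test_df from the snippet above), the actual number of context strings per generated row can be inspected directly:

print(test_df["contexts"].apply(len).value_counts())  # how many context strings each row carries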

Example of Context Handling:

In the TestsetGenerator class, the contexts are handled as follows:

@dataclass
class TestsetGenerator:
    # Other attributes and methods...

    async def generate_answer(self, question: str, chunks: t.List[LCDocument]) -> t.Any:
        text = "\n\n".join([chunk.page_content for chunk in chunks])
        output = await self.llm.generate(
            self.generate_answer_prompt.format(question=question, text=text)
        )
        return output.generations[0][0].text

In this method, chunks represent the list of contexts, and their length is determined by the specific evolution method and the document loading process.

Customizing Contexts:

You can customize the contexts by modifying the examples used in the prompts, as mentioned by another user:

simple.question_answer_prompt.examples = custom_examples

This allows you to tailor the context generation to better fit your specific data and requirements.

Additional Details:

The length of the list in the contexts column is defined by the test_size parameter in the generate method. Here is the relevant code snippet:

current_nodes = [
    CurrentNodes(root_node=n, nodes=[n])
    for n in self.docstore.get_random_nodes(k=test_size)
]

This code shows that test_size determines the number of contexts (nodes) retrieved from the docstore [1].

To continue talking to Dosu, mention @dosu.

adityamity commented 1 month ago

@dosu - The testset generated from the TestsetGenerator class contains the columns 'question', 'contexts', 'ground_truth', 'evolution_type', 'metadata', 'episode_done'. 1 - I want to know where the values in this 'contexts' column come from. 2 - As the 'contexts' column contains a list of contexts, how and where in the ragas code is the length of this list defined?

dosubot[bot] commented 1 month ago

The 'contexts' column values in the test set generated from the TestsetGenerator class come from the current_nodes list, which is populated by the self.docstore.get_random_nodes(k=test_size) method call. This method retrieves a random set of nodes from the document store.

The length of the 'contexts' list is defined by the test_size parameter passed to the generate method of the TestsetGenerator class. Specifically, the length is determined by the number of nodes retrieved, which is equal to the test_size value.

Here is the relevant code snippet:

current_nodes = [
    CurrentNodes(root_node=n, nodes=[n])
    for n in self.docstore.get_random_nodes(k=test_size)
]

This code indicates that the length of the current_nodes list, and consequently the 'contexts' column values, is directly defined by the test_size parameter [1].

To continue talking to Dosu, mention @dosu.

adityamity commented 1 month ago

@dosu In the below code, what is the use of the test_size parameter? I have given test_size=5, what does this mean?

generator = TestsetGenerator.from_llama_index(
    generator_llm=llm_model,
    critic_llm=llm_model,
    embeddings=embed_model,
)
testset = generator.generate_with_llamaindex_docs(
    nodes,
    test_size=5,
    distributions={
        "simple": 0.5,
        "multi_context": 0.4,
        "reasoning": 0.1
    }
)

dosubot[bot] commented 1 month ago

The test_size parameter in the generate_with_llamaindex_docs method specifies the number of test cases to generate. When you set test_size=5, it means that the method will generate 5 test cases.

Here is the relevant code snippet:

generator = TestsetGenerator.from_llama_index(
    generator_llm=llm_model,
    critic_llm=llm_model,
    embeddings=embed_model,
)
testset = generator.generate_with_llamaindex_docs(
    nodes,
    test_size=5,
    distributions={
        "simple": 0.5,
        "multi_context": 0.4,
        "reasoning": 0.1
    }
)

In this context, test_size=5 means that the method will randomly select 5 nodes from the provided nodes and use them to generate the test cases according to the specified distributions [1][2].
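
A small follow-up sketch (assuming the generation above completes) to confirm this:

test_df = testset.to_pandas()
print(len(test_df))  # 5 rows, one per generated test case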

To continue talking to Dosu, mention @dosu.

adityamity commented 1 month ago

@dosu is the above test_size also used when generating the list of contexts in the contexts column?

dosubot[bot] commented 1 month ago

Yes, the test_size parameter is also used when generating the list of contexts in the contexts column. Specifically, it is used to determine the number of random nodes to retrieve from the document store:

current_nodes = [
    CurrentNodes(root_node=n, nodes=[n])
    for n in self.docstore.get_random_nodes(k=test_size)
]

This code snippet shows that test_size is used as the argument k in the get_random_nodes method, which retrieves test_size number of random nodes from the document store. Additionally, the test_size parameter is also used in the generate_with_langchain_docs method to specify the number of test cases to generate:

testset = test_generator.generate_with_langchain_docs(
    documents=documents, test_size=10, distributions=distributions
)

This further confirms that test_size plays a crucial role in determining the number of contexts generated [1][2].

To continue talking to Dosu, mention @dosu.

dosubot[bot] commented 1 month ago

The contexts column is created in the _create_single_turn_prompt method of the RubricsScoreWithoutReference class. Here is the relevant code snippet:

def _create_single_turn_prompt(
    self, row: t.Dict
) -> SingleTurnWithoutReferenceInput:
    question, contexts, answer = (
        row["user_input"],
        row.get("retrieved_contexts"),
        row["response"],
    )
    if contexts:
        contexts = "\n".join(contexts)
        question = f"{question} answer using context: {contexts}"

    return SingleTurnWithoutReferenceInput(
        user_input=question,
        response=answer,
        rubrics=self.rubrics,
    )

Additionally, the contexts column is also referenced in the test_column_remap_omit function within the tests/unit/test_validation.py file. Here is the relevant code snippet:

def test_column_remap_omit():
    TEST_DATASET = Dataset.from_dict(
        {
            "query": [""],
            "answer": [""],
            "contexts": [[""]],
        }
    )

    column_map = {
        "question": "query",
        "contexts": "contexts",
        "answer": "answer",
    }

    remapped_dataset = remap_column_names(TEST_DATASET, column_map)
    assert remapped_dataset.column_names == ["question", "answer", "contexts"]

These snippets show where and how the contexts column is created and referenced in the code [1][2].

To continue talking to Dosu, mention @dosu.

adityamity commented 1 month ago

@dosu The testset generated from the TestsetGenerator class contains the columns 'question', 'contexts', 'ground_truth', 'evolution_type', 'metadata', 'episode_done'. Here the contexts column is generated from the nodes, but does ragas use any prompt for generating this contexts column?

dosubot[bot] commented 1 month ago

Yes, RAGAS uses a prompt for generating the contexts column in the test set created by the TestSetGenerator class. Specifically, the find_relevant_context_prompt is used for this purpose.

Here is the relevant prompt definition:

find_relevant_context_prompt = Prompt(
    name="find_relevant_context",
    instruction="Given a question and set of contexts, find the most relevant contexts to answer the question.",
    examples=[
        {
            "question": "What is the capital of France?",
            "contexts": [
                "1. France is a country in Western Europe. It has several cities, including Paris, Lyon, and Marseille. Paris is not only known for its cultural landmarks like the Eiffel Tower and the Louvre Museum but also as the administrative center.",
                "2. The capital of France is Paris. It is also the most populous city in France, with a population of over 2 million people. Paris is known for its cultural landmarks like the Eiffel Tower and the Louvre Museum.",
                "3. Paris is the capital of France. It is also the most populous city in France, with a population of over 2 million people. Paris is known for its cultural landmarks like the Eiffel Tower and the Louvre Museum.",
            ],
            "output": {
                "relevant_contexts": [1, 2],
            },
        },
        {
            "question": "How does caffeine affect the body and what are its common sources?",
            "contexts": [
                "1. Caffeine is a central nervous system stimulant. It can temporarily ward off drowsiness and restore alertness. It primarily affects the brain, where it alters the function of neurotransmitters.",
                "2. Regular physical activity is essential for maintaining good health. It can help control weight, combat health conditions, boost energy, and promote better sleep.",
                "3. Common sources of caffeine include coffee, tea, cola, and energy drinks. These beverages are consumed worldwide and are known for providing a quick boost of energy.",
            ],
            "output": {"relevant_contexts": [1, 2]},
        },
    ],
    input_keys=["question", "contexts"],
    output_key="output",
    output_type="json",
    language="english",
)

This prompt is used within the generate_datarow method of the TestSetGenerator class to generate the contexts column:

results = await self.generator_llm.generate(
    prompt=self.find_relevant_context_prompt.format(
        question=question, contexts=node_content
    )
)

This ensures that the most relevant contexts are selected based on the provided question [1][2].
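
For illustration, the same prompt can be formatted outside the generator with hypothetical inputs (the format call mirrors the snippet above; the import path is assumed for ragas 0.1.x):

from ragas.testset.prompts import find_relevant_context_prompt

prompt_value = find_relevant_context_prompt.format(
    question="What is the capital of France?",
    contexts=[
        "France is a country in Western Europe ...",
        "The capital of France is Paris ...",
    ],
)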

To continue talking to Dosu, mention @dosu.