explodinggradients / ragas

Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
https://docs.ragas.io
Apache License 2.0

How to prompt ragas data generation? #1077

Open atr-ip opened 3 months ago

atr-ip commented 3 months ago

Does anyone know how to prompt the data generation? I need the generated data to follow specific sentence structures. I tried reconfiguring the prompt itself, but it did not change the generated data sentence by sentence. Does anyone know how to implement this correctly?

        from ragas.llms.prompt import Prompt

        Prompt(
            name="question_generation",
            instruction="Generate a question for the given answer",
            examples=[
                {
                    "answer": "The last Olympics was held in Tokyo, Japan.",
                    "context": "The last Olympics was held in Tokyo, Japan. It is held every 4 years",
                    "output": {"question": "Where was the last Olympics held?"},
                },
                {
                    "answer": "It can change its skin color based on the temperature of its environment.",
                    "context": "A recent scientific study has discovered a new species of frog in the Amazon rainforest that has the unique ability to change its skin color based on the temperature of its environment.",
                    "output": {"question": "What unique ability does the newly discovered species of frog have?"},
                }
            ],
            input_keys=["answer", "context"],
            output_key="output",
            output_type="json",
        )

        generator.adapt(language=self.language, evolutions=[simple, reasoning, multi_context], cache_dir=".cache")

        generator.generate_with_llamaindex_docs(
            documents=documents,
            test_size=number_of_questions,
            distributions={
                simple: 0.5,
                reasoning: 0.25,
                multi_context: 0.25,
            },
            with_debugging_logs=True,
            raise_exceptions=False
        )
dosubot[bot] commented 3 months ago

Hey @atr-ip! I'm here to help you with any issues or questions you have. Let's get to the bottom of this together.

To ensure the generated data changes sentence by sentence using the ragas library, you need to create a custom Prompt object and use it in your evaluation metric. Here's how you can do it:

  1. Create a Custom Prompt Object: Define a new Prompt object with the desired instructions and examples.

    from ragas.llms.prompt import Prompt
    
    long_form_answer_prompt_new = Prompt(
       name="long_form_answer_new_v1",
       instruction="Create one or more statements from each sentence in the given answer.",
       examples=[
           {
               "question": "Which is the only planet in the solar system that has life on it?",
               "answer": "earth",
               "statements": {
                   "statements": [
                       "Earth is the only planet in the solar system that has life on it."
                   ]
               },
           },
           {
               "question": "Were Hitler and Benito Mussolini of the same nationality?",
               "answer": "Sorry, I can't provide an answer to that question.",
               "statements": {
                   "statements": []
               },
           },
       ],
       input_keys=["question", "answer"],
       output_key="statements",
       output_type="json",
    )
  2. Use the Custom Prompt in Evaluations: Replace the default prompt in the faithfulness metric with your custom prompt.

    from ragas.metrics import faithfulness
    
    faithfulness.long_form_answer_prompt = long_form_answer_prompt_new
    print(faithfulness.long_form_answer_prompt.to_string())
  3. Evaluate the Dataset: Use the evaluate function to evaluate your dataset with the new prompt configuration.

    from ragas import evaluate
    
    result = evaluate(
       dataset["train"].select(range(3)),  # selecting only 3
       metrics=[
           faithfulness
       ],
    )
    
    result

By following these steps, you ensure that the generated data changes sentence by sentence as specified in the custom prompt's instructions [1].

To continue talking to Dosu, mention @dosu.

atr-ip commented 3 months ago

This did not help, because I do not want to customize the prompts of the metrics, but the prompt used to generate the data itself.
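
To make it concrete, this is roughly what I am after. It is only a rough sketch: the Prompt arguments mirror my snippet above, but the last line is a hypothetical attribute, since I do not know where (or whether) the generator accepts a custom question-generation prompt, which is exactly my question.

    from ragas.llms.prompt import Prompt

    # The prompt I want the *testset generator* itself to use when it writes
    # questions, so every generated question follows my sentence structure.
    structured_question_prompt = Prompt(
        name="structured_question_generation",
        instruction=(
            "Generate a question for the given answer. "
            "Every question must start with 'Based on the context, ...'."
        ),
        examples=[
            {
                "answer": "The last Olympics was held in Tokyo, Japan.",
                "context": "The last Olympics was held in Tokyo, Japan. It is held every 4 years",
                "output": {"question": "Based on the context, where was the last Olympics held?"},
            },
        ],
        input_keys=["answer", "context"],
        output_key="output",
        output_type="json",
    )

    # Hypothetical -- I do not know which attribute (if any) the evolutions or
    # the generator expose for this; that is what I am asking:
    # simple.question_prompt = structured_question_prompt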

dosubot[bot] commented 3 months ago

To ensure that the generated data changes sentence by sentence according to specific sentence structures using the ragas library, you can use the TEXT_EXTRACT_KEYPHRASES, TEXT_GENERATE_QUESTIONS, and TEXT_GENERATE_ANSWERS prompts. These prompts are designed to extract key phrases, generate questions, and generate answers based on the given text, ensuring that the generated data adheres to specific sentence structures.

Here's an example of how you can implement this:

from ragas.llms.prompt import Prompt
from ragas.llms.output_parser import RagasoutputParser, get_json_format_instructions
from ragas.metrics._summarization import (
    TEXT_EXTRACT_KEYPHRASES,
    TEXT_GENERATE_QUESTIONS,
    TEXT_GENERATE_ANSWERS,
    SummarizationScore,
)

# Example text
text = """JPMorgan Chase & Co. is an American multinational finance company headquartered in New York City. It is the largest bank in the United States and the world's largest by market capitalization as of 2023. Founded in 1799, it is a major provider of investment banking services, with US$3.9 trillion in total assets, and ranked #1 in the Forbes Global 2000 ranking in 2023."""

# Define the prompts
extract_keyphrases_prompt = TEXT_EXTRACT_KEYPHRASES.format(text=text)
generate_questions_prompt = TEXT_GENERATE_QUESTIONS.format(
    text=text,
    keyphrases=[
        "JPMorgan Chase & Co.",
        "American multinational finance company",
        "headquartered in New York City",
        "largest bank in the United States",
        "world's largest bank by market capitalization",
        "founded in 1799",
        "major provider of investment banking services",
        "US$3.9 trillion in total assets",
        "ranked #1 in Forbes Global 2000 ranking",
    ],
)
generate_answers_prompt = TEXT_GENERATE_ANSWERS.format(
    summary="""JPMorgan Chase & Co., headquartered in New York City, is the largest bank in the US and the world's largest by market capitalization as of 2023. Founded in 1799, it offers extensive investment, private, asset management, and retail banking services, and has $3.9 trillion in assets, making it the fifth-largest bank globally. It operates the world's largest investment bank by revenue and was ranked #1 in the 2023 Forbes Global 2000.""",
    questions=[
        "Is JPMorgan Chase & Co. an American multinational finance company?",
        "Is JPMorgan Chase & Co. headquartered in New York City?",
        "Is JPMorgan Chase & Co. the largest bank in the United States?",
        "Is JPMorgan Chase & Co. the world's largest bank by market capitalization as of 2023?",
        "Was JPMorgan Chase & Co. founded in 1799?",
        "Is JPMorgan Chase & Co. a major provider of investment banking services?",
        "Does JPMorgan Chase & Co. have US$3.9 trillion in total assets?",
        "Was JPMorgan Chase & Co. ranked #1 in the Forbes Global 2000 ranking in 2023?",
    ],
)

# Assuming you have an LLM instance to generate the output
# llm = YourLLMInstance()
# keyphrases_response = llm.generate(extract_keyphrases_prompt)
# questions_response = llm.generate(generate_questions_prompt)
# answers_response = llm.generate(generate_answers_prompt)

# Parse the responses
# keyphrases = RagasoutputParser(pydantic_object=ExtractKeyphrasesResponse).parse(keyphrases_response)
# questions = RagasoutputParser(pydantic_object=GenerateQuestionsResponse).parse(questions_response)
# answers = RagasoutputParser(pydantic_object=GenerateAnswersResponse).parse(answers_response)

This code sets up the prompts for extracting key phrases, generating questions, and generating answers. You would need to use an LLM instance to generate the output and then parse the responses using RagasoutputParser. This approach ensures that the generated data changes sentence by sentence according to specific sentence structures [1].
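
For example, the last step could look roughly like the sketch below. This is only a sketch: it assumes the response models referenced in the commented-out code above can be imported from ragas.metrics._summarization and that a LangChain chat model is used as the LLM; adjust it to whatever LLM wrapper you actually use.

from langchain_openai import ChatOpenAI
from ragas.llms.output_parser import RagasoutputParser
from ragas.metrics._summarization import ExtractKeyphrasesResponse  # assumed import path

llm = ChatOpenAI(model="gpt-4o-mini")

# extract_keyphrases_prompt was built above with TEXT_EXTRACT_KEYPHRASES.format(text=text).
# to_string() assumes format() returns a PromptValue; pass it directly if it is already a str.
keyphrases_response = llm.invoke(extract_keyphrases_prompt.to_string()).content

keyphrases = RagasoutputParser(
    pydantic_object=ExtractKeyphrasesResponse
).parse(keyphrases_response)
print(keyphrases)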

To continue talking to Dosu, mention @dosu.

harvey1992 commented 2 months ago

@dosu is there a way to customize the prompt when using TestsetGenerator.with_openai() and generator.generate_with_langchain_docs()?
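
For context, this is roughly how I set it up at the moment (a sketch based on my reading of the 0.1.x docs; the import paths and the test_size value are just placeholders), and I can't see where a custom Prompt would plug in:

    from ragas.testset.generator import TestsetGenerator
    from ragas.testset.evolutions import simple, reasoning, multi_context

    # documents: a list of LangChain Documents loaded elsewhere
    generator = TestsetGenerator.with_openai()

    testset = generator.generate_with_langchain_docs(
        documents,
        test_size=10,  # placeholder size
        distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
    )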

jjmachan commented 2 months ago

do keep track of #1016; testset generation will be made more customizable there 🙂