Synthetic dataset generation produces undesired Introductory Phrases with Bedrock LLM

CosaroLisa commented 2 months ago

[x] I checked the documentation and related resources and couldn't find an answer to my question.

Your Question Hi, I'm trying to use this library for generating a synthetic dataset using the LLM and Bedrock embeddings, according to the section of the documentation titled: 'Using Amazon Bedrock'. What I'm noticing is that most of the generated questions have this form: 'Based on the feedback, here is a rewritten version of the question that incorporates the relevant context:..', or 'Based on the given context and keyphrase, here is a potential question:...', 'Based on the given context and keyphrase "X", a relevant question could be:.....'. Could someone help me understand how to avoid this behavior, so that I can obtain questions in a direct form, without these 'introductory parts'?

dosubot[bot] commented 2 months ago

Hello, @CosaroLisa! I'm here to assist you with any bugs, questions, or contributions while you wait for a human maintainer.

To avoid introductory phrases in the generated questions when using the Ragas library with Amazon Bedrock embeddings, you can leverage the built-in prompts that enforce rules against such phrases. Specifically, the reasoning_question_prompt, multi_context_question_prompt, and other prompts in the Ragas library include instructions to avoid phrases like "based on the provided context" or "according to the context."

Here is an example of how you might achieve this:

from ragas.testset import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings.base import LangchainEmbeddingsWrapper
from ragas.testset.extractor import KeyphraseExtractor
from langchain.text_splitter import TokenTextSplitter
from ragas.testset.docstore import InMemoryDocumentStore
from ragas.testset.prompts import reasoning_question_prompt, multi_context_question_prompt

# Wrap the Bedrock model and embeddings
bedrock_model = LangchainLLMWrapper(bedrock_model)
bedrock_embeddings = LangchainEmbeddingsWrapper(bedrock_embeddings)

# Initialize the document store
splitter = TokenTextSplitter(chunk_size=1000, chunk_overlap=100)
keyphrase_extractor = KeyphraseExtractor(llm=bedrock_model)
docstore = InMemoryDocumentStore(
    splitter=splitter,
    embeddings=bedrock_embeddings,
    extractor=keyphrase_extractor,
)

# Initialize the TestsetGenerator with specific prompts
test_generator = TestsetGenerator(
    generator_llm=bedrock_model,
    critic_llm=bedrock_model,
    embeddings=bedrock_embeddings,
    docstore=docstore,
    prompts=[reasoning_question_prompt, multi_context_question_prompt]
)

# Define the distributions for the test set
distributions = {simple: 0.5, reasoning: 0.25, multi_context: 0.25}

# Generate the test set
testset = test_generator.generate_with_langchain_docs(
    documents=documents, test_size=10, distributions=distributions
)

# Export the results into pandas
test_df = testset.to_pandas()
test_df.head()

This code snippet demonstrates how to use the built-in prompts in the Ragas library that already include rules to avoid introductory phrases, ensuring that the generated questions are clear and concise. Additionally, you can customize the LLM and embeddings by subclassing BaseRagasLLM and BaseRagasEmbeddings or using the LangchainLLMWrapper and LangchainEmbeddingsWrapper as needed ^[1]^[2]^[3].

To continue talking to Dosu, mention @dosu.

CosaroLisa commented 2 months ago

@dosu

TestsetGenerator(
    generator_llm=bedrock_model,
    critic_llm=bedrock_model,
    embeddings=bedrock_embeddings,
    docstore=docstore,
    prompts=[reasoning_question_prompt, multi_context_question_prompt]
)

I think passing prompts as argument in the TestsetGenerator constructor is not allowed as of now.

dosubot[bot] commented 2 months ago

To correctly implement the TestsetGenerator without passing prompts directly in the constructor, you can use the predefined prompts that already include instructions to avoid introductory phrases. Here's how you can achieve this:

Use the predefined prompts like reasoning_question_prompt and multi_context_question_prompt.
Ensure that these prompts are used internally within the TestsetGenerator.

Here's an example of how to set this up:

from ragas.testset import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings.base import LangchainEmbeddingsWrapper
from ragas.testset.extractor import KeyphraseExtractor
from langchain.text_splitter import TokenTextSplitter
from ragas.testset.docstore import InMemoryDocumentStore
from ragas.testset.prompts import reasoning_question_prompt, multi_context_question_prompt

# Wrap the Bedrock model and embeddings
bedrock_model = LangchainLLMWrapper(bedrock_model)
bedrock_embeddings = LangchainEmbeddingsWrapper(bedrock_embeddings)

# Initialize the document store
splitter = TokenTextSplitter(chunk_size=1000, chunk_overlap=100)
keyphrase_extractor = KeyphraseExtractor(llm=bedrock_model)
docstore = InMemoryDocumentStore(
    splitter=splitter,
    embeddings=bedrock_embeddings,
    extractor=keyphrase_extractor,
)

# Initialize the TestsetGenerator without passing prompts directly
test_generator = TestsetGenerator(
    generator_llm=bedrock_model,
    critic_llm=bedrock_model,
    embeddings=bedrock_embeddings,
    docstore=docstore
)

# Define the distributions for the test set
distributions = {simple: 0.5, reasoning: 0.25, multi_context: 0.25}

# Generate the test set
testset = test_generator.generate_with_langchain_docs(
    documents=documents, test_size=10, distributions=distributions
)

# Export the results into pandas
test_df = testset.to_pandas()
test_df.head()

This setup ensures that the TestsetGenerator uses the predefined prompts internally, which include rules to avoid introductory phrases, ensuring that the generated questions are clear and concise ^[1].

To continue talking to Dosu, mention @dosu.

CosaroLisa commented 2 months ago

After consulting the source code in ragas/testset/prompts.py, I noticed that some metaprompt instructions specify not to use the above introductory expression, but not all of them—for example, the seed_question_prompt or question_rewrite_prompt. This might be why I still get most of the questions beginning with 'Based on the feedback...' or similar sentences, can anyone help me out on that ?

dosubot[bot] commented 2 months ago

To ensure that the seed_question_prompt and question_rewrite_prompt avoid introductory phrases like "Based on the feedback...", you need to update their instructions explicitly. Here are the modified versions of these prompts:

seed_question_prompt = Prompt(
    name="seed_question",
    instruction="""Generate a question that can be fully answered from the given context. The question should be formed using the topic.
    Ensure that the question does not include introductory phrases like 'Based on the feedback...', 'According to the context...', etc.""",
    examples=[
        {
            "context": "Photosynthesis in plants involves converting light energy into chemical energy, using chlorophyll and other pigments to absorb light. This process is crucial for plant growth and the production of oxygen.",
            "keyphrase": "Photosynthesis",
            "question": "What is the role of photosynthesis in plant growth?",
        },
        {
            "context": "The Industrial Revolution, starting in the 18th century, marked a major turning point in history as it led to the development of factories and urbanization.",
            "keyphrase": "Industrial Revolution",
            "question": "How did the Industrial Revolution mark a major turning point in history?",
        },
        {
            "context": "The process of evaporation plays a crucial role in the water cycle, converting water from liquid to vapor and allowing it to rise into the atmosphere.",
            "keyphrase": "Evaporation",
            "question": "Why is evaporation important in the water cycle?",
        },
    ],
    input_keys=["context", "keyphrase"],
    output_key="question",
    output_type="str",
)

question_rewrite_prompt = Prompt(
    name="rewrite_question",
    instruction="""Given a context, question, and feedback, rewrite the question to improve its clarity and answerability based on the feedback provided.
    Ensure that the rewritten question does not include introductory phrases like 'Based on the feedback...', 'According to the context...', etc.""",
    examples=[
        {
            "context": "The Eiffel Tower was constructed using iron and was originally intended as a temporary exhibit for the 1889 World's Fair held in Paris. Despite its initial temporary purpose, the Eiffel Tower quickly became a symbol of Parisian ingenuity and an iconic landmark of the city, attracting millions of visitors each year. The tower's design, created by Gustave Eiffel, was initially met with criticism from some French artists and intellectuals, but it has since been celebrated as a masterpiece of structural engineering and architectural design.",
            "question": "Who created the design for the Tower?",
            "feedback": "The question asks about the creator of the design for 'the Tower', but it does not specify which tower it refers to. There are many towers worldwide, and without specifying the exact tower, the question is unclear and unanswerable. To improve the question, it should include the name or a clear description of the specific tower in question.",
            "output": "Who created the design for the Eiffel Tower?",
        },
        {
            "context": "'Exploring Zero-Shot Learning in Neural Networks' was published by Smith and Lee in 2021, focusing on the application of zero-shot learning techniques in artificial intelligence.",
            "question": "What datasets were used for the zero-shot evaluations in this study?",
        },
    ],
    input_keys=["context", "question", "feedback"],
    output_key="output",
    output_type="str",
)

These modifications ensure that the generated questions are clear and direct, avoiding any unnecessary introductory phrases ^[1]^[2].

To continue talking to Dosu, mention @dosu.

explodinggradients / ragas

Synthetic dataset generation produces undesired Introductory Phrases with Bedrock LLM #1179