explodinggradients / ragas

Supercharge Your LLM Application Evaluations 🚀
https://docs.ragas.io
Apache License 2.0

Critic LLM in dataset generation #1213

Closed · knpunk closed this issue 2 months ago

knpunk commented 2 months ago

@dosu I had a PDF from which I needed to make 50 triads of question, context, and ground truth for testing and POC purposes. While generating them, the code uses a generator LLM as well as a critic LLM. I assume the generator LLM produces all the content in the triads, but where and how does the critic LLM come in?

dosubot[bot] commented 2 months ago

The Critic LLM functions as a quality control mechanism in the process of generating question, context, and ground truth triads from a PDF. It evaluates the generated questions to ensure they meet a certain quality standard before they are used further in the pipeline.

Here's how it works:

  1. Generation: The generation LLM creates the initial questions based on the content of the PDF.
  2. Evaluation: The Critic LLM then evaluates these questions using a predefined prompt to ensure they meet quality standards. If a question does not pass, it is either modified or discarded.

The relevant code snippet for the Critic LLM evaluation is:

async def critic_question(self, question: str) -> bool:
    assert self.llm is not None, "LLM is not initialized"
    # Ask the critic LLM to score the generated question against the critic prompt
    output = await self.llm.generate(critic_question.format(question=question))
    # The critic returns a JSON object of rubric scores for the question
    output = json.loads(output.generations[0][0].text)
    # Accept the question only if every score meets the threshold
    return all(score >= 2 for score in output.values())

If the Critic LLM rejects a question, it logs a warning and the question is either modified or discarded. The modify_question function is used to adjust the question according to specified style and length parameters, making it more likely to be accepted by the Critic LLM.
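A minimal sketch of that accept/modify/discard loop might look like the following (the ensure_question_quality, rewrite, and max_retries names here are illustrative, not ragas' exact internals):

import logging
from typing import Awaitable, Callable, Optional

logger = logging.getLogger(__name__)

async def ensure_question_quality(
    question: str,
    critic: Callable[[str], Awaitable[bool]],   # e.g. the critic_question method above
    rewrite: Callable[[str], Awaitable[str]],   # e.g. a modify_question-style rewriter
    max_retries: int = 3,
) -> Optional[str]:
    # Keep asking the critic; rewrite the question after each rejection.
    for _ in range(max_retries):
        if await critic(question):
            return question  # accepted as-is
        logger.warning("Critic rejected question, rewriting: %s", question)
        question = await rewrite(question)
    return None  # discard after repeated rejections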

This integration ensures that only high-quality questions are generated and used in the pipeline [1][2].

To continue talking to Dosu, mention @dosu.

knpunk commented 2 months ago

@dosu So if I'm using GPT-3.5 Turbo as the generation model, can I use the same model as the critic? Or does it need to be better?

dosubot[bot] commented 2 months ago

Yes, you can use GPT-3.5 Turbo as both the generation model and the critic. The documentation does not specify that the critic needs to be a better model. In the provided example, different models are used for generation and criticism, but that is not stated as a requirement.

generator_llm = ChatOpenAI(model="gpt-3.5-turbo-16k")
critic_llm = ChatOpenAI(model="gpt-4")

You can replace critic_llm with gpt-3.5-turbo-16k if you prefer to use the same model for both roles [1].
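For reference, a minimal end-to-end setup along those lines might look like this (a sketch assuming the ragas 0.1 TestsetGenerator.from_langchain API and LangChain's PyPDFLoader; adjust imports to your installed versions):

from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.testset.evolutions import multi_context, reasoning, simple
from ragas.testset.generator import TestsetGenerator

# Load the PDF into LangChain documents
documents = PyPDFLoader("your_document.pdf").load()

# Same model for both roles; swap critic_llm to a stronger model for a stricter critic
generator_llm = ChatOpenAI(model="gpt-3.5-turbo-16k")
critic_llm = ChatOpenAI(model="gpt-3.5-turbo-16k")
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(generator_llm, critic_llm, embeddings)
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=50,  # the 50 triads mentioned above
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)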

To continue talking to Dosu, mention @dosu.

knpunk commented 2 months ago

@dosu Last thing: can you explain the whole process through a couple of detailed examples of how both LLMs generate the triads? Just take any two example triads and walk me through the procedure.

dosubot[bot] commented 2 months ago

Sure, let's walk through the process of generating question, context, and ground truth triads using both the generation LLM and the Critic LLM with two example triads.

Example 1

Step 1: Generation

  1. Input PDF Content: Let's assume the PDF content is about "The impact of AI on healthcare."
  2. Generation LLM: The generation LLM reads the PDF and generates the following triad:
    • Question: "How has AI improved diagnostic accuracy in healthcare?"
    • Contexts: ["AI algorithms analyze medical images with higher accuracy than human radiologists.", "Machine learning models predict patient outcomes more reliably."]
    • Answer: "AI has improved diagnostic accuracy by analyzing medical images and predicting patient outcomes with higher reliability."
    • Ground Truth: "AI algorithms analyze medical images with higher accuracy and predict patient outcomes more reliably."

Step 2: Critic Evaluation

  1. Critic LLM: The Critic LLM evaluates the generated question based on predefined criteria such as coherence, correctness, and conciseness.
  2. Evaluation Result: The Critic LLM finds the question to be coherent, correct, and concise, so it accepts the triad.
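To make the critic step concrete, its JSON verdict for an accepted question might look roughly like this (the rubric keys and the 1–3 scale are illustrative; the critic prompt in your ragas version defines the exact fields):

import json

# Hypothetical critic response for the Example 1 question
critic_response = '{"independence": 3, "clear_intent": 2}'

scores = json.loads(critic_response)
print(all(score >= 2 for score in scores.values()))  # True -> the triad is kept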

Example 2

Step 1: Generation

  1. Input PDF Content: Let's assume the PDF content is about "The role of AI in financial markets."
  2. Generation LLM: The generation LLM reads the PDF and generates the following triad:
    • Question: "What are the benefits of AI in stock trading?"
    • Contexts: ["AI models can predict stock price movements with high accuracy.", "Automated trading systems execute trades faster than human traders."]
    • Answer: "AI benefits stock trading by predicting price movements and executing trades quickly."
    • Ground Truth: "AI models predict stock price movements accurately and execute trades faster."

Step 2: Critic Evaluation

  1. Critic LLM: The Critic LLM evaluates the generated question.
  2. Evaluation Result: The Critic LLM finds the question to be too broad and lacking specificity. It rejects the question and logs a warning.

Step 3: Modification

  1. Modify Question: The modify_question function adjusts the question to be more specific.
    • Original Question: "What are the benefits of AI in stock trading?"
    • Modified Question: "How do AI models improve the accuracy of stock price predictions?"
  2. Re-evaluation: The Critic LLM re-evaluates the modified question and finds it acceptable.
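Continuing with hypothetical scores, the rejection and rewrite in Example 2 would play out roughly like this:

import json

# Hypothetical critic response for the overly broad original question
first_pass = json.loads('{"independence": 2, "clear_intent": 1}')

if not all(score >= 2 for score in first_pass.values()):
    # Rejected -> rewrite the question to be more specific and score it again
    question = "How do AI models improve the accuracy of stock price predictions?"
    second_pass = json.loads('{"independence": 3, "clear_intent": 3}')
    print(all(score >= 2 for score in second_pass.values()))  # True after modification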

Final Output

The final accepted triads are stored in a TestDataset object, which can be converted to a pandas DataFrame or a Hugging Face Dataset for further use.

from dataclasses import dataclass
from typing import List

import pandas as pd
from datasets import Dataset

@dataclass
class TestDataset:
    test_data: List[DataRow]  # DataRow holds one question/contexts/ground-truth record

    def to_pandas(self) -> pd.DataFrame:
        return pd.DataFrame.from_records(self._to_records())

    def to_dataset(self) -> Dataset:
        return Dataset.from_list(self._to_records())
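For example, assuming testset is the object returned by the generator:

df = testset.to_pandas()      # one row per generated triad
hf_ds = testset.to_dataset()  # Hugging Face Dataset for downstream evaluation
print(df.head())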

This process ensures that only high-quality triads are generated and used for testing and POC purposes [1][2][3][4].

To continue talking to Dosu, mention @dosu.

knpunk commented 2 months ago

@dosu How is it generating the answer? More importantly, since we're using the same model for generation, shouldn't the answer and ground truth be the same?

dosubot[bot] commented 2 months ago

For the Answer Relevancy metric, the LLM reverse-engineers 'n' variants of the question from the generated answer. The mean cosine similarity between these generated questions and the original question is then calculated to determine the answer relevancy.

The answer and ground truth are not necessarily the same because the Answer Relevancy metric does not compare the generated answer directly with the ground truth answer. Instead, it assesses how well the generated answer addresses the original question by generating new questions from the answer and comparing them to the original question.

If you want to compare the generated answer with the ground truth answer, you should refer to the "answer_correctness" metric, which evaluates the accuracy of the generated answer by comparing it to the ground truth. This metric considers both semantic and factual similarities to provide a comprehensive correctness score: statements in the answer are classified as true positive (TP), false positive (FP), or false negative (FN) based on their presence and support in the ground truth, providing a detailed analysis of the answer's correctness [1][2][3].
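If it helps, a minimal evaluation call along those lines might look like this (a sketch assuming ragas' evaluate API and a dataset with question, answer, contexts, and ground_truth columns):

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness, answer_relevancy

data = {
    "question": ["How has AI improved diagnostic accuracy in healthcare?"],
    "answer": ["AI has improved diagnostic accuracy by analyzing medical images and predicting patient outcomes."],
    "contexts": [["AI algorithms analyze medical images with higher accuracy than human radiologists."]],
    "ground_truth": ["AI algorithms analyze medical images with higher accuracy and predict patient outcomes more reliably."],
}

result = evaluate(Dataset.from_dict(data), metrics=[answer_relevancy, answer_correctness])
print(result)  # per-metric scores, e.g. {'answer_relevancy': ..., 'answer_correctness': ...}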

To continue talking to Dosu, mention @dosu.

knpunk commented 2 months ago

@dosu Didn't quite get the last one, but I got the gist of how it's working. I'll try connecting with @shahules786 about this confusion. Thanks :)