Closed knpunk closed 2 months ago
The Critic LLM functions as a quality control mechanism in the process of generating question, context, and ground truth triads from a PDF. It evaluates the generated questions to ensure they meet a certain quality standard before they are used further in the pipeline.
Here's how it works. The relevant code snippet for the Critic LLM evaluation is:
```python
async def critic_question(self, question: str) -> bool:
    assert self.llm is not None, "LLM is not initialized"
    output = await self.llm.generate(critic_question.format(question=question))
    output = json.loads(output.generations[0][0].text)
    return all(score >= 2 for score in output.values())
```
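The acceptance rule in the last line can be illustrated in isolation. Here is a minimal sketch of how the critic's JSON verdict is parsed and gated; the rubric keys below are hypothetical, since the real scoring dimensions are defined by the critic prompt:

```python
import json

def accept_question(raw_critic_output: str, threshold: int = 2) -> bool:
    """Parse the critic LLM's JSON verdict and accept the question
    only if every score meets the threshold (mirrors the snippet above)."""
    scores = json.loads(raw_critic_output)
    return all(score >= threshold for score in scores.values())

# Hypothetical critic verdicts -- the actual rubric keys come from the prompt.
print(accept_question('{"independence": 3, "clear_intent": 2}'))  # True
print(accept_question('{"independence": 1, "clear_intent": 3}'))  # False
```

A single low score is enough to reject the question, which is why the critic acts as a strict filter rather than an averaging judge.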
If the Critic LLM rejects a question, a warning is logged and the question is either modified or discarded. The `modify_question` function adjusts the question according to the specified style and length parameters, making it more likely to pass the critic on the next attempt.
This integration ensures that only high-quality questions are generated and used in the pipeline [1][2].
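The generate–critique–modify flow described above can be sketched as a simple filter loop. Everything below is a simplified stand-in: `generate_question`, `critique`, and `modify_question` here are illustrative stubs, not the library's actual functions:

```python
import logging
from typing import Optional

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("testset")

# All three helpers below are stubs for illustration, not the library's code.
def generate_question(context: str) -> str:
    return f"What does the passage about {context!r} describe?"

def critique(question: str) -> bool:
    # Stand-in critic: reject questions shorter than five words.
    return len(question.split()) >= 5

def modify_question(question: str) -> str:
    # Stand-in rewrite step that makes a rejected question more specific.
    return question + " Answer using only the given context."

def make_question(context: str, max_retries: int = 2) -> Optional[str]:
    question = generate_question(context)
    for _ in range(max_retries):
        if critique(question):
            return question                      # accepted by the critic
        log.warning("Critic rejected question: %r", question)
        question = modify_question(question)
    return None                                  # discarded after retries

print(make_question("vector databases"))
```

The key design point survives the simplification: a question only reaches the final test set once the critic accepts it, and rejected questions get a bounded number of rewrite attempts before being dropped.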
To continue talking to Dosu, mention @dosu.
@dosu So if I'm using GPT 3.5 turbo for generation model, can I use the same as a critic? or does it need to be better?
Yes, you can use GPT-3.5 Turbo as both the generation model and the critic. The document does not specify that the critic needs to be a better model. In the provided example, different models are used for generation and criticism, but it is not stated as a requirement.
```python
generator_llm = ChatOpenAI(model="gpt-3.5-turbo-16k")
critic_llm = ChatOpenAI(model="gpt-4")
```

You can point `critic_llm` at `gpt-3.5-turbo-16k` as well if you prefer to use the same model for both roles [1].
@dosu Last thing, can you explain the whole process through a series of detailed examples on how both llms are generating a triads, just take any two example triads to explain me the procedure of those LLMs.
Sure, let's walk through the process of generating question, context, and ground truth triads using both the generation LLM and the Critic LLM with two example triads.
When the critic rejects a generated question, the `modify_question` function adjusts it to be more specific. The final accepted triads are stored in a `TestDataset` object, which can be converted to a pandas DataFrame or a Hugging Face Dataset for further use.
```python
from dataclasses import dataclass
from typing import List

import pandas as pd
from datasets import Dataset

@dataclass
class TestDataset:
    test_data: List[DataRow]

    def to_pandas(self) -> pd.DataFrame:
        return pd.DataFrame.from_records(self._to_records())

    def to_dataset(self) -> Dataset:
        return Dataset.from_list(self._to_records())
```
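The `_to_records` helper referenced by both methods is not shown above; conceptually it flattens each `DataRow` into a plain dict. A rough stdlib-only stand-in with a simplified row type (the field names here are illustrative, not necessarily the library's):

```python
from dataclasses import dataclass, asdict, field
from typing import List

@dataclass
class DataRow:
    # Simplified triad row -- the real DataRow carries more fields.
    question: str
    contexts: List[str]
    ground_truth: str

@dataclass
class TestDataset:
    test_data: List[DataRow] = field(default_factory=list)

    def _to_records(self) -> List[dict]:
        # One flat dict per triad, ready for DataFrame/Dataset constructors.
        return [asdict(row) for row in self.test_data]

ds = TestDataset([DataRow("What is X?", ["X is ..."], "X is a thing.")])
print(ds._to_records())
```

Once the triads are in record (list-of-dicts) form, handing them to `pd.DataFrame.from_records` or `Dataset.from_list` is a one-liner, which is exactly what the two conversion methods do.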
This process ensures that only high-quality triads are generated and used for testing and POC purposes [1][2][3][4].
@dosu how is it generating the answer? More importantly, since we're using the same model for generation, shouldn't the answer and ground truth be same?
To evaluate a generated answer, the Answer Relevancy metric reverse-engineers 'n' variants of the question from that answer using a Large Language Model (LLM). The mean cosine similarity between these generated questions and the original question is then calculated to determine the answer relevancy.

The answer and ground truth are not necessarily the same, because the Answer Relevancy metric does not compare the generated answer directly with the ground truth answer. Instead, it assesses how well the generated answer addresses the original question by generating new questions from the answer and comparing them to the original question.

If you want to compare the generated answer with the ground truth answer, you should refer to the "answer_correctness" metric, which evaluates the accuracy of the generated answer by comparing it to the ground truth. This metric considers both semantic and factual similarities to provide a comprehensive correctness score. It classifies statements in the answer as true positive (TP), false positive (FP), or false negative (FN) based on their presence and support in the ground truth, providing a detailed analysis of the answer's correctness [1][2][3].
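The "mean cosine similarity" step can be made concrete with a toy sketch. The 3-dimensional embedding vectors below are fabricated purely for illustration; a real system would obtain them from an embedding model:

```python
import math

def cosine(a, b):
    # Standard cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Embedding of the original question (fabricated values for illustration).
original = [0.9, 0.1, 0.0]
# Embeddings of 'n' questions reverse-engineered from the generated answer.
variants = [[0.8, 0.2, 0.1], [0.7, 0.3, 0.0], [0.85, 0.1, 0.05]]

relevancy = sum(cosine(original, v) for v in variants) / len(variants)
print(round(relevancy, 3))  # a value near 1.0 means the answer stays on-topic
```

If the answer drifted off-topic, the questions reconstructed from it would embed far from the original question, and the mean similarity (and thus the relevancy score) would drop.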
@dosu didn't quite get the last one but I got a little gist of how it's working, I'll try connecting with @shahules786 for this confusion. Thanks :)
@dosu I had a PDF out of which I needed to make 50 triads of question, context, and ground truth for testing and POC purposes. While generating them, within the code, there is a generation LLM as well as a Critic LLM. I suppose the generator LLM is generating all the content in the triads, but then where and how does the Critic LLM come in?