What
One downstream application of LLMs is to auto-generate an initial list of question-answer pairs from a given scientific document. Currently, this is attempted through an existing codebase.
We want to evaluate the "quality" of questions that are generated from a given text document.
Why
We want a zero-shot (or maybe few-shot) out-of-the-box evaluation using LLMs such as GPT-3.5/4, Anthropic's Claude, or LLaMA in general.
The evaluation could help accelerate annotation for generative QA tasks.
How
We can use langchain to help achieve this with proper prompt engineering.
An initial dummy attempt with GPT-3.5/4 uses a simple prompt hack to evaluate only the questions (not the answers), such as the example below (a minimal langchain wiring of this prompt is sketched right after it):
You are an expert scientist grading the quality of questions from scientific documents.
You are given a scientific document text and a question and are asked to score the question as GOOD or BAD.
Example format:
DOCUMENT: document here
QUESTION: question here
GRADE: GOOD or BAD here
Grade the quality of the question based only on its factual accuracy with respect to the provided document text.
Grade simple factoid questions as BAD if they are "wh" questions (such as what, how, etc.) or are otherwise very simple.
Grade questions as GOOD if they are complex and can be answered from the given document. Begin!
DOCUMENT: "Menu National Snow and Ice Data Center NSIDC a part of CIRES at the University of Colorado Boulder Skip to main content Main navigation News & Analyses News & Stories Scientific Analyses About our Analyses Snow Today Greenland Today & Antarctic Ice Sheet Today Arctic Sea Ice News & Analysis (ASINA) Multimedia Data Explore Data Visualize Data Submit Data Submit NASA Data to NSIDC DAAC Submit Data to Other NSIDC Programs User Resources Get Started with Data Data Announcements Help Center Data Tools Documents Levels of service NASA Earthdata Forum Data Programs About our Programs NASA National Snow and Ice Data Center Distributed Active Archive Center (NSIDC DAAC) NOAA at NSIDC Exchange for Observations and Local Knowledge of the Arctic (ELOKA) Data Policies Our Research Learn What is the Cryosphere? Parts of the Cryosphere Arctic Weather & Climate Frozen Ground & Permafrost Glaciers Ice Sheets Ice Shelves Sea Ice Snow Ask a Scientist Cryosphere glossary About About NSIDC What we do Our People Published Research Our History Diversity, Equity & Inclusion Careers For the Media Contact Us Citation Policies Web Policy Land Acknowledgement Search News & Analyses News & Stories Scientific Analyses About our Analyses Snow Today Greenland Today & Antarctic Ice Sheet Today Arctic Sea Ice News & Analysis (ASINA) Multimedia Data Explore Data Visualize Data Submit Data Submit NASA Data to NSIDC DAAC Submit Data to Other NSIDC Programs User Resources Get Started with Data Data Announcements Help Center Data Tools Documents Levels of service NASA Earthdata Forum Data Programs About our Programs NASA National Snow and Ice Data Center Distributed Active Archive Center (NSIDC DAAC) NOAA at NSIDC Exchange for Observations and Local Knowledge of the Arctic (ELOKA) Data Policies Our Research Learn What is the Cryosphere? Parts of the Cryosphere Arctic Weather & Climate Frozen Ground"
QUESTION: What is NSIDC?
GRADE:
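As a rough sketch (not a final design), the prompt above could be wired through langchain's PromptTemplate/LLMChain API roughly like this; the abbreviated template, the zero-temperature OpenAI model, and the truncated document string are illustrative choices, and an OPENAI_API_KEY is assumed to be set:

# Minimal sketch: wrap the grading prompt above in a langchain chain.
# Assumes langchain's classic PromptTemplate/LLMChain/OpenAI API and
# that OPENAI_API_KEY is set in the environment.
from langchain.chains import LLMChain
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate

# Abbreviated version of the grading prompt shown above.
GRADING_TEMPLATE = """You are an expert scientist grading the quality of questions from scientific documents.
You are given a scientific document text and a question and are asked to score the question as GOOD or BAD.
Grade questions as GOOD if they are complex and can be answered from the given document; grade simple "wh" factoid questions as BAD. Begin!

DOCUMENT: {document}
QUESTION: {question}
GRADE:"""

prompt = PromptTemplate(
    input_variables=["document", "question"],
    template=GRADING_TEMPLATE,
)
chain = LLMChain(llm=OpenAI(temperature=0), prompt=prompt)

# Truncated stand-in for the scraped NSIDC page shown above.
document_text = "National Snow and Ice Data Center (NSIDC), a part of CIRES at the University of Colorado Boulder ..."
grade = chain.run(document=document_text, question="What is NSIDC?")
print(grade.strip())  # expected completion: "GOOD" or "BAD"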
To achieve this, we could have a new QA evaluation component like LangChainBasedQAEvaluator where we can provide prompt templates. Something like:
from evalem.evaluators import LangChainBasedQAEvaluator
qa_data = [dict(context=<CONTEXT>, question=<QUESTION>, answer=<ANSWER>), ...]
evaluator = LangChainBasedQAEvaluator(prompt=<PROMPT>, llm=<MAYBE_OPENAI>)
res = evaluator(qa_data, references)
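For illustration only, such an evaluator could wrap the chain above internally; since evalem's base evaluator interface isn't spelled out here, this is sketched as a standalone class with assumed names and signatures:

# Hypothetical sketch only: evalem's actual evaluator base class is not shown,
# so this standalone class just illustrates the idea.
from typing import List, Optional

from langchain.chains import LLMChain
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate


class LangChainBasedQAEvaluator:
    """Grades each (context, question) pair via a langchain LLM chain."""

    def __init__(self, prompt: PromptTemplate, llm=None) -> None:
        self.chain = LLMChain(llm=llm or OpenAI(temperature=0), prompt=prompt)

    def __call__(self, qa_data: List[dict], references: Optional[list] = None) -> List[str]:
        # references are unused in this question-only grading sketch
        return [
            self.chain.run(document=item["context"], question=item["question"]).strip()
            for item in qa_data
        ]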
Or, instead of an actual evaluator, this could just be a langchain-based metric that outputs 0 (BAD) or 1 (GOOD) and computes the GOOD-ness of the generated questions.
from evalem.metrics import LangChainBasedQuestionQualityMetric
inputs = [dict(context=<CONTEXT>, question=<QUESTION>, answer=<ANSWER>), ...]
metric = LangChainBasedQuestionQualityMetric(...)
res = metric(inputs, references)
...
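Whichever form we pick, the raw GOOD/BAD completions need to be mapped to scores; a small post-processing sketch (function names are illustrative, not existing evalem API):

from typing import List


def grade_to_score(grade: str) -> int:
    """Map a raw LLM completion to a binary score: GOOD -> 1, anything else -> 0."""
    return 1 if grade.strip().upper().startswith("GOOD") else 0


def goodness(grades: List[str]) -> float:
    """Fraction of generated questions graded GOOD."""
    scores = [grade_to_score(g) for g in grades]
    return sum(scores) / len(scores) if scores else 0.0


# e.g., grades returned by the chain/evaluator sketched above
print(goodness(["GOOD", "BAD", "GOOD"]))  # -> 0.666...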
cc: @muthukumaranR @xhagrg