What
One downstream application of LLMs is to auto-generate an initial list of question-answer pairs from a given scientific document. Currently, this is attempted through an existing codebase.
We want to evaluate the "quality" of questions that are generated from a given text document.
Why
We want a zero-shot (or maybe few-shot) out-of-the-box evaluation using LLMs such as GPT-3.5/4, Anthropic's Claude, or LLaMA in general.
The evaluation could help accelerate annotation for generative QA tasks.
How
We can use langchain to help achieve this with proper prompt engineering.
An initial dummy attempt with GPT-3.5/4 uses a simple prompt hack to evaluate only the questions (not the answers), such as the example below (a minimal langchain wiring of this prompt is sketched right after it):
You are an expert scientist grading the quality of questions from scientific documents.
You are given a scientific document text and a question and are asked to score the question as GOOD or BAD.
Example format:
DOCUMENT: document here
QUESTION: question here
GRADE: GOOD or BAD here
Grade the quality of the question based only on its factual accuracy with respect to the provided document text.
Grade simple factoid questions as BAD if they are "wh" questions (such as what, how, etc.) or are otherwise very simple.
Grade questions as GOOD if they are complex and can be answered from the given document. Begin!
DOCUMENT: "Menu National Snow and Ice Data Center NSIDC a part of CIRES at the University of Colorado Boulder Skip to main content Main navigation News & Analyses News & Stories Scientific Analyses About our Analyses Snow Today Greenland Today & Antarctic Ice Sheet Today Arctic Sea Ice News & Analysis (ASINA) Multimedia Data Explore Data Visualize Data Submit Data Submit NASA Data to NSIDC DAAC Submit Data to Other NSIDC Programs User Resources Get Started with Data Data Announcements Help Center Data Tools Documents Levels of service NASA Earthdata Forum Data Programs About our Programs NASA National Snow and Ice Data Center Distributed Active Archive Center (NSIDC DAAC) NOAA at NSIDC Exchange for Observations and Local Knowledge of the Arctic (ELOKA) Data Policies Our Research Learn What is the Cryosphere? Parts of the Cryosphere Arctic Weather & Climate Frozen Ground & Permafrost Glaciers Ice Sheets Ice Shelves Sea Ice Snow Ask a Scientist Cryosphere glossary About About NSIDC What we do Our People Published Research Our History Diversity, Equity & Inclusion Careers For the Media Contact Us Citation Policies Web Policy Land Acknowledgement Search News & Analyses News & Stories Scientific Analyses About our Analyses Snow Today Greenland Today & Antarctic Ice Sheet Today Arctic Sea Ice News & Analysis (ASINA) Multimedia Data Explore Data Visualize Data Submit Data Submit NASA Data to NSIDC DAAC Submit Data to Other NSIDC Programs User Resources Get Started with Data Data Announcements Help Center Data Tools Documents Levels of service NASA Earthdata Forum Data Programs About our Programs NASA National Snow and Ice Data Center Distributed Active Archive Center (NSIDC DAAC) NOAA at NSIDC Exchange for Observations and Local Knowledge of the Arctic (ELOKA) Data Policies Our Research Learn What is the Cryosphere? Parts of the Cryosphere Arctic Weather & Climate Frozen Ground"
QUESTION: What is NSIDC?
GRADE:
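As a rough sketch (not a final design), the prompt above could be wired through langchain's PromptTemplate/LLMChain API roughly like this; the abbreviated template, the zero-temperature OpenAI model, and the truncated document string are illustrative choices, and an OPENAI_API_KEY is assumed to be set:

# Minimal sketch: wrap the grading prompt above in a langchain chain.
# Assumes langchain's classic PromptTemplate/LLMChain/OpenAI API and
# that OPENAI_API_KEY is set in the environment.
from langchain.chains import LLMChain
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate

# Abbreviated version of the grading prompt shown above.
GRADING_TEMPLATE = """You are an expert scientist grading the quality of questions from scientific documents.
You are given a scientific document text and a question and are asked to score the question as GOOD or BAD.
Grade questions as GOOD if they are complex and can be answered from the given document; grade simple "wh" factoid questions as BAD. Begin!

DOCUMENT: {document}
QUESTION: {question}
GRADE:"""

prompt = PromptTemplate(
    input_variables=["document", "question"],
    template=GRADING_TEMPLATE,
)
chain = LLMChain(llm=OpenAI(temperature=0), prompt=prompt)

# Truncated stand-in for the scraped NSIDC page shown above.
document_text = "National Snow and Ice Data Center (NSIDC), a part of CIRES at the University of Colorado Boulder ..."
grade = chain.run(document=document_text, question="What is NSIDC?")
print(grade.strip())  # expected completion: "GOOD" or "BAD"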
To achieve this, we could have a new QA evaluation component like LangChainBasedQAEvaluator where we can provide prompt templates. Something like:
from evalem.evaluators import LangChainBasedQAEvaluator
qa_data = [dict(context=<CONTEXT>, question=<QUESTION>, answer=<ANSWER>), ...]
evaluator = LangChainBasedQAEvaluator(prompt=<PROMPT>, llm=<MAYBE_OPENAI>)
res = evaluator(qa_data, references)
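For illustration only, such an evaluator could wrap the chain above internally; since evalem's base evaluator interface isn't spelled out here, this is sketched as a standalone class with assumed names and signatures:

# Hypothetical sketch only: evalem's actual evaluator base class is not shown,
# so this standalone class just illustrates the idea.
from typing import List, Optional

from langchain.chains import LLMChain
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate


class LangChainBasedQAEvaluator:
    """Grades each (context, question) pair via a langchain LLM chain."""

    def __init__(self, prompt: PromptTemplate, llm=None) -> None:
        self.chain = LLMChain(llm=llm or OpenAI(temperature=0), prompt=prompt)

    def __call__(self, qa_data: List[dict], references: Optional[list] = None) -> List[str]:
        # references are unused in this question-only grading sketch
        return [
            self.chain.run(document=item["context"], question=item["question"]).strip()
            for item in qa_data
        ]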
Or, instead of an actual evaluator, this could just be a langchain-based metric that outputs 0 (BAD) or 1 (GOOD) and computes the GOOD-ness of the generated questions.
from evalem.metrics import LangChainBasedQuestionQualityMetric
inputs = [dict(context=<CONTEXT>, question=<QUESTION>, answer=<ANSWER>), ...]
metric = LangChainBasedQuestionQualityMetric(...)
res = metric(inputs, references)
...
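Whichever form we pick, the raw GOOD/BAD completions need to be mapped to scores; a small post-processing sketch (function names are illustrative, not existing evalem API):

from typing import List


def grade_to_score(grade: str) -> int:
    """Map a raw LLM completion to a binary score: GOOD -> 1, anything else -> 0."""
    return 1 if grade.strip().upper().startswith("GOOD") else 0


def goodness(grades: List[str]) -> float:
    """Fraction of generated questions graded GOOD."""
    scores = [grade_to_score(g) for g in grades]
    return sum(scores) / len(scores) if scores else 0.0


# e.g., grades returned by the chain/evaluator sketched above
print(goodness(["GOOD", "BAD", "GOOD"]))  # -> 0.666...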
cc: @muthukumaranR @xhagrg