RyderSwanson opened 2 weeks ago
## Objective

The objective of this research is to identify methods for assessing the factual accuracy of responses generated by Large Language Models (LLMs). This includes exploring automated fact-checking tools and APIs, investigating the integration of knowledge graphs for verification, and developing a methodology for combining these approaches to create a reliable factual accuracy scoring system.
## 1. Introduction

LLMs such as GPT-3, GPT-4, and similar models can generate human-like responses, but those responses are not always factually correct. Ensuring that they align with verified information is essential for applications in critical fields such as healthcare, education, and legal advice.
This report explores methods to assess and verify the factual accuracy of LLM outputs through automated tools, APIs, and knowledge graphs, and suggests a plan to integrate these approaches into a scoring system for factual accuracy.
## 2. Automated Fact-Checking Tools and Techniques

### Overview of Fact-Checking Techniques
Fact-checking involves verifying the correctness of a claim against established facts or reference datasets. The goal of automated fact-checking is to use computational methods to analyze text and validate statements without requiring human intervention.
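At its simplest, the matching step described above can be sketched in Python. This is a deliberately naive illustration that compares a claim against a tiny hand-written reference set using surface string similarity; real systems use semantic NLP models rather than character-level matching, and the reference statements here are invented for the example.

```python
from difflib import SequenceMatcher

# Tiny reference set of verified statements (illustrative only).
REFERENCE_FACTS = [
    "Paris is the capital of France.",
    "Water boils at 100 degrees Celsius at sea level.",
]

def naive_fact_check(claim: str, threshold: float = 0.8) -> bool:
    """Return True if the claim closely matches any verified statement.

    Surface similarity is only a stand-in here: production systems
    use semantic models, claim detection, and evidence retrieval.
    """
    return any(
        SequenceMatcher(None, claim.lower(), fact.lower()).ratio() >= threshold
        for fact in REFERENCE_FACTS
    )
```

The key design point the sketch captures is that automated fact-checking reduces to retrieving candidate reference facts and scoring the claim against them; every tool discussed below elaborates on one or both of those steps.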
### Automated Fact-Checking Tools

Several tools and APIs exist for automated fact-checking. Some of the key tools are:
### Natural Language Processing Techniques for Fact-Checking
## 3. Knowledge Graphs for Fact Verification

### Role of Knowledge Graphs

Knowledge graphs represent facts as structured data in the form of entities, relationships, and attributes. These graphs are an invaluable resource for fact verification because they can provide accurate, contextually relevant information to validate claims.
### Popular Knowledge Graphs
### Using Knowledge Graphs for Verification

- **Entity Linking:** LLM responses often contain entities such as people, places, dates, and concepts. These entities can be linked to a knowledge graph to verify whether they correspond to known facts. For instance, if an LLM generates a response about a historical event, entity linking can check whether the event's date and participants align with the facts stored in the graph.
- **Querying Knowledge Graphs:** Knowledge graphs can be queried using languages such as SPARQL to retrieve specific facts and verify the correctness of an LLM response.
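The entity-linking and querying steps can be sketched with a toy in-memory graph of subject–predicate–object triples. This is a minimal sketch, not a real integration: the triples are hand-written stand-ins, and a real deployment would resolve entities against a live endpoint such as Wikidata's SPARQL service rather than a Python set.

```python
from typing import Optional

# Minimal in-memory knowledge graph of (subject, predicate, object) triples.
# Entities and values here are illustrative stand-ins for real KG records.
KG = {
    ("Ada Lovelace", "birth_date", "1815-12-10"),
    ("Ada Lovelace", "occupation", "mathematician"),
}

def link_entity(mention: str) -> Optional[str]:
    """Link a textual mention to a KG subject by exact name match.

    A production entity linker would handle aliases, spelling
    variants, and disambiguation; exact matching keeps this minimal.
    """
    subjects = {s for (s, _, _) in KG}
    return mention if mention in subjects else None

def verify_fact(entity: str, predicate: str, value: str) -> bool:
    """Check whether the triple is asserted in the graph.

    Against a live Wikidata endpoint this would be a SPARQL ASK query,
    e.g.: ASK { wd:Q7259 wdt:P569 ?dob } with a date comparison.
    """
    return (entity, predicate, value) in KG
```

The separation matters: linking answers "which node does this mention refer to?", while verification answers "does the graph assert this fact about that node?" — a response can fail at either stage.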
### Challenges with Knowledge Graph Integration
## 4. Combining Methods for Factual Accuracy Scoring

### Approach for Combining Fact-Checking and Knowledge Graphs

To ensure a robust system for factual accuracy verification, a multi-step approach is recommended:
1. **Initial Entity Matching and Verification:** LLM-generated entities (names, dates, places) are first checked against knowledge graphs such as Wikidata. If the entity is found, the system validates the fact associated with it (e.g., a person's birthdate).
2. **Textual Fact-Checking Using APIs:** For broader factual claims (e.g., "the population of France is X million"), fact-checking services such as PolitiFact or Full Fact are queried to determine whether such claims have been previously fact-checked and found accurate.
3. **Semantic Similarity and Textual Entailment:** If no direct fact-check is available for a given response, semantic similarity techniques can be used to measure how close the generated response is to verified information in the knowledge graph or API. Textual entailment is employed to infer whether the generated response logically follows from known facts.
4. **Factual Accuracy Scoring:** Scores from each stage (entity matching, API fact-checking, and textual entailment) are weighted and combined into a final factual accuracy score.

**Metric Definitions:** Precision and recall can be calculated based on how well the system identifies correct and incorrect claims. A threshold is set to determine acceptable factual accuracy scores for different use cases.

### Final Pipeline for Factual Accuracy
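The weighted combination and the precision/recall metrics described in the scoring step can be sketched as follows. The weights (0.4, 0.35, 0.25) are placeholder values, not recommendations; in practice they would be tuned on a labeled evaluation set.

```python
def factual_accuracy_score(
    entity_score: float,
    api_score: float,
    entailment_score: float,
    weights: tuple = (0.4, 0.35, 0.25),  # placeholder weights, to be tuned
) -> float:
    """Combine per-stage scores (each in [0, 1]) into a weighted final score."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    w_entity, w_api, w_entail = weights
    return w_entity * entity_score + w_api * api_score + w_entail * entailment_score

def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Precision and recall over claims the system flags as correct,
    measured against ground-truth labels."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

For example, a response whose entities all verify (entity_score = 1.0) but which has no prior fact-check (api_score = 0.0) and weak entailment support would score well below the acceptance threshold, which is the intended behavior: no single stage can certify a response on its own.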
## 5. Conclusion

Assessing the factual accuracy of LLM responses is a multifaceted problem that requires integrating various verification methods, from automated fact-checking APIs to knowledge graphs. By combining these methods and weighting their contributions, a reliable factual accuracy scoring system can be developed. This system can help improve trust in LLM-generated content, particularly in sensitive areas such as healthcare, finance, and legal advice.
Future research could focus on improving the coverage of knowledge graphs and the performance of real-time fact-checking systems, especially in niche domains.
As a data scientist
I need to study methods for assessing the factual accuracy of LLM responses
So that we can verify the correctness of information provided by each model
### Details and Assumptions
### Acceptance Criteria