mhoangvslev commented 8 months ago

Context

The Conformance and Factuality Checkers use SemanticValidator and FactualValidator.
SemanticValidator and FactualValidator each return a score in [0, 1]: the ratio of positives to the number of key-value pairs.
Labels are assigned using a decision threshold: label = int(score > decision_threshold)
Precision, recall, and F1 are computed from these labels, so decision_threshold has a significant impact on them.
A decision_threshold below 0.5 favors recall, while a decision_threshold above 0.5 favors precision.
How should we choose the right decision_threshold?
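
To make the trade-off concrete, here is a minimal sketch of the label rule and the resulting metrics. The scores and ground-truth labels are made-up toy values, and `assign_label`, `precision_recall_f1` and `evaluate` are hypothetical helper names, not functions from this repo.

```python
def assign_label(score: float, decision_threshold: float) -> int:
    # Label rule as stated in this issue: strict comparison against the threshold.
    return int(score > decision_threshold)


def precision_recall_f1(y_true, y_pred):
    # Plain counts of true positives, false positives and false negatives.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Toy data (assumption): validator scores and ground-truth labels for six markups.
scores = [0.9, 0.6, 0.4, 0.3, 0.55, 0.2]
y_true = [1, 1, 1, 0, 0, 0]


def evaluate(decision_threshold: float):
    y_pred = [assign_label(s, decision_threshold) for s in scores]
    return precision_recall_f1(y_true, y_pred)


for threshold in (0.25, 0.5, 0.75):
    p, r, f1 = evaluate(threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

On this toy data, recall falls and precision rises as the threshold increases. Real validator scores may not behave this cleanly, so the threshold is probably best chosen by sweeping it on a held-out set.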

mhoangvslev commented 8 months ago

Previously, decision_threshold = 1, because we wanted a markup to be considered valid if and only if all of its information is correct.
With decision_threshold = 0.5, however, we tolerate minor inaccuracies, as in SelfCheckGPT: if the majority of the information is correct, the markup is still labeled as correct.
We observed a better F1 score overall with this setting.
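
The difference between the two settings can be sketched on a single markup. The function names and the example property checks below are hypothetical. The all-or-nothing case is written as an equality test, because a strict `>` comparison against 1 can never be satisfied by a score in [0, 1]; how the repo handles that boundary is not shown in this thread.

```python
def validation_score(checks: dict) -> float:
    # Ratio of positive checks over the number of key-value pairs.
    return sum(1 for ok in checks.values() if ok) / len(checks)


def label_all_correct(score: float) -> int:
    # Assumed reading of "decision_threshold = 1": valid only if every pair is correct.
    return int(score == 1.0)


def label_majority(score: float) -> int:
    # decision_threshold = 0.5: valid if more than half of the pairs are correct.
    return int(score > 0.5)


# Hypothetical validator output for one markup: three of four pairs are correct.
checks = {"name": True, "author": True, "datePublished": True, "publisher": False}
score = validation_score(checks)

print(score, label_all_correct(score), label_majority(score))
```

Here a single wrong property invalidates the markup under the strict setting, while the majority setting still accepts it.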

mhoangvslev commented 8 months ago

In the paper, this only affects the test results on the Schema.org example dataset, where the Factual Checker results are based on decision_threshold = 0.5 and the Compliance Checker results are based on decision_threshold = 1. The final results shown in this repo are based on decision_threshold = 0.5.
The main experiment is not affected, since it involves no label assignment.

GDD-Nantes / LLM4SchemaOrg