Overview
In fine_tuning.py / construct_in_context_examples(), we generate labels of "worth checking", "maybe worth checking", or "not worth checking". These are derived from the model's multi-label predictions, which are based on fact-checker-provided data. Currently, the labelling is very crude: for example, if the predicted "harm" is "high", the claim is labelled "worth checking", with no other signals taken into account.
Requirements
Think about this more carefully and design a better scoring system. Consider which labels are available, how each should be interpreted, and how they might interact. One option would be to introduce a points system, where a claim gains "checkworthiness" points if it is predicted as high harm or as suggesting actions, but loses points if it is predicted as personal experience or vague. The final score would then be converted back to one of the three labels.
Before implementing a new scoring system, propose it and open up for discussion.
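As a starting point for that discussion, the points system above could be sketched as follows. This is a minimal, hypothetical sketch: the label names (`harm_high`, `suggests_action`, `personal_experience`, `vague`), the weights, and the thresholds are all illustrative assumptions, not the actual label set or values used in `construct_in_context_examples()`.

```python
# Hypothetical points-based checkworthiness scorer.
# Label names, weights, and thresholds are illustrative assumptions
# for discussion, not the project's actual multi-label outputs.

# Points awarded (positive) or deducted (negative) per predicted label.
LABEL_WEIGHTS = {
    "harm_high": 2,             # high predicted harm raises priority
    "suggests_action": 1,       # claims urging action carry more risk
    "personal_experience": -1,  # anecdotes are rarely fact-checkable
    "vague": -2,                # vague claims are hard to verify at all
}


def score_claim(predicted_labels):
    """Sum the weights of all predicted labels for one claim."""
    return sum(LABEL_WEIGHTS.get(label, 0) for label in predicted_labels)


def score_to_label(score):
    """Map a numeric score back to the three coarse output labels."""
    if score >= 2:
        return "worth checking"
    if score >= 1:
        return "maybe worth checking"
    return "not worth checking"


# Example: high harm plus a call to action scores 3 -> "worth checking",
# while high harm on a vague claim nets 0 -> "not worth checking".
print(score_to_label(score_claim(["harm_high", "suggests_action"])))
print(score_to_label(score_claim(["harm_high", "vague"])))
```

One design point worth discussing: with additive weights, a strong positive signal (high harm) can be fully cancelled by a strong negative one (vague), which may or may not be the desired interaction between labels.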