FullFact / health-misinfo-shared

Raphael health misinformation project, shared by Full Fact and Google
MIT License

Store useful model identifier when extracting claims (placeholder) #89

Open dcorney opened 4 months ago

dcorney commented 4 months ago

Overview

When we use a Gemini model to extract claims from a video, we should store a useful identifier with each claim. This would aid reproducibility and help us understand where claims came from. The identifier should uniquely identify not just the class of model (e.g. Gemini) and its version (e.g. 1.5-pro), but also relevant information about the prompt used to extract the claims and/or the data used for fine-tuning or in-context learning.

The code is still being restructured, so I'm not sure exactly where the change should go yet...

Requirements

For example, the code in src/raphael_backend_flask (or wherever we end up writing to the claim_extraction_runs table) currently writes gemini-pro as the model, but it should be more specific. The identifier could be passed from raphael_backend_flask/process.py when it calls create_claim_extraction_run().
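As a rough illustration of the kind of value that could be stored in place of gemini-pro, here is a minimal sketch. The class name, field names, and to_model_field() helper are all hypothetical; nothing here reflects the existing code or the actual signature of create_claim_extraction_run().

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelIdentifier:
    """Hypothetical description of how a set of claims was extracted."""

    model_family: str                       # class of model, e.g. "gemini"
    model_version: str                      # version, e.g. "1.5-pro"
    prompt_ref: Optional[str] = None        # some reference to the prompt used
    tuning_data_ref: Optional[str] = None   # fine-tuning / in-context data, if any

    def to_model_field(self) -> str:
        # Sort keys so the same identifier always serialises to the same string.
        return json.dumps(asdict(self), sort_keys=True)


identifier = ModelIdentifier("gemini", "1.5-pro", prompt_ref="some-prompt-reference")
model_field = identifier.to_model_field()
```

The resulting string could then be passed along wherever the model name is currently written, assuming the column can hold a longer text value.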

andylolz commented 4 months ago

This was the idea of #75. I.e. in the model field, we could store something like:

{
  "model": "gemini-pro",
  "prompt_sha": "b712b755083c852d4578cfff3f109b6f46c7fbd3"
}

…where prompt_sha refers to the most recent commit in the health_misinfo_shared folder when the claim was created, e.g. b712b755083c852d4578cfff3f109b6f46c7fbd3.
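One way to look up that commit is to ask git for the last commit that touched the folder. This is only a sketch: the function names are made up for illustration, the folder path is passed in because its exact location in the repo is not stated here, and it assumes git is installed and the code runs inside a checkout (which may not hold in a deployed container).

```python
import subprocess
from pathlib import Path
from typing import List


def git_log_command(folder: str) -> List[str]:
    # "git log -n 1 --format=%H -- <folder>" prints the full SHA of the most
    # recent commit that changed anything under <folder>.
    return ["git", "log", "-n", "1", "--format=%H", "--", folder]


def latest_commit_sha(folder: str, repo_root: Path = Path(".")) -> str:
    """Return the SHA of the most recent commit touching `folder`."""
    result = subprocess.run(
        git_log_command(folder),
        cwd=repo_root,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if git itself fails
    )
    sha = result.stdout.strip()
    if not sha:
        # git succeeds with empty output when no commit touches the path.
        raise ValueError(f"No commits found for {folder!r}")
    return sha
```

If git is not available at runtime, the same value could instead be computed at build time and passed in through configuration.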

The reason to do it this way is:

The downside is it’s not human-readable, so it’s not immediately obvious which version of the prompt this is or when it was changed. Also, the SHA refers to a whole folder of content, so it might change because of something very minor (e.g. a linting change to an unrelated file). To mitigate this, we ought to ensure that, as far as possible, the SHA only tracks the stuff that will affect output.
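A possible alternative to a folder-level commit SHA, sketched here as an assumption rather than something already agreed, is to hash only the files known to affect output. The function name and the idea of an explicit file list are hypothetical; which files belong in that list would need to be decided separately.

```python
import hashlib
from pathlib import Path
from typing import Iterable


def prompt_fingerprint(paths: Iterable[Path]) -> str:
    """Hash the names and contents of an explicit list of prompt-related files."""
    digest = hashlib.sha256()
    # Sort by file name so the result does not depend on the order given.
    for path in sorted((Path(p) for p in paths), key=lambda p: p.name):
        data = path.read_bytes()
        digest.update(path.name.encode("utf-8"))
        digest.update(b"\0")
        # Include the length so adjacent files cannot blur into one another.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()
```

Because only the listed files are read, a linting change to an unrelated file leaves the fingerprint unchanged. Like the commit SHA, though, the result is still not human-readable.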