Open dcorney opened 4 months ago
This was the idea of #75. I.e. in the model field, we could store something like:
{
  "model": "gemini-pro",
  "prompt_sha": "b712b755083c852d4578cfff3f109b6f46c7fbd3"
}
…where prompt_sha refers to the most recent commit in the health_misinfo_shared folder when the claim was created, e.g. b712b755083c852d4578cfff3f109b6f46c7fbd3.
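One way to look up that commit is to ask git for the last commit that touched the folder. This is only a rough sketch: the helper names (git_log_command, latest_commit_sha) are made up for illustration, and it assumes the code runs from inside a git checkout with git on the PATH, which may not be true in the deployed backend.

```python
import subprocess


def git_log_command(folder):
    """Build the git command that prints the SHA of the last commit touching `folder`."""
    # `--format=%H` prints the full commit hash; `-- folder` limits history to that path.
    return ["git", "log", "-n", "1", "--format=%H", "--", folder]


def latest_commit_sha(folder, repo_root="."):
    """Return the most recent commit SHA for `folder` (hypothetical helper).

    Assumes `repo_root` is inside a git working tree; raises
    subprocess.CalledProcessError if git reports an error.
    """
    result = subprocess.run(
        git_log_command(folder),
        cwd=repo_root,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Calling latest_commit_sha("health_misinfo_shared") from the repository root should give the value to store as prompt_sha. If the folder has no commits, the result is an empty string, so the caller would need to handle that case.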
The reason to do it this way is:
The downside is it’s not human-readable, so it’s not immediately obvious which version of the prompt this is or when it was changed. Also, the SHA refers to a whole folder of content, so it might change because of something very minor (e.g. a linting change to an unrelated file). To mitigate this, we ought to ensure that, as far as possible, the SHA only tracks the stuff that will affect output.
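As one possible mitigation (an alternative to the commit SHA, not what is proposed above), the identifier could be a hash of only the files that affect output. The sketch below assumes the prompt material lives in files matching some glob patterns; the function name prompt_content_sha and the default "*.txt" pattern are purely illustrative.

```python
import hashlib
from pathlib import Path


def prompt_content_sha(folder, patterns=("*.txt",)):
    """Hash only the files under `folder` that match `patterns` (hypothetical helper).

    Files that do not match (e.g. unrelated source files) do not affect the result.
    """
    root = Path(folder)
    # Collect matching files and sort them so the hash is independent of walk order.
    files = sorted(
        {p for pattern in patterns for p in root.rglob(pattern) if p.is_file()}
    )
    digest = hashlib.sha1()
    for path in files:
        # Include the relative path so renames change the hash too.
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()
```

The trade-off is that this value is no longer a commit you can check out, so it is harder to trace back to the repository history than a git SHA.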
Overview
When we use a Gemini model to extract claims from a video, we should store a useful identifier with the claim. In this way, we can aid reproducibility and understand where claims came from. This should uniquely identify not just the class of model (e.g. Gemini) and its version (e.g. 1.5-pro) but also relevant info about the prompt used to extract the claims and/or what data was used for fine-tuning or in-context learning.
The code is still being restructured, so I'm not sure exactly where the change should go yet...
Requirements
e.g. in src/raphael_backend_flask (or wherever we end up writing to the claim_extraction_runs table), it currently writes gemini-pro as the model but should be more specific. This could be passed from raphael_backend_flask/process.py when it calls create_claim_extraction_run().
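A more specific value for the model field could be built along these lines. This is a sketch only: build_model_identifier is a hypothetical helper, and joining the model class and version into a single string such as gemini-1.5-pro is an assumption rather than an agreed format.

```python
import json


def build_model_identifier(model, version, prompt_sha):
    """Serialise the model class, version and prompt SHA into one string (hypothetical helper)."""
    identifier = {
        # Assumed convention: "<class>-<version>", e.g. "gemini-1.5-pro".
        "model": f"{model}-{version}",
        "prompt_sha": prompt_sha,
    }
    # sort_keys keeps the stored string stable across runs.
    return json.dumps(identifier, sort_keys=True)


identifier = build_model_identifier(
    "gemini", "1.5-pro", "b712b755083c852d4578cfff3f109b6f46c7fbd3"
)
print(identifier)
```

The resulting string would replace the hard-coded gemini-pro value wherever create_claim_extraction_run() is called; that function's actual signature is not shown here, so how the value gets passed in is left open.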