FullFact / health-misinfo-shared

Raphael health misinformation project, shared by Full Fact and Google
MIT License

Show a smaller extract of the transcript for each paraphrased claim #90

Closed dcorney closed 4 months ago

dcorney commented 4 months ago

Overview

[NB: this is based on the current dev branch, which is soon to be merged into main]

Currently, when the user hovers over an extracted claim, a tooltip pop-up shows an extract of the raw transcript. However, this is a "chunk", corresponding to about 1-2 minutes of the video. This can be quite long, making it hard to find the source of the claim.

Ideally:

  1. the genAI model extracting the claim should also return the original sentence;
  2. this should be stored in the database along with the inferred claim; and then
  3. just the sentence should be shown in the pop up.
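The three steps above could be sketched as follows. This is only illustrative: the JSON shape, the `ExtractedClaim` dataclass, and `parse_llm_response` are all hypothetical, though `raw_sentence_text` matches the existing field name in `inferred_claims`.

```python
import json
from dataclasses import dataclass


@dataclass
class ExtractedClaim:
    claim: str
    raw_sentence_text: str  # the single source sentence, not the whole chunk


def parse_llm_response(raw: str) -> list[ExtractedClaim]:
    """Parse a (hypothetical) JSON array of {claim, raw_sentence_text} objects
    returned by the updated prompt."""
    return [ExtractedClaim(**item) for item in json.loads(raw)]


# Example of the kind of response the updated prompt might produce:
response = (
    '[{"claim": "X causes Y",'
    ' "raw_sentence_text": "And of course we all know X causes Y."}]'
)
claims = parse_llm_response(response)
```

Each parsed claim would then be stored alongside its source sentence, and the tooltip would render only `raw_sentence_text`.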

Requirements

Update the prompt so that it returns the original sentence along with the inferred claim.

Ideally, this should also be correctly punctuated and with an initial capital letter for readability.
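Even with the prompt updated, a small defensive clean-up step might be worthwhile in case the model echoes back a lowercase, unpunctuated transcript fragment. A minimal sketch (the function name is hypothetical):

```python
def tidy_sentence(text: str) -> str:
    """Capitalise the first letter and ensure terminal punctuation.

    A defensive fallback for when the model returns a raw,
    unpunctuated transcript fragment.
    """
    text = text.strip()
    if not text:
        return text
    # Capitalise the first character only; leave the rest untouched.
    text = text[0].upper() + text[1:]
    # Append a full stop if there is no terminal punctuation already.
    if text[-1] not in ".!?":
        text += "."
    return text
```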

Notes and additional information

Relevant code locations:

process.py / extract_claims() gets the raw_sentence_text returned by the LLM. This is stored in inferred_claims.

templates/video_analysis.html defines list-group-item which shows claim['raw_sentence_text'] as the tooltip.

We might want to consider showing 2-3 sentences if just one doesn't provide enough context.

andylolz commented 4 months ago

Investigate why the model is returning such large blocks of text. Is the prompt wrong? Are we mixing up the sentence text with the chunk text at some point? Do we need to do some post-processing to work out where in the chunk this particular claim was made?

^^ The prompt we’re currently using doesn’t ask for sentence text, so it’s not included in the response.

Because we don’t have the sentence text, I put the chunk text into raw_sentence_text instead. That’s not what the field is intended for, though. I think the intention is to store the sentence text returned by the LLM there.
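If updating the prompt turns out not to be enough, one post-processing option would be to pick the chunk sentence most similar to the inferred claim. A sketch using standard-library fuzzy matching (`best_matching_sentence` is hypothetical, not existing project code):

```python
import difflib
import re


def best_matching_sentence(chunk_text: str, claim: str) -> str:
    """Return the chunk sentence most similar to the inferred claim.

    A possible workaround when the model's response does not include
    the source sentence itself. Similarity is difflib's ratio over
    lowercased text; naive sentence splitting as elsewhere.
    """
    sentences = re.split(r"(?<=[.!?])\s+", chunk_text)
    return max(
        sentences,
        key=lambda s: difflib.SequenceMatcher(
            None, s.lower(), claim.lower()
        ).ratio(),
    )
```

This is lossy compared with getting the sentence straight from the model, but it would at least stop the whole chunk landing in raw_sentence_text.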