Handle multiple analyses of the same video

dcorney commented 4 months ago

Describe the bug

Currently, if someone enters the video id and clicks analyse for a video that has already been analysed, the model will extract the same (or similar) claims again and add them to the list.

Instead, we should list the video twice (or more times) on the main page, each linking to its associated set of claims.

This is especially important as we update the prompts/models etc. and may want to compare the same video before and after a change.

To Reproduce

Steps to reproduce the behaviour:

1. Go to raphael
2. Enter a video id and click analyse
3. Review the extracted claims
4. Enter THE SAME video id and click analyse
5. Observe that more claims have appeared in the list, roughly doubling it in size

Expected behaviour

If the same id occurs more than once in the list of analysed videos, distinguish the entries by a version number, e.g.:

4WAFHXdTMbY
4WAFHXdTMbY (v2)
4WAFHXdTMbY (v3)

etc.
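
The labelling above could be sketched roughly as follows. This is only an illustration, not code from the repo: `label_runs` is a hypothetical helper, and it assumes the analysed videos are available as a list of YouTube ids in the order they were analysed.

```python
from collections import Counter


def label_runs(video_ids):
    """Return one display label per analysed video, in the order given.

    The first run of an id is shown bare; later runs of the same id
    get a " (vN)" suffix, where N counts runs of that id so far.
    """
    seen = Counter()
    labels = []
    for video_id in video_ids:
        seen[video_id] += 1
        n = seen[video_id]
        labels.append(video_id if n == 1 else f"{video_id} (v{n})")
    return labels


for label in label_runs(["4WAFHXdTMbY", "4WAFHXdTMbY", "4WAFHXdTMbY"]):
    print(label)
```

Each label would then link to the claims from that particular run, rather than to all claims for the YouTube id.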

Additional context

For reference: in Live, if the same YouTube video is analysed twice, two versions are shown. This is what we want here.

andylolz commented 4 months ago

> This is especially important as we update the prompts/models etc. and may want to compare the same video before and after a change.

^^ Given this, rather than just v1, v2, v3, would it be useful to keep track of some extra metadata? E.g.:

the timestamp of when the video was analysed;
some sort of reference for the prompt that was used (though not sure how we’d do this);
the git SHA (we could add this as an environment variable when deploying)
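
A minimal sketch of what collecting that metadata might look like. Everything here is an assumption for illustration: the function name `build_run_metadata`, the environment variable name `GIT_SHA`, and the idea of using a short hash of the prompt text as the "prompt reference" are not taken from the codebase.

```python
import hashlib
import os
from datetime import datetime, timezone


def build_run_metadata(prompt_text, environ=os.environ):
    """Collect metadata describing a single analysis run."""
    return {
        # When the video was analysed, in UTC, as an ISO 8601 string.
        "analysed_at": datetime.now(timezone.utc).isoformat(),
        # One possible prompt reference: a short hash of the prompt text,
        # so any edit to the prompt produces a different value.
        "prompt_sha256": hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12],
        # Hypothetical variable name; it would be set at deploy time.
        "git_sha": environ.get("GIT_SHA", "unknown"),
    }


print(build_run_metadata("Extract health claims from this transcript."))
```

Hashing the prompt only tells us whether two runs used the same prompt, not what the prompt was, so the prompt text itself would still need to be stored or versioned somewhere.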

andylolz commented 4 months ago

The issue here is that we’re keying based on the YouTube ID, rather than our own ID. There would need to be some schema changes in order to store different runs of the same video separately.

The schema currently looks like this:

erDiagram
  video_transcripts ||--o{ training_claims : claims
  video_transcripts ||--o{ inferred_claims : claims
  video_transcripts {
    text id PK
    text url
    text metadata
    text transcript
    text status
  }

  training_claims {
    integer id PK
    text video_id FK
    text claim
    text label
    integer offset_ms
  }

  inferred_claims {
    integer id PK
    text video_id FK
    text claim
    text label
    text model
    integer offset_ms
  }

We’d need to move the current video_transcripts id to another field (e.g. youtube_id), add a new auto-increment ID to video_transcripts, and then point at that from the other tables.
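
As a rough sketch of that change, assuming an SQLite database (the issue does not say which database is in use) and leaving out `training_claims` for brevity. The helper names `add_run` and `claims_for` are hypothetical, and this is not the actual updated schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE video_transcripts (
    id INTEGER PRIMARY KEY AUTOINCREMENT,  -- new surrogate key, one per run
    youtube_id TEXT NOT NULL,              -- the value that used to be the key
    url TEXT,
    metadata TEXT,
    transcript TEXT,
    status TEXT
);
CREATE TABLE inferred_claims (
    id INTEGER PRIMARY KEY,
    video_id INTEGER NOT NULL REFERENCES video_transcripts(id),
    claim TEXT,
    label TEXT,
    model TEXT,
    offset_ms INTEGER
);
"""


def add_run(conn, youtube_id, claims):
    """Store one analysis run and its claims; return the new run id."""
    cur = conn.execute(
        "INSERT INTO video_transcripts (youtube_id) VALUES (?)", (youtube_id,)
    )
    run_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO inferred_claims (video_id, claim) VALUES (?, ?)",
        [(run_id, claim) for claim in claims],
    )
    return run_id


def claims_for(conn, run_id):
    """Return the claims belonging to a single run, in insertion order."""
    rows = conn.execute(
        "SELECT claim FROM inferred_claims WHERE video_id = ? ORDER BY id",
        (run_id,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
first = add_run(conn, "4WAFHXdTMbY", ["claim A", "claim B"])
second = add_run(conn, "4WAFHXdTMbY", ["claim A", "claim C"])
print(claims_for(conn, first), claims_for(conn, second))
```

Because claims point at the run id rather than the YouTube id, analysing the same video twice yields two separate claim sets instead of one merged list.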

dcorney commented 4 months ago

For an updated schema, refer to #71

FullFact / health-misinfo-shared