Database schema revisions

The [current database schema](https://github.com/FullFact/health-misinfo-shared/issues/62#issuecomment-2118785064) has a few gaps. We’d like to store the following things:

- multiple runs of the same YouTube video (e.g. using different prompts) (#62)
- the raw text that a claim was extracted from (#42)
- offset end timestamps for inferred claims (not explicitly mentioned in #56, but related)
- training claims for videos that may not have been run through claim extraction

Here’s a proposed revised schema. ~This is WIP, and it may also be more complicated than we really need. But hopefully it should capture the things mentioned above.~ UPDATE: @dcorney, @ff-dh, @JamesMcMinn and @andylolz discussed and agreed the following:

erDiagram
  youtube_videos ||--o{ claim_extraction_runs : runs
  youtube_videos {
    text id
    text metadata
    text transcript
  }

  claim_extraction_runs ||--o{ inferred_claims : claims
  claim_extraction_runs {
    integer id PK
    text youtube_id FK
    text model
    text status
    integer timestamp
  }

  inferred_claims {
    integer id PK
    integer run_id FK
    text claim
    text raw_sentence_text
    text labels
    real offset_start_s
    real offset_end_s
  }

  training_claims {
    integer id PK
    text youtube_id
    text claim
    text labels
  }
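
For reference, here is a rough sketch of how the diagram above might translate into table definitions. It assumes SQLite (suggested by the `text` / `integer` / `real` column types, but not confirmed here) and Python's built-in `sqlite3` module. The helper names `create_db` and `runs_per_video` are hypothetical, as is treating `youtube_videos.id` as the primary key. `training_claims.youtube_id` is deliberately left without a foreign key, on the assumption that training claims may refer to videos that were never run through claim extraction.

```python
import sqlite3

# Assumed DDL derived from the ER diagram; not the project's actual migration.
SCHEMA = """
CREATE TABLE youtube_videos (
    id TEXT PRIMARY KEY,  -- assumption: the YouTube video ID
    metadata TEXT,
    transcript TEXT
);
CREATE TABLE claim_extraction_runs (
    id INTEGER PRIMARY KEY,
    youtube_id TEXT REFERENCES youtube_videos(id),
    model TEXT,
    status TEXT,
    timestamp INTEGER
);
CREATE TABLE inferred_claims (
    id INTEGER PRIMARY KEY,
    run_id INTEGER REFERENCES claim_extraction_runs(id),
    claim TEXT,
    raw_sentence_text TEXT,
    labels TEXT,
    offset_start_s REAL,
    offset_end_s REAL
);
CREATE TABLE training_claims (
    id INTEGER PRIMARY KEY,
    youtube_id TEXT,  -- no FK: the video may not exist in youtube_videos
    claim TEXT,
    labels TEXT
);
"""


def create_db(path=":memory:"):
    """Open a SQLite database and create the four tables."""
    conn = sqlite3.connect(path)
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def runs_per_video(conn):
    """Return a {youtube_id: number_of_runs} mapping."""
    rows = conn.execute(
        "SELECT youtube_id, COUNT(*) FROM claim_extraction_runs GROUP BY youtube_id"
    )
    return dict(rows.fetchall())
```

With this layout, a second run over the same video is just another row in `claim_extraction_runs`, and each inferred claim is tied to the run that produced it through `run_id`.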

Database schema revisions #71