- multiple runs of the same YouTube video (e.g. using different prompts) (#62)
- the raw text that a claim was extracted from (#42)
- offset end timestamps for inferred claims (not explicitly mentioned in #56, but related)
- training claims for videos that may not have been run through claim extraction
Here’s a proposed revised schema. ~This is WIP, and it may also be more complicated than we really need. But hopefully it should capture the things mentioned above.~ UPDATE: @dcorney, @ff-dh, @JamesMcMinn and @andylolz discussed and agreed the following:
erDiagram
youtube_videos ||--o{ claim_extraction_runs : runs
youtube_videos {
text id
text metadata
text transcript
}
claim_extraction_runs ||--o{ inferred_claims : claims
claim_extraction_runs {
integer id PK
text youtube_id FK
text model
text status
integer timestamp
}
inferred_claims {
integer id PK
integer run_id FK
text claim
text raw_sentence_text
text labels
real offset_start_s
real offset_end_s
}
training_claims {
integer id PK
text youtube_id
text claim
text labels
}
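To make the diagram a bit more concrete, here is a rough sketch of how it could map onto SQLite tables, using Python's built-in `sqlite3` module. A few things here are assumptions rather than anything agreed above: that the database is SQLite, that `youtube_videos.id` is the primary key (the diagram doesn't mark it as one), that the foreign key columns are `NOT NULL`, and that `training_claims.youtube_id` is deliberately left without a foreign key so training claims can point at videos we haven't stored. The helper names `create_db` and `add_run` are made up for illustration.

```python
import sqlite3

# Table and column names follow the ER diagram above.
# Constraints (PRIMARY KEY on youtube_videos.id, NOT NULL on FKs) are assumptions.
SCHEMA = """
CREATE TABLE youtube_videos (
    id TEXT PRIMARY KEY,
    metadata TEXT,
    transcript TEXT
);
CREATE TABLE claim_extraction_runs (
    id INTEGER PRIMARY KEY,
    youtube_id TEXT NOT NULL REFERENCES youtube_videos(id),
    model TEXT,
    status TEXT,
    timestamp INTEGER
);
CREATE TABLE inferred_claims (
    id INTEGER PRIMARY KEY,
    run_id INTEGER NOT NULL REFERENCES claim_extraction_runs(id),
    claim TEXT,
    raw_sentence_text TEXT,
    labels TEXT,
    offset_start_s REAL,
    offset_end_s REAL
);
-- No foreign key here (assumption): the video may never have been
-- run through claim extraction, or stored in youtube_videos at all.
CREATE TABLE training_claims (
    id INTEGER PRIMARY KEY,
    youtube_id TEXT,
    claim TEXT,
    labels TEXT
);
"""


def create_db(path=":memory:"):
    """Hypothetical helper: open a database and create the four tables."""
    conn = sqlite3.connect(path)
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def add_run(conn, youtube_id, model, status="pending", timestamp=0):
    """Hypothetical helper: record one claim extraction run for a video."""
    cur = conn.execute(
        "INSERT INTO claim_extraction_runs (youtube_id, model, status, timestamp) "
        "VALUES (?, ?, ?, ?)",
        (youtube_id, model, status, timestamp),
    )
    return cur.lastrowid


if __name__ == "__main__":
    conn = create_db()
    conn.execute(
        "INSERT INTO youtube_videos (id, metadata, transcript) VALUES (?, ?, ?)",
        ("vid1", "{}", "example transcript"),
    )
    # Two runs against the same video, e.g. with different prompts or models.
    first = add_run(conn, "vid1", "model-a")
    second = add_run(conn, "vid1", "model-b")
    print(first, second)
```

Because each row in `inferred_claims` hangs off a run rather than directly off a video, re-running extraction on the same video just adds another row to `claim_extraction_runs` with its own set of claims.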
For context: the current database schema (https://github.com/FullFact/health-misinfo-shared/issues/62#issuecomment-2118785064) has a few gaps. The items listed at the top of this issue are the things we’d like to store.