eye-on-surveillance / sawt

https://sawt.eyeonsurveillance.org/

Transcription structure #74

Open marvinmarnold opened 1 year ago

marvinmarnold commented 1 year ago

Proposal for how to structure transcriptions:

type IDocument = {
  // autogenerated
  id: uuid;
  name: string;
  description: string;
  original_url: string;
  original_published_at: timestampz;
  original_format: 'video';
  original_source: 'youtube';
  type: 'full_council_meeting' | 'committee_meeting' ;
  subtype: 'regular' | 'special' | 'criminal_justice' | 'budget' | ...
}

type ISpeaker = {
  id: uuid;
  // This is the label the transcriber will use
  slug: string;
  role: 'council_member' | 'public' | 'gov_agency' | 'civic_society';
  name: string;
}

type IDocumentFragment = {
  id: uuid;
  document_id: fk;
  speaker_id: fk;
  // number of milliseconds into the video
  // you don't think this is necessary, but I don't understand why
  timestamp: int;
  text: string;
}
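
For what it's worth, a few placeholder aliases (my assumption, not part of the proposal) would let the sketch above type-check as TypeScript once the open-ended subtype list is filled in:

// Hypothetical aliases for the non-standard type names used in the sketch;
// the real column types (Postgres UUID, foreign key, timestamptz) are TBD.
type uuid = string;        // e.g. "550e8400-e29b-41d4-a716-446655440000"
type fk = uuid;            // foreign key referencing another table's id
type timestampz = string;  // ISO 8601 timestamp, e.g. "2023-02-02T18:00:00Z"
type int = number;         // millisecond offset stored as an integer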

In order to transcribe a city council video from YouTube, the transcriber should:

@ayyubibrahimi what tool do you think transcription should be done through? Google Sheets would be the easiest.

ayyubibrahimi commented 1 year ago

Agreed.

I'll go ahead and confirm that we've landed on those 4 speaker roles.

marvinmarnold commented 1 year ago

Great. I can make the spreadsheet pretty quickly, I think.

marvinmarnold commented 1 year ago

@ayyubibrahimi does this make sense to you? https://docs.google.com/spreadsheets/d/1bJRjfZtKvzgdLgjDg4QN8OPVQjmjIUQj2PY8fK6ia8o/edit?usp=sharing

ayyubibrahimi commented 1 year ago

Makes sense in the long term. I think that we need to ensure we can reliably map the text that Caitlin transcribes to the data that has timestamps before finalizing a schema.

marvinmarnold commented 1 year ago

Ya, I'm not clear how you are thinking about doing that.

ayyubibrahimi commented 1 year ago

Simple in theory. Planning to begin experimenting soon. Brief overview:

Example of a chunk of data that contains timestamps:

{
  "timestamp": "0:00-0:24",
  "page_content": "back then libraries were so important at the school that we would go there and I learned my I got education I was educated in Carver's Library so I have connections to the um school that I'm proud of but I'm so very pleased to represent District D which includes Carver High School very proud of you all and as you head back to carver-ram's way just remember that we're behind you 100 thank you thank you",
  "url": "https://www.youtube.com/watch?v=Bl-Tv5yuUTw&ab_channel=NewOrleansCityCouncil",
  "title": "City Council Meeting 2-2-2023",
  "publish_date": "2/2/2023"
}

Example of how I think the transcribed text should be formatted:

[
    {"text": "back then libraries were so", "speaker": "civic society"},
    {"text": "important at the school that", "speaker": "governmental agency"},
    {"text": "we would go there and", "speaker": "governmental agency"},
    {"text": "I learned my I got", "speaker": "city council member"},
    {"text": "education I was educated in", "speaker": "governmental agency"},
    {"text": "Carver's Library so I have", "speaker": "governmental agency"},
    {"text": "connections to the um school", "speaker": "public"},
    {"text": "that I'm proud of but", "speaker": "public"},
    {"text": "I'm so very pleased to", "speaker": "governmental agency"},
    {"text": "represent District D which includes", "speaker": "public"},
    {"text": "Carver High School very proud", "speaker": "public"},
    {"text": "of you all and as", "speaker": "governmental agency"},
    {"text": "you head back to carver-ram's", "speaker": "public"},
    {"text": "way just remember that we're", "speaker": "governmental agency"},
    {"text": "behind you 100 thank you", "speaker": "governmental agency"},
    {"text": "thank you", "speaker": "public"}
]

Because we're currently chunking data at roughly 5-second intervals, the number of tokens within a chunk should be relatively consistent. If we chunk the transcribed text similarly, we should be able to perform a simple string similarity search to match the transcribed text with the timestamps.
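
A rough sketch of what that matching could look like (my own illustration, not an agreed approach; the token-overlap scoring and the field names are assumptions): chunk Caitlin's transcript into pieces of roughly the same size as the timestamped chunks, then give each transcribed chunk the timestamp of the chunk it overlaps most.

// Hypothetical matching sketch: align a transcribed chunk with the
// timestamped chunk whose text overlaps it the most (Jaccard similarity
// over lowercased tokens). Shapes follow the two examples above.
type TimestampedChunk = { timestamp: string; page_content: string };
type TranscribedChunk = { text: string; speaker: string };

const tokens = (s: string): Set<string> =>
  new Set(s.toLowerCase().split(/\s+/).filter(Boolean));

const jaccard = (a: Set<string>, b: Set<string>): number => {
  const shared = [...a].filter((t) => b.has(t)).length;
  return shared / (a.size + b.size - shared || 1);
};

// Returns the index of the best-matching timestamped chunk for one
// transcribed chunk, so the speaker label inherits that chunk's timestamp.
function bestMatch(t: TranscribedChunk, chunks: TimestampedChunk[]): number {
  const target = tokens(t.text);
  let best = 0;
  let bestScore = -1;
  chunks.forEach((c, i) => {
    const score = jaccard(target, tokens(c.page_content));
    if (score > bestScore) {
      bestScore = score;
      best = i;
    }
  });
  return best;
}

Whether plain token overlap is robust enough is basically the drift question raised in the next comment.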

marvinmarnold commented 1 year ago

> Because we're currently chunking data at roughly 5-second intervals, the number of tokens within a chunk should be relatively consistent.

I'm not convinced. Wouldn't Caitlin's transcription need to be almost perfectly lined up with the YouTube one for this to work? I imagine the two will start to drift pretty fast. And if Caitlin needs to track 5-second increments, why not just have her track her own timestamps? Maybe 5 seconds is easier than timestamping every sound bite.

ayyubibrahimi commented 1 year ago

I don't think drifting is an issue if the string similarity match has pointers to the preceding and following chunks. Alternatively, she can transcribe in increments of 60 seconds for the sake of efficiency, and for the purposes of matching the strings, we can chunk the timestamp data at a 60-second interval. These chunks can always be preprocessed further before they're read into the model.
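
To make the 60-second option concrete, here is one way the existing ~5-second timestamped chunks could be merged into 60-second windows before matching. This is just a sketch under my own assumptions: the "m:ss-m:ss" timestamp format from the example above, and a fixed window size passed in as a parameter.

// Hypothetical re-chunking sketch: merge consecutive ~5s chunks into ~60s
// windows so they line up with 60-second transcription increments.
type TimestampedChunk = { timestamp: string; page_content: string };

// Parse "m:ss" or "h:mm:ss" (the start of a chunk's range) into seconds.
const startSeconds = (range: string): number => {
  const parts = range.split("-")[0].split(":").map(Number);
  return parts.reduce((total, part) => total * 60 + part, 0);
};

function mergeIntoWindows(
  chunks: TimestampedChunk[],
  windowSeconds = 60
): TimestampedChunk[] {
  const windows = new Map<number, TimestampedChunk>();
  for (const chunk of chunks) {
    const bucket = Math.floor(startSeconds(chunk.timestamp) / windowSeconds);
    const existing = windows.get(bucket);
    if (existing) {
      // Append this chunk's text to the window it falls into.
      existing.page_content += " " + chunk.page_content;
    } else {
      windows.set(bucket, {
        timestamp: `${bucket * windowSeconds}s-${(bucket + 1) * windowSeconds}s`,
        page_content: chunk.page_content,
      });
    }
  }
  return [...windows.values()];
}

The same string-similarity matching as in the earlier sketch could then run over these 60-second windows instead of the raw 5-second chunks.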