Open marvinmarnold opened 1 year ago
Agreed.
I'll go ahead and confirm that we've landed on those 4 speaker roles.
Great. I can make the spreadsheet pretty quick I think
@ayyubibrahimi does this make sense to you? https://docs.google.com/spreadsheets/d/1bJRjfZtKvzgdLgjDg4QN8OPVQjmjIUQj2PY8fK6ia8o/edit?usp=sharing
Makes sense in the long term. I think that we need to ensure we can reliably map the text that Caitlin transcribes to the data that has timestamps before finalizing a schema.
Ya, I'm not clear how you are thinking about doing that.
Simple in theory. Planning to begin experimenting soon. Brief overview:
Example of a chunk of data that contains timestamps:
{
"timestamp": "0:00-0:24",
"page_content": "back then libraries were so important at the school that we would go there and I learned my I got education I was educated in Carver's Library so I have connections to the um school that I'm proud of but I'm so very pleased to represent District D which includes Carver High School very proud of you all and as you head back to carver-ram's way just remember that we're behind you 100 thank you thank you",
"url": "https://www.youtube.com/watch?v=Bl-Tv5yuUTw&ab_channel=NewOrleansCityCouncil",
"title": "City Council Meeting 2-2-2023",
"publish_date": "2/2/2023"
},
Example of how I think the transcribed text should be formatted:
[
{"text": "back then libraries were so", "speaker": "civic society"},
{"text": "important at the school that", "speaker": "governmental agency"},
{"text": "we would go there and", "speaker": "governmental agency"},
{"text": "I learned my I got", "speaker": "city council member"},
{"text": "education I was educated in", "speaker": "governmental agency"},
{"text": "Carver's Library so I have", "speaker": "governmental agency"},
{"text": "connections to the um school", "speaker": "public"},
{"text": "that I'm proud of but", "speaker": "public"},
{"text": "I'm so very pleased to", "speaker": "governmental agency"},
{"text": "represent District D which includes", "speaker": "public"},
{"text": "Carver High School very proud", "speaker": "public"},
{"text": "of you all and as", "speaker": "governmental agency"},
{"text": "you head back to carver-ram's", "speaker": "public"},
{"text": "way just remember that we're", "speaker": "governmental agency"},
{"text": "behind you 100 thank you", "speaker": "governmental agency"},
{"text": "thank you", "speaker": "public"}
]
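A minimal sketch of how segments in this shape could be produced, assuming the transcriber records one speaker label per stretch of speech and that the text is cut every five words (the size the example above happens to use). `split_turn` and `words_per_chunk` are hypothetical names for illustration, not part of any existing code:

```python
def split_turn(text: str, speaker: str, words_per_chunk: int = 5) -> list[dict]:
    """Split one speaker's text into fixed-size word chunks, keeping the speaker label."""
    words = text.split()
    return [
        {"text": " ".join(words[i:i + words_per_chunk]), "speaker": speaker}
        for i in range(0, len(words), words_per_chunk)
    ]

# Illustrative input: the first 12 words of the example chunk, with an assumed label.
segments = split_turn(
    "back then libraries were so important at the school that we would",
    "public",
)
for segment in segments:
    print(segment)
```

The last segment of a turn may be shorter than `words_per_chunk`, as with the final "thank you" entry in the example.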
Because we're currently chunking data on a roughly 5 second interval, the number of tokens within a chunk should be relatively consistent. If we chunk the transcribed text similarly, we should be able to perform a simple string similarity search to match the transcribed text with the timestamps.
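As a rough sketch of that matching step, Python's standard-library `difflib.SequenceMatcher` could score each timestamped chunk against a piece of transcribed text. The helper names and the 5-second timestamps below are illustrative assumptions. Only the `timestamp` and `page_content` keys come from the example data:

```python
from difflib import SequenceMatcher

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't skew the score."""
    return " ".join(text.lower().split())

def best_match(transcribed: str, timestamped_chunks: list[dict]) -> tuple[int, float]:
    """Return (index, similarity ratio) of the chunk most similar to the transcribed text."""
    query = normalize(transcribed)
    scores = [
        SequenceMatcher(None, query, normalize(chunk["page_content"])).ratio()
        for chunk in timestamped_chunks
    ]
    index = max(range(len(scores)), key=scores.__getitem__)
    return index, scores[index]

# Hypothetical 5-second chunks; the timestamps are made up for illustration.
chunks = [
    {"timestamp": "0:00-0:05", "page_content": "back then libraries were so important at the school"},
    {"timestamp": "0:05-0:10", "page_content": "that we would go there and I learned my I got education"},
    {"timestamp": "0:10-0:15", "page_content": "I was educated in Carver's Library so I have connections"},
]

index, score = best_match("I was educated in Carvers library, so I have connections", chunks)
print(chunks[index]["timestamp"], round(score, 2))
```

This compares against every chunk, so it does nothing about repeated phrases or drift. It only shows that small punctuation and spelling differences still leave a high similarity score.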
> Because we're currently chunking data on a roughly 5 second interval, the number of tokens within a chunk should be relatively consistent.
I'm not convinced. Wouldn't Caitlin's transcription need to be almost perfectly lined up with the YouTube one for this to work? I imagine the two will start to drift pretty fast. And if Caitlin needs to track 5 sec increments, why not just have her track her own timestamps? Maybe tracking every 5 sec is easier than tracking every sound bite.
I don't think drifting is an issue if the string similarity match has pointers to the preceding and following chunks. Alternatively, she can transcribe in increments of 60 seconds, for the sake of efficiency, and for the purposes of matching the strings, we can chunk the timestamp data on a 60 second interval. These chunks can always be preprocessed further before they're read into the model.
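One possible reading of the "pointers to neighbouring chunks" idea, sketched with the standard-library `difflib`: match segments in order and only search a small window at or after the previous match, so a repeated phrase elsewhere in the meeting can't pull the alignment off course. `align_in_order` and `window` are hypothetical names, the sample strings are illustrative, and the sketch assumes `chunks` is non-empty:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def align_in_order(segments: list[str], chunks: list[str], window: int = 2) -> list[int]:
    """For each segment, pick the best chunk within `window` positions of the previous match."""
    matches = []
    start = 0
    for segment in segments:
        candidates = range(start, min(start + window + 1, len(chunks)))
        best = max(candidates, key=lambda i: similarity(segment, chunks[i]))
        matches.append(best)
        start = best  # never search backwards, which limits drift
    return matches

# Illustrative data: "thank you" appears at both the start and the end.
chunks = [
    "thank you",
    "back then libraries were so important",
    "I was educated in Carver's Library",
    "thank you thank you",
]
segments = [
    "back then libraries were important",
    "i was educated in carvers library",
    "thank you",
]

matches = align_in_order(segments, chunks)
print(matches)
```

An unconstrained search would send the final "thank you" back to the first chunk, which is an exact match. The window keeps it next to the segment that precedes it. The same function would work unchanged on 60 second chunks, with a smaller `window`.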
Proposal for how to structure transcriptions:
To transcribe a city council video from YouTube, the transcriber should:
@ayyubibrahimi what tool do you think transcription should be done through? Google Sheets would be the easiest.