mlissner opened this issue 4 months ago
To transcribe audio, I would recommend taking a look at OpenAI's Whisper plus a long-context LLM (gives you more customization options), or AssemblyAI (can separate speaker 1 and speaker 2, etc., without any customization). If you don't want to run your audio files through transcription again, I can write a script to feed the content of each text file into a fine-tuned or properly prompted model. IMHO, GPT-4o or Claude 3.5 Sonnet should be able to pick up where one speaker stops and another starts, which speaker is a judge or a party, and the sentiment. This will cost $$, but at some point you may want to reach out to OpenAI, Anthropic, Cohere, and AssemblyAI, explain the public benefit of your project, and ask for credits.
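To make that concrete, here's a rough sketch of what that script could look like: loop over the existing text files and ask a prompted model to mark speaker turns. The directory, model name, and prompt wording are just placeholders, and long arguments would need real chunking rather than the naive truncation shown here.

```python
# Sketch only: feed each plain-text transcript to a prompted model and ask it
# to mark speaker turns. Paths, model name, and prompt are placeholders.
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

SYSTEM_PROMPT = (
    "You are labeling a court oral-argument transcript. "
    "Split the text into speaker turns and prefix each turn with "
    "SPEAKER_1:, SPEAKER_2:, etc. Mark the presiding judge as JUDGE when "
    "the context makes that clear."
)

for path in Path("transcripts").glob("*.txt"):  # hypothetical directory
    text = path.read_text()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text[:100_000]},  # naive cut-off; real code would chunk with overlap
        ],
    )
    labeled = response.choices[0].message.content
    path.with_suffix(".labeled.txt").write_text(labeled)
```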
We're talking to OpenAI already: https://openai.com/index/data-partnerships/
And we used Whisper (via the OpenAI partnership) to transcribe. So I guess my question is how you'd take a transcription we have and have it identify speakers? (The first run of our content through Whisper cost around $20k.)
Oh my... that's a significant spend, I hope they gave you some credit. I can run Whisper on my server next time you need it. Basically, you either 1) fine-tune an open-source model yourself or via third-party infra like Predibase, or 2) use one of the good commercial models (I like GPT-4o and Claude 3.5 Sonnet) with a prompt and examples. You can use some of the SCOTUS transcripts for training or for the prompt. But as a first step, I would get clear about the use case and the typical queries you'd like this service to handle; then it's easier to fine-tune or prompt. Take a look at AssemblyAI, they specialize in transcription.
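If the fine-tuning route looks interesting, the training data can come straight from transcripts that already carry speaker labels (the SCOTUS ones, for example): strip the labels to fake a raw Whisper output, and use the labeled version as the target. A rough sketch in the OpenAI chat fine-tuning JSONL format; the directory layout and the "SPEAKER:" label convention are assumptions, not your actual format.

```python
# Sketch: turn already-labeled SCOTUS transcripts into chat fine-tuning
# examples (unlabeled text in, labeled text out). File layout and the
# "SPEAKER:" prefix convention are placeholders.
import json
import re
from pathlib import Path

def strip_labels(labeled: str) -> str:
    """Remove 'SOME SPEAKER:' prefixes so the input looks like raw transcription output."""
    return re.sub(r"^[A-Z .'-]+:\s*", "", labeled, flags=re.MULTILINE)

with open("diarization_train.jsonl", "w") as out:
    for path in Path("scotus_transcripts").glob("*.txt"):  # hypothetical directory
        labeled = path.read_text()
        example = {
            "messages": [
                {"role": "system", "content": "Add speaker labels to this oral-argument transcript."},
                {"role": "user", "content": strip_labels(labeled)},
                {"role": "assistant", "content": labeled},
            ]
        }
        out.write(json.dumps(example) + "\n")
```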
Yeah, we had to get it done quickly, but thank you for the offer. We'll have to get smarter about this kind of thing in the future.
I think identifying who says what is a huge deal, particularly if you can figure out who is who. One idea is making a moot court chat interface, where you argue with the judge and they argue back, in their voice (deep voice?), saying things that they'd be likely to say.
Another use case is having better transcripts so people can read along in the browser (see Oyez.org). Allowing search by sentiment is another, and so forth.
So I guess I'm not sure exactly what we want diarization for, or at least it's not one obvious thing, but it feels like the next step in this process.
There are a few ways to do this:
1. Figure out people's actual names by listening to how they address each other and/or using the information we know from scraping, like the names of the judges on the panel (see the sketch after this list).
2. Just call people speaker1, speaker2, etc.
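For the first option, here's a rough sketch of what the matching step could look like, using the panel names we already scrape to turn generic speaker labels into real names. The data shapes and labels here are hypothetical.

```python
# Sketch: map generic SPEAKER_N labels to real names by looking at how
# speakers address each other, using the judge names we already scrape.
import re
from collections import Counter

panel = ["Roberts", "Kagan", "Sotomayor"]  # from the scraped panel data

utterances = [  # output of a first diarization pass (hypothetical shape)
    {"speaker": "SPEAKER_1", "text": "Thank you, Chief Justice Roberts."},
    {"speaker": "SPEAKER_2", "text": "Counsel, your time has expired."},
]

def guess_names(utterances, panel):
    # When one speaker addresses a judge by name, the next turn by a
    # different speaker is a weak hint that that speaker is the named judge.
    votes = {}
    for i, utt in enumerate(utterances[:-1]):
        for name in panel:
            if re.search(rf"\b{name}\b", utt["text"]):
                nxt = utterances[i + 1]
                if nxt["speaker"] != utt["speaker"]:
                    votes.setdefault(nxt["speaker"], Counter())[name] += 1
    return {spk: counts.most_common(1)[0][0] for spk, counts in votes.items()}

print(guess_names(utterances, panel))  # e.g. {'SPEAKER_2': 'Roberts'}
```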
I haven't looked into how to do this, but I gather there are a bunch of AI methods these days. Definitely something to research. If anybody wants to pick this up, I'd love to see a feature/quality/price/etc. comparison across diarization methods.
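As one data point for that comparison, the usual open-source starting point seems to be pyannote.audio's pretrained diarization pipeline. Roughly, it looks like this; the model name and token handling are from memory and worth checking against their docs.

```python
# Sketch of an open-source option for the comparison: pyannote.audio's
# pretrained speaker diarization pipeline (runs on the audio, not the text).
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_xxx",  # gated model; needs a Hugging Face token
)

diarization = pipeline("oral_argument.wav")  # hypothetical audio file

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s  {speaker}")
```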