Use Speech Recognition to Transcribe Oral Argument Audio

freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.

Other

538 stars 148 forks source link

We currently have about 7500 hours of oral argument audio without transcriptions. We need to go through these audio files and run a speech to text tool on them. This would have massive benefits:

Alerts based on things said in a court!
Transcription search
Written transcriptions

The research in this area seems to be taking a few different paths from what I've gathered. The tech industry mostly needs this for when people are talking to Siri, so most of the research is making it better able to hear little phrases rather than complex transcriptions like we have.

The other fission is between cloud-based APIs and software that you can install. Cloud based APIs have the best quality, and tend to be fairly turnkey. OTOH, installable software can be tuned to the corpus we have (legal audio), and doesn't have API limits or costs associated with it.

The good news seems to be that unified APIs seem to be bubbling to the surface. For example, here's a Python library that lets you use:

CMU Sphinx (an installable)
Google Speech Recognition
Wit.ia (some thing that Facebook apparently now owns)
IBM Speech to Text (1000 hours/month free, I think)
AT&T Speech to Text

Pretty solid number of choices in a single library. On top of these, there are a few other providers that only do speech recognition, and even YouTube (where @brianwc now works) does captions on videos. We've talked to a few of the speech-to-text startups, but none have had any interest in helping out a non-profit. Start-ups, am I right?

Anyway, there's clearly a lot to do here. An MVP might be to figure out the API limits and start pushing to the cloud using as many of the APIs as needed, though that probably brings a lot of complexity and variance in quality. Even using IBM's free tool, we could knock out our current collection in about eight or nine months. More comments on this over on hacker news too.

PS: I swear there used to be a bug for this, but I can't find it, so I'm loading this one with keywords like transcription, audio, oral arguments, transcribe, recognition...

OK, this is getting closer to done:

In https://github.com/freelawproject/courtlistener/pull/4137, we tweaked things to reduce the hallucination problem by removing the case name as a prompt. The remaining "hallucinations" are probably better described as OpenAI doing its darnedest to make sense of horrible recordings. A human could do better, but some of these are very bad.
In https://github.com/freelawproject/doctor/pull/194, added support to doctor to downsample mp3s and convert them to ogg, so that most of our large files are small enough, and in https://github.com/freelawproject/courtlistener/pull/4141, we added it to CourtListener.

A few remaining things:

We still have 154 files, (0.16%) that are too big to process.

There seems to be a two hour limit in the OpenAI API, so we either need to either:
- cut some of these into smaller pieces that we then merge back together;
- only do the first two hours of these and then tell the user as such;
- we need to try speeding them up to see if we still get good enough quality; or
- just throw an error for users that says, "Sorry, this file was too big to transcribe."
I think I lean towards trying a sped-up version, then trying just the first two hours, then throwing an error if none of that works.
We haven't set up a process to do this for every new item we scrape. It cost about $21k to do 90k files, so that's 23¢ each. I think we can commit to that — unless we go and get thousands of additional files!
We need a new issue to discuss how we'll add these files to the UI (and we need to do so).
We need to do a blog post, etc.

What else?

freelawproject / courtlistener

Use Speech Recognition to Transcribe Oral Argument Audio #440