Use Speech Recognition to Transcribe Oral Argument Audio

mlissner commented 8 years ago

We currently have about 7500 hours of oral argument audio without transcriptions. We need to go through these audio files and run a speech to text tool on them. This would have massive benefits:

Alerts based on things said in a court!
Transcription search
Written transcriptions

The research in this area seems to be taking a few different paths from what I've gathered. The tech industry mostly needs this for when people are talking to Siri, so most of the research is making it better able to hear little phrases rather than complex transcriptions like we have.

The other fission is between cloud-based APIs and software that you can install. Cloud based APIs have the best quality, and tend to be fairly turnkey. OTOH, installable software can be tuned to the corpus we have (legal audio), and doesn't have API limits or costs associated with it.

The good news seems to be that unified APIs seem to be bubbling to the surface. For example, here's a Python library that lets you use:

CMU Sphinx (an installable)
Google Speech Recognition
Wit.ia (some thing that Facebook apparently now owns)
IBM Speech to Text (1000 hours/month free, I think)
AT&T Speech to Text

Pretty solid number of choices in a single library. On top of these, there are a few other providers that only do speech recognition, and even YouTube (where @brianwc now works) does captions on videos. We've talked to a few of the speech-to-text startups, but none have had any interest in helping out a non-profit. Start-ups, am I right?

Anyway, there's clearly a lot to do here. An MVP might be to figure out the API limits and start pushing to the cloud using as many of the APIs as needed, though that probably brings a lot of complexity and variance in quality. Even using IBM's free tool, we could knock out our current collection in about eight or nine months. More comments on this over on hacker news too.

PS: I swear there used to be a bug for this, but I can't find it, so I'm loading this one with keywords like transcription, audio, oral arguments, transcribe, recognition...

mlissner commented 8 years ago

I'm told by a friend that Trint might be an option to look at. Looks more integrated than we probably want though.

waldoj commented 8 years ago

FWIW, I tested the quality of court audio transcription (using Virginia state court audio), and posted my conclusions here. Speechmatics offered the best bang for the buck.

mlissner commented 8 years ago

Really good to know where the quality is, @waldoj. Their prices are crazy though. Our 7500 hours of content at the price you mentioned (6¢/minute) comes to $27k. We'd need some sort of non-profit agreement...

mlissner commented 8 years ago

...or money.

djeraseit commented 8 years ago

I sent a request to Google Cloud Speech for beta access as it would probably be the most accurate with their natural language processing system. Unfortunately, each audio clip can be a maximum of 2 minutes long.

djeraseit commented 8 years ago

This project looks promising for long term viability: https://github.com/pannous/tensorflow-speech-recognition

Google has open sourced the software for neural network and Facebook open sourced the hardware.

mlissner commented 8 years ago

@djeraseit We've got beta access to Google Cloud Speech and we're playing with it. It seems to work, but the quality is actually very low. Right now we're attempting to process as much audio as possible before Google makes it a pay service.

The max of 2 minutes was also lifted, btw.

djeraseit commented 8 years ago

IBM Watson has speech to text API. First 1,000 minutes per month are free.

http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/speech-to-text.html

mlissner commented 8 years ago

Yep, this is in my original ticket, above:

IBM Speech to Text (1000 hours/month free, I think)

mlissner commented 7 years ago

Big news here. We've signed a contract with Google, and they are giving us credits to transcribe all our existing oral argument audio recordings using their Speech to Text API. They will also be giving us credits going forward to do this on an ongoing basis.

We'll be announcing this properly once we've wrapped up this bug, but for now I want to get to work building this feature. I've outlined the process below, but could really use help getting it done, as I'm stretched quite thin:

[x] Add a new field for the transcripts in the database. Name it transcript_google or similar, so we know where the field's value came from.
[x] Add a new search field to the Audio search endpoint.
- [x] Make this work for oral argument alerts (should happen automatically)
[ ] Figure out how to use the speech to text API to create celery tasks that send the correct audio files to the API.
- [ ] Limit of 80 minutes per track.
- [ ] Hard limit of 1.72 million seconds per day, and a rate limit of 25 seconds of uploaded audio per second in the day, which works out to 2.16 million seconds per day. Therefore, we'll hit the daily cap before the rate cap. We currently have 50,540,000 seconds of audio, so this will take about 30 days.
- [x] Files must be formatted as LINEAR16. Sample rates can be between 8000 Hz and 48000 Hz. with 16k recommended (but do not resample)
- [x] Poll for results (open question: how long are results available?)
- [x] Files must be in the Google cloud (cannot be uploaded as base64 or with a public URL)
- [x] Phrases can be provided, but only 500 per request, 10,000 chars per request, and 100 chars per phrase.
[ ] Send all existing data to Google to generate transcripts.
[ ] Update the scraper to transcribe any new content that comes in.
[ ] Expose the transcripts on the oral argument pages (but probably hide it behind a button/warning: "Transcripts automatically generated by Google and will have errors or be difficult to read.").
[ ] Write blog post, announce it, etc.

waldoj commented 7 years ago

Yay! Are you worried about the transcript quality? (Or is that question mooted by the fact that you can get_ a transcript with Google, so comparing it to a hypothetical more expensive system is an exercise in futility?)

mlissner commented 7 years ago

My concerns are mostly mooted by getting transcripts from Google in the first place. But also, the other pretty awesome part of this deal is what's in it for Google: We're giving them all of our audio as a zip (it's like 700GB or something), and they're going to use that for their machine learning, because they need big data sources with multiple speakers. So, I'm hopeful that they'll use our data to train their system, and boom, quality will be really good.

Anyway, even if that weren't true, this is an amazing offer. Everything I've heard is that they've got the best quality speech to text of anybody.

mlissner commented 7 years ago

Some big progress here that will be landing as soon as tests complete successfully:

Our database models are updated.
I've set this up to run as a celery "chain". This means that things will be processed by a series of async tasks:
- The first task will re-encode the file in LINEAR16 format, then upload it to google storage.
- The second task will request that google do speech to text on the uploaded file.
- The third task will poll Google for results to the second task, with an exponential back off. It will begin looking for results after five minutes, then ten, twenty, fourty, etc. It will give up after waiting 320 minutes. (This API takes about as long to run as the uploaded files and our longest is about 240 minutes). Once the poll completes successfully, it will be saved to the DB, and from there to Solr.
- A final task comes through and does cleanup (though this isn't strictly necessary, because Google is set to auto-delete all content after seven days).
I've updated Solr to have the transcript results in search results.

There's still more to do here, but this is the bulk of it. The missing pieces are:

[x] Add snippets to search results (these were terrible before, should be better now).
[ ] Kick off a script to do all the old content (or just do it in a screen session). For this, it looks like we should wait for some backend upgrades from Google. In particular, they plan to soon have timestamps in the results, which they don't currently.
[ ] Figure out how to handle API limits, if they happen (I don't think we'll hit these unless we further optimize the pipeline).
[ ] Update the scraper.
[ ] Expose the transcripts on the oral argument pages (I'll want to look at the quality some more before doing this though).
[ ] Update the FAQ to mention transcripts.

ghost commented 4 years ago

Hi @mlissner, I just heard about this issue from @azeemba at the supreme court transcripts project. Would love to contribute. What's the current state of generating transcripts?

mlissner commented 4 years ago

Honestly...I'm not sure! I haven't looked at this in a while, but there's a lot of code for doing text-to-speech. I think where it wound up was that the quality wasn't good enough to read as a human, but it probably would be good enough for alerts and search. I think we still have credits with Google to convert stuff, so if you wanted to pick this up and run with it, that could be pretty great.

mlissner commented 4 years ago

FWIW, I just yanked a bunch of this code in https://github.com/freelawproject/courtlistener/commit/293b3b3cd828aa0a3bfb98b26811af286f69c428. That's not to say we shouldn't do this, but I'm a code-deleting kick and I'm deleting any code that has lingered without use.

ghost commented 4 years ago

Hi @mlissner got it. I'm thinking that trying to get really good transcripts is beyond the scope of my NLP knowledge at this point, but would be great to come back to this later on as I work on that!

mlissner commented 11 months ago

Man oh man, I haven't updated this ticket in a while. This is now quite feasible and at high quality by using Whisper. This would be a great project for a volunteer to run with.

waldoj commented 11 months ago

I love it when enough time passes that a long-standing open issue goes from implausible to highly plausible.

mlissner commented 11 months ago

This one is pretty incredible that way. The last missing piece is to get tens of thousands of dollars in compute so we can run Whisper, but we're working on that. It's become a money problem, not a technology problem.

mlissner commented 9 months ago

Another possible breakthrough here. Somebody on Reddit got the price of transcription using Whisper down to $0.00059/min. I don't know if we can replicate it, or if there are gotchas, but if that holds, that'd put our 2764800 minutes of audio at a price of about $1600.

https://www.reddit.com/r/MachineLearning/comments/16ftd9v/p_whisper_large_benchmark_137_days_of_audio/

azeemba commented 9 months ago

This is very promising! I did have a question on voice-recognition though. Whisper is going to give audio-to-text but it doesn't have any system to differentiate between different voices right? Is there value in exploring custom tooling on top that can slice the audio by voices first?

I guess even then we won't be able to automatically match the voices to the real participants but perhaps it's still better to consume?

mlissner commented 9 months ago

Yes, that's called diarization. I think we need to at least future out how it's generally done, and then we can prioritize doing it.

waldoj commented 9 months ago

Generally, diarization is a separate datastream, where you find up with a list of timestamps and speaker IDs (e.g., speaker 1, 00:00–00:54; speaker 2 00:59–1:06; etc.). You'd merge that output with the transcript output to identify speakers. If you build your pipeline right, you can add diarization at a later point.

mlissner commented 9 months ago

But how do you figure out which speaker is which?

ddcas9 commented 9 months ago

Assembly AI is pretty much state of the art for that! It would do those 2.7 million minutes for double the price give or take a few hundred I think, depending on how much you want speaker tagged.

It would be interesting to think about what would happen if big bot were an LLM that can fully leverage what Doctor probably looks like at this point. It could also, in principle, turbocharge modernizations of different components of it. I wonder if the project has had some time to experiment with this or if that's above top secret haha.

waldoj commented 9 months ago

But how do you figure out which speaker is which?

Historically, that had to be done manually. But given how common it is for speakers to identify others by name ("you may proceed, Ms. Gonzalez"), I have to believe that AI is the solution here. AssemblyAI appears to only diarize but not identify speakers, but surely somebody is going to create a tool for this soon.

ddcas9 commented 9 months ago

It seems like that's the case! After some quick maths though, I just realized that Google Cloud Speech with Vertex AI is actually slightly more affordable than Assembly at the scale of 2.7 million minutes or so. After checking the docs, it looks like the way Recognition works is by taking into account explicit name tagging as in your example.

Edit: With the team's permission, I'd love to try this with a minimal working pipeline that I've been using for something else in Google Colab. It's breaks the audio into 1 minute chunks using PyDub, and stores those in GCS, though I haven't implemented diarization support yet. I could add merging and diarization to get a complete transcript of several arguments.

grossir commented 1 month ago

This is a draft for implementing a management command to transcribe our oral arguments audio files using OpenAI's whisper-1 model. Please check it @mlissner

API considerations

No batch processing available for whisper-1
Rate limits: It depends on the user tier. whisper-1 has a Request Per Minute (RPM) limit of 50 for tier 1 and 2, 100 for tier 3 and 4, 500 for tier 5. The documentation does not list any other rate limit
Other limits: Uploaded files have a 25MB limit. In order to manage this, the documentation suggest splitting the audio file, and passing extracted text context from the previous requests to the following requests.

A quick look into the Courtlistener API shows some audio files exceeding the size limit. I will compute a better estimate using the dev DB. Some examples:

1. Duration: 2563 seconds. 36.9 MB (would need 2 requests)
2. Duration: 2965 seconds. 83.4 MB (would need 3 requests)

Having 1440 minutes in a day, assuming we can max out our "user tier RPM"

Tier 1 (50 RPM): at most 72 000 a day
Tier 2-3 (100 RPM): at most 144 000 a day
Tier 5 (500 RPM): at most 720 000 a day

Courtlistener's API currently says that we have 90 000 audio files. Some of the files exceed the 25MB limit, so we will need more than 90k requests. If our deadline is June 19th, it seems that even Tier 1 would be reasonable if we manage to have the command working this week.

Model changes

Naming

The Audio model has 2 fields to store speech-to-text. However, the naming should be corrected

stt_status: SmallIntegerField (Choices: STT_NEEDED, STT_COMPLETE, STT_FAILED)
stt_google_response: TextField

Other model fields

If we want to eventually perform diarization, it would be a good idea to store the timestamps and probabilities that the model returns. The model has 2 levels of granularity for metadata: "word" and "segment". The documentation mentions that "segment" level metadata takes no additional processing, but "word" level does.

From the docs

Note: There is no additional latency for segment timestamps, but generating word timestamps incurs additional latency.

An example of the segment level metadata. If we don't want to think about this right now, we could store the output for possible future use in a JSONB field, in a separate table

[
  ...,
 {'id': 468,
  'seek': 179600,
  'start': 1817.0,
  'end': 1820.0,
  'text': " I'm happy to answer any other questions, but I think my time is out.",
  'tokens': [51414, 286,478,2055, 281,1867,604,661,1651,11,457,286,519,452,565,307,484,13,51564],
  'temperature': 0.0,
  'avg_logprob': -0.22341343760490417,
  'compression_ratio': 1.6679104566574097,
  'no_speech_prob': 0.00955760758370161},
 {'id': 469,
  'seek': 179600,
  'start': 1820.0,
  'end': 1822.0,
  'text': ' Thank you, counsel.',
  'tokens': [51564, 1044, 291, 11, 10351, 13, 51664],
  'temperature': 0.0,
  'avg_logprob': -0.22341343760490417,
  'compression_ratio': 1.6679104566574097,
  'no_speech_prob': 0.00955760758370161},
]

Management command to transcribe audio using OpenAI's whisper-1 model

We would need a different process to transcribe audio files that exceed 25MB, since the transcription works better if text context from the previous part' transcription is sent. Thus, they should be processed sequentially and not in parallel.

For files smaller than 25MB

In a loop over a cursor of Audio objects where stt_status != STT_COMPLETE and where the Audio size is 25 MB or less

Prepare {our tier RPM} parallel requests to OpenAI API
On completion of the request, store results and/or failing status
Sleep ("exponential backoff") if a rate limit header is detected

For files bigger than 25MB

Get a list of Audio objects where stt_status != STT_COMPLETE and where the Audio size is more than 25MB
Split the file in proper size chunks, send a request for each file using the tokenized output of the already extracted text as input for the new request. Store the transcript only if all requests were successful.
Sleep ("exponential backoff") if a rate limit header is detected

mlissner commented 1 month ago

This all sounds great, thank you Gianfranco.

even Tier 1 would be reasonable

That's great, thanks. I'm working to get us up to tier 3.

"segment" level metadata takes no additional processing, but "word" level does.

I don't care if things take a little longer, but I guess if we get word-level metadata, we'll have more flexibility, right?

Split the file in proper size chunks,

OpenAI suggests using PyDub for this, but we have ffmpeg over in doctor and we usually do binary manipulation over there. I guess it'd have to return a zip so you can get back multiple binary files at once. :(

I also saw that you can add unusual words to the prompt. I'd suggest adding the case name at a minimum, and maybe adding some other court-related words, if necessary.

Do we need to change the encoding of any of the files?

What do you propose for the model changes?

grossir commented 1 month ago

I got some proper statistics and a better sense from the dev DB (86 238 audio_audio rows from 16 courts, not much difference with the current 90k)

The good news first: 98% of the audio_audio files have a duration lower than 4000 seconds, which maps to a file size in our buckets smaller than 25 MB. This allows us to transcribe most of our data using the OpenAI API without further data processing

Since we want to get this as quick as possible, I would implement and deploy the transcription for these files, and tweak the process for the longer ones.

Some files between 3500-4000 seconds of duration escape the limit size, as can be seen on the graph below. For these, we can simply test their length before sending the request, or we can make a more complex query, since the size is related to the year of creation: it seems like a change in our processing of the audio files in 2020 increased the size of the audio files or extracts incorrectly the duration (that's the upper line). The lower points belong to 2014-2019

On a second look on the difference between size/duration in different years, here are 2 examples:

one created in 2014 with duration 3029 that lasts 50:25 on the audio player (which roughly matches 3029/60 = 50.48)
another created in 2023 with duration 3028, but lasts 58:46 on the audio player

I don't care if things take a little longer, but I guess if we get word-level metadata, we'll have more flexibility, right?

I think it may help. However, the word level output loses the quantitative metadata (probabilities, "temperature"), which I don't know if the diarization algorithms use. Perhaps @quevon24, who showed me some diarization scripts, knows about this. The word level output looks like this:

{'word': 'Attorney', 'start': 2633.02001953125, 'end': 2633.139892578125}, {'word': 'Colon', 'start': 2633.139892578125, 'end': 2633.419921875}, {'word': 'and', 'start': 2633.419921875, 'end': 2633.659912109375}, {'word': 'Attorney', 'start': 2633.659912109375, 'end': 2633.9599609375}, {'word': 'Lilly', 'start': 2633.9599609375, 'end': 2634.179931640625}

Do we need to change the encoding of any of the files?

If we use our stored files, we won't have to. All files in the bucket https://storage.courtlistener.com/mp3 have an mp3 extension. Only 162 have no local_path_mp3

In my previous comment I used the download_url, that why I was getting bigger file sizes, I think.

What do you propose for the model changes?

Adding a metadata table in Postgres' JSONB type. We could also give it a more regular structure since the word level metadata is only "word", "start", "end"), but that would create many rows for each audio transcription

Then, some small changes to the Audio model, accounting for the source of the transcription

from django.db import models

class AudioTranscriptionMetadata(models.Model):
    audio = models.ForeignKey('Audio', on_delete=models.DO_NOTHING)
    metadata = models.JSONField(
        help_text="Word level metadata returned by a STT model."
        "May be used in the future for diarization. Whisper output contains "
        "the `word`, the `start` and `end` seconds"
    )

class Audio:
    ...

    STT_OPENAI_WHISPER = 1
    STT_SELF_HOSTED_WHISPER =2
    STT_SOURCES = [
        (STT_OPENAI_WHISPER, "whisper-1 model from OpenAI API"),
        (STT_SELF_HOSTED_WHISPER, "Self-hosted Whisper model"),
    ]

    stt_source = models.SmallIntegerField(
        help_text="Source used to get the transcription",
        choices=STT_SOURCES,
        blank=True,
        null=True,
    )

    # rename current stt_google_response field
    stt_transcript = models.TextField(
        help_text="Speech to text transcription",
        blank=True,
    )

The text output has punctuation but no spaces. It may need some formatting for visualization? I guess that after diarization that should be a lot easier

"The next case today is Marvin Pagan-Lisboa et al. versus Social Security Administration et al., number 20-1377. Attorney Colon, please introduce yourself for the record and proceed with your argument. Good morning. This is Attorney Javier Andres Colon-Volgamor. I'm representing Ms. Marie Pagan-Lisboa and Mr. Daniel Justiniano, and a class of all The petitioners were initially claimants, then plaintiffs, and now appellants. I'd like to reserve two minutes for rebuttal. I'll make a brief statement and use the remainder of my time to answer any questions that this panel may have. A brief three-ish for the most. The Social Security Administration is abusing its power. That is our position, and I say it in the present tense. They started in 2013 when they began implementing a new Section 205U redetermination policy, and they keep on changing the process as they go along. They avoid challenges to their policy by dismissing these cases, and now, after Mr. Justiniano has finally exhausted administrative procedures, they want to get these cases back under their jurisdictions. In this fight, my clients have the law, and by the law, I mean the Constitution, the statute, the regulation, judicial precedents, think Hicks and Matthews v. Eldridge, and the legislative history. My clients have the facts on their side. They have any formulation of justice is on their side. Any sense of compassion or fulfilling the purpose for which SSA was created in the first place is on their side, and logical reasoning. As of now, have their benefits been restored? None. No. There was a, and here I'm getting out of the message that I wanted to give, but there was the district court order to restore Ms. Pagan's benefits, and they have not been restored because the agency has unilaterally stayed the process and frozen her process by saying that that's what they do. But that in itself is contrary to the rules of civil procedure 62 and the federal rules of appellate procedure 8, where they have to request a stay of the district court's order to be able to stay it. And anyways, neither the Social Security Act, the regulation, nor the HALICS instructions, none of them contemplate staying or holding, freezing claims while they are before the appeals, the circuit of appeals. Well, counsel, I mean, the, just to follow up on Judge Thompson's question, the, I mean, the district court did enter an order with respect to appellate Pagan, specifically stating that the commissioner shall reinstate plaintiff's benefits. The last time I met with Ms. Pagan, I had to meet with her at her bed. And I'm sorry that I'm getting, it's, let me take a second to pull myself back in. Last time I met with Ms. Pagan, I had to meet at her bed because she didn't have the ability to get out and about. She did make the effort to move to her dining room so that we ended up having lunch. But no, Ms. Pagan remains without the benefits and more importantly, remains without Medicare...

mlissner commented 1 month ago

In my previous comment I used the download_url, that why I was getting bigger file sizes, I think.

Yes, that's the link to the original file from the court, not our optimized version.

Only 162 have no local_path_mp3

We should do a quick re-run to process those. I imagine a little script could just be pasted into ipython.

I think the model changes seem OK to me. I think we'll wind up tweaking them later, but if we're saving everything, then I think we're good to go.

flooie commented 1 month ago

@grossir

Are you suggesting we throw away the response from whisper and store the transcript like that?

grossir commented 1 month ago

@flooie I am using the code below. We would store the "plain" text in the Audio.stt_transcript field, and the JSON with [{'word': 'The', 'start': 0.0, 'end': 0.5}, {'word': 'next', 'start': 0.5, 'end': 0.8799999952316284}] in another table for diarization use. I don't think we would we throwing away much, or is there more metadata stored someplace?

I also see that it returns a "duration", perhaps we could use that to correct the incorrect durations we have found

from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
)

audio_file = open("75010_16.375178.mp3", "rb")
transcript = client.audio.transcriptions.create(
  file=audio_file,
  model="whisper-1",
  language="en",
  response_format="verbose_json",
  timestamp_granularities=["word"]
)

In [97]: transcript.to_dict().keys()
Out[97]: dict_keys(['text', 'task', 'language', 'duration', 'words'])

In [98]: transcript.to_dict()['duration']
Out[98]: 2636.330078125

In [99]: transcript.to_dict()['language']
Out[99]: 'english'

In [100]: transcript.to_dict()['task']
Out[100]: 'transcribe'

In [101]: transcript.to_dict()['words'][:2]
Out[101]: 
[{'word': 'The', 'start': 0.0, 'end': 0.5},
 {'word': 'next', 'start': 0.5, 'end': 0.8799999952316284}]

In [102]: transcript.to_dict()['text'][:100]
Out[102]: 'The next case today is Marvin Pagan-Lisboa et al. versus Social Security Administration et al., numb'

mlissner commented 1 month ago

Yeah, let's add the duration, since the way we do it isn't very accurate.

Gianfranco, isn't the thing we're throwing away the confidence values that is only available if we request segments? My understanding is we can choose between word timings and segment confidences?

grossir commented 1 month ago

In the end we can request both in 1 request. timestamp_granularities takes a list.

transcript = client.audio.transcriptions.create(
  file=audio_file,
  model="whisper-1",
  language="en",
  response_format="verbose_json",
  timestamp_granularities=["word", "segment"]
)

And, we can store them both in the same JSONB field in AudioTranscriptionMetadata

mlissner commented 1 month ago

Now we're talking!

flooie commented 1 month ago

I ran some numbers, that I think can help hone in exactly what we want to do. I grabbed all the size and duration information that was available.

of the ~90k audio files, about 200 are missing duration information - and a handful are missing files. Excluding those from this analysis

1,505 files are larger than 25 MB
74 are larger than 50 MB
7 are larger than 100

I would be curious if cutting the bitrate in half for the 25-50 MB would affect the quality of the output enough. I suspect it's good enough. That would get us to 99.917% at or under 25MB. And then we can handle those outliers.

I made a histogram of the durations and the file sizes. size_mb

waldoj commented 1 month ago

FWIW, if the files are stereo, you'd want to drop them to mono. And you might also experiment with reducing the sample rate. As low as 8 kHz may get equally-good results.

(I'm sorry if I'm just driving by and dropping in advice that's quite obvious to you!)

flooie commented 1 month ago

@waldoj

oh thats fantastic - Im not an audiophile or well informed on audio formats. But reducing to mono 8k would probably get everything under the magic 25mb threshold

grossir commented 4 weeks ago

On the OpenAI forums there are comments about doing that, and from the feedback, it seems to work

I take a 64k stereo mp3 and mash it with OPUS in an OGG container down to 12kbps mono, also using the speech optimizations. Command line is below: ffmpeg -i audio.mp3 -vn -map_metadata -1 -ac 1 -c:a libopus -b:a 12k -application voip audio.ogg

I haven't tried it myself yet, but @flooie sent me a 5 hours - 20MB file that returned an InternalServerError from the API, repeteadly: InternalServerError: Error code: 500 - {'error': {'message': 'The server had an error processing your request. Sorry about that! You can retry your request, or contact us through our help center at help.openai.com if you keep seeing this error. (Please include the request ID req_b3716aa50fcb212e6a813a807215a382 in your email.)', 'type': 'server_error', 'param': None, 'code': None}}

Maybe there was a problem with the re-encoding, maybe there is an undocumented audio length limit?

grossir commented 4 weeks ago

Some datapoints from testing a draft of the transcription command #4102

Using the case name as prompt helps getting names right. For this example case "Tafolla v. S.C.D.A.O. ". Without case name:

Tofolla, I'm not sure of the pronunciation, versus Heilig. Thank you. So Mr. Bergstein, before we start, pronounce your client's last name. Tofolla. Tofolla. That's how I do it. OK, well, you know better than we do. All right, so you have 10 minutes, but you reserve three minutes for rebuttal. Correct. So you may proceed. OK, we have two reasonable accommodation periods relevant to this case. The first one involved plaintiff's office interaction with Joseph Carroll on January 7, 2014, when we argue the jury can find that defendants violated the ADA when Carroll ordered plaintiff to archive the closed files because the order contradicted the medical note that plaintiff's doctor prepared a couple days ago. Let me ask you about that, Mr. Tofolla, because as I was reading the briefs on this, it just seemed like this case- He's Bergstein. I'm sorry. Bergstein. Thank you. You got Tofolla in my head. This case seems to be a disagreement over what the doctor notes said. It seems to me that the

With case name:

or Tafolla, I'm not sure of the pronunciation, versus Heilig. So Mr. Bergstein, before we start, pronounce your client's last name. Tafolla. Tafolla, that's how I do it. Okay, well, you know better than we do. All right, so you have ten minutes, but you reserve three minutes for rebuttal. Correct. So you may proceed. Okay, we have two reasonable accommodation periods relevant to this case. The first one involved plaintiff's office interaction with Joseph Carroll on January 7, 2014, when we argue the jury can find that defendants violated the ADA when Carroll ordered plaintiff to archive the closed files because the order contradicted the medical note that plaintiff's doctor prepared a couple days earlier. Let me ask you about that, Mr. Tafolla, because as I was reading the briefs on this, it just seemed like this case- He's Bergstein. I'm sorry. Bergstein. You got Tafolla in my head. This case seems to be a disagreement over what the doctor notes said. It seems to me that they were wil

Response time is fast, in a direct linear relationship with the audio duration. As a caveat, they were sequential requests

I did some manual testing, I think around ~30 requests. Also, I tested the command by running it against 97 audio files. However, we got billed 319 API requests. I don't know the reason for the difference

mlissner commented 4 weeks ago

Also, I tested the command by running it against 97 audio files. However, we got billed 319 API requests. I don't know the reason for the difference

I went and looked. No idea either, but we should keep an eye on this and make sure we understand it before we run up a 3× larger bill than we expect.

waldoj commented 4 weeks ago

That's a great observation on priming the transcription with the case name! That would seem to indicate that it would be helpful to include any metadata in the prompt that might appear within the audio.

mlissner commented 4 weeks ago

Maybe a bunch of legalese, but I can't think of a good list? I guess we'll have to see what's bad and plug holes as we see them.

waldoj commented 4 weeks ago

I was only thinking in terms of the name of the judge, the names of the lawyers, that kind of thing, but that's a good point! If there were any specialized legal terms that transcription consistently got wrong, the prompt could prime for them.

flooie commented 4 weeks ago

What about using the final opinion text for priming the audio

mlissner commented 4 weeks ago

We won't have that when we get recordings in the future, but even at this point we don't have them linked (though we should). V2, perhaps!

flooie commented 4 weeks ago

You know what are posted with oral arguments are briefs

mlissner commented 1 week ago

Just to follow up here: We've generated transcripts for nearly every oral argument file in CL. We are doing some cleanup for:

868 audio files that didn't get processed the first time (status 0)
103 that failed (status 2)
1481 that hallucinated (status 3)
1893 that are > 25MB (status 4)
182 with missing files (status 5)

freelawproject / courtlistener