mlissner opened this issue 3 months ago
I think I could do that for you. Not only sentiment, but also things like "extract all questions asked by Justice Alito", or "what were the main issues discussed in case X". Where can I find the transcripts? Are they split by speakers?
Cool! If you're taking this on, a few thoughts:
Let's start small. What's the smallest, simplest thing we can do here that would make a useful feature?
We'll want to integrate this into all our services. For example:
/audio/123/slug/
To do the above, we usually need at least three pieces:
Once that's all done, we'll have the data part in place, and we'll want to update the API, search, etc to start exposing it. Do a blog post, etc.
Where can I find the transcripts?
They're on the audio API.
Are they split by speakers?
No, unfortunately they're not. I just created freelawproject/foresight#21 to start that discussion, but we really haven't done much with it yet. Not sure if it should come before or after this feature though.
This is a very simple prototype. I just uploaded the transcripts from the SC website for the Argument Session: April 15, 2024 - April 25, 2024 and played with the prompt. It could be implemented and trained in more detail, but this is just to demonstrate the feasibility. I normally use Python + Postgres + some libraries for embedding and interfaces with LLMs + FastAPI.
Sorry, where can I find the transcriptions here? https://www.courtlistener.com/api/rest/v3/audio/
An example of transcription and speaker identification by AssemblyAI; this was the mp3: https://archive.org/download/gov.uscourts.ca9.23-55690/gov.uscourts.ca9.23-55690.2024-07-12.mp3
If you could share one of your transcripts, I could see how to prompt the model to return the text divided by speakers.
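For reference, the AssemblyAI result above comes from its speaker diarization feature (the speaker_labels option in its Python SDK). A minimal sketch of that call, assuming the SDK's documented Transcriber/TranscriptionConfig API; the API key, audio URL, and the format_utterances helper are placeholders of mine, not anything from this thread:

```python
from collections import namedtuple

# Minimal stand-in for AssemblyAI's utterance objects, which expose
# `.speaker` (e.g. "A") and `.text` when speaker_labels is enabled.
Utterance = namedtuple("Utterance", ["speaker", "text"])


def format_utterances(utterances):
    """Render diarized utterances as 'Speaker X: dialogue' lines."""
    return "\n\n".join(f"Speaker {u.speaker}: {u.text}" for u in utterances)


def diarize(audio_url):
    """Transcribe one oral-argument file with speaker labels (needs an API key)."""
    import assemblyai as aai  # imported lazily so the sketch runs without the SDK

    aai.settings.api_key = "your_assemblyai_api_key"
    config = aai.TranscriptionConfig(speaker_labels=True)
    transcript = aai.Transcriber().transcribe(audio_url, config=config)
    return format_utterances(transcript.utterances)
```

The speaker tags come back as anonymous labels (A, B, C, ...), so mapping them to actual names (Judge, counsel) would still need a post-processing step.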
Sorry, where can I find the transcriptions here? https://www.courtlistener.com/api/rest/v3/audio/
@legaltextai The text transcription you're looking for is stored in the field named stt_transcript. The following URL will show a list of OA files where the transcription process has been completed successfully:
https://www.courtlistener.com/api/rest/v3/audio/?stt_status=1
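Pulling those transcripts down programmatically might look like this. A sketch assuming the stt_status filter and stt_transcript field mentioned above, plus the standard next-link pagination that Django REST Framework APIs use; the function names are mine:

```python
import json
from urllib.request import urlopen

# The filtered audio endpoint from the reply above.
API_URL = "https://www.courtlistener.com/api/rest/v3/audio/?stt_status=1"


def extract_transcripts(page):
    """Pull (id, transcript) pairs out of one page of the audio API response."""
    return [
        (item["id"], item["stt_transcript"])
        for item in page.get("results", [])
        if item.get("stt_transcript")
    ]


def fetch_completed_transcripts(max_pages=1):
    """Fetch audio items whose speech-to-text step finished (stt_status=1)."""
    url, transcripts = API_URL, []
    for _ in range(max_pages):
        with urlopen(url) as resp:
            page = json.load(resp)
        transcripts.extend(extract_transcripts(page))
        url = page.get("next")  # DRF-style pagination link
        if not url:
            break
    return transcripts
```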
Got it, thanks!
From what I have read, diarization is an audio-based task, so a text-only approach may not be optimal, and it might be costly (basically, you are going through the same files twice). Unless, again, you get free credits from Anthropic or OpenAI. How many hours of audio do you have in your database?
See if this code works for you. You will need to install Anthropic (pip install anthropic) and get their API key.
import re

import anthropic

# Create the client once, rather than on every chunk.
client = anthropic.Anthropic(
    api_key="your_anthropic_api_key",
)


def divide_transcript_chunk(chunk, last_speaker=None):
    """Ask the model to attribute each passage of a chunk to a speaker."""
    prompt = f"""
You are an expert at dividing transcripts among speakers.
Given a chunk of transcript, please divide it among the speakers.
Follow these rules strictly:
1. Format your response as:
Speaker Name: Their complete dialogue...
2. Start a new line for each speaker change.
3. Do not include speaker labels within the dialogue itself.
4. If a speaker's dialogue is interrupted by another speaker and then continues, combine their dialogue into a single entry.
5. Do not include any meta-information, instructions, or explanatory text in your response.
6. If the chunk starts mid-sentence or mid-dialogue, attribute it to the correct speaker if clear from context.
7. Use consistent speaker names throughout (e.g., "Judge", "Mr. Roth", "Mr. Zellman").
8. Begin your response immediately with the first speaker's name and dialogue, without any preamble.
9. If the speaker is not explicitly named, use descriptive titles like "Judge" or "Attorney" based on context.
10. It is absolutely crucial to include every single word from the input in your output, even if you're unsure about speaker attribution.
11. Do not summarize, paraphrase, or omit any part of the input. Include everything verbatim.
{'The last speaker from the previous chunk was: ' + last_speaker if last_speaker else 'This is the start of the transcript.'}
If this chunk starts with a continuation of that speaker's dialogue, begin with their name.
"""
    message = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=4000,
        temperature=0,
        system=prompt,
        messages=[{"role": "user", "content": [{"type": "text", "text": chunk}]}],
    )
    return message.content[0].text if isinstance(message.content, list) else message.content


def clean_divided_chunk(chunk):
    """Merge continuation lines into the preceding speaker's entry."""
    lines = chunk.split('\n')
    cleaned_lines = []
    current_speaker = "Unknown Speaker"
    current_dialogue = []
    for line in lines:
        if ':' in line:
            if current_dialogue:
                cleaned_lines.append(f"{current_speaker}: {' '.join(current_dialogue)}")
            current_speaker, dialogue = line.split(':', 1)
            current_dialogue = [dialogue.strip()]
        else:
            current_dialogue.append(line.strip())
    if current_dialogue:
        cleaned_lines.append(f"{current_speaker}: {' '.join(current_dialogue)}")
    return '\n\n'.join(cleaned_lines)


def remove_overlap(previous_chunk, current_chunk):
    """Drop the longest prefix of current_chunk that already appears verbatim
    in previous_chunk. Matching is exact, so re-labeled speakers can defeat it."""
    words = current_chunk.split()
    for i in range(len(words), 0, -1):
        if ' '.join(words[:i]) in previous_chunk:
            return ' '.join(words[i:])
    return current_chunk


def get_last_speaker(chunk):
    """Find the last 'Speaker Name:' label in a divided chunk."""
    matches = re.findall(r'^([A-Za-z\s.]+):', chunk, re.MULTILINE)
    return matches[-1] if matches else None


def process_transcript(transcript):
    chunk_size = 1500  # words per chunk sent to the model
    overlap = 300      # words shared between consecutive chunks
    words = transcript.split()
    step = chunk_size - overlap
    chunks = [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), step)]
    divided_chunks = []
    previous_chunk = ""
    last_speaker = None
    for i, chunk in enumerate(chunks):
        print(f"Processing chunk {i + 1} of {len(chunks)}...")
        divided_chunk = clean_divided_chunk(divide_transcript_chunk(chunk, last_speaker))
        if previous_chunk:
            divided_chunk = remove_overlap(previous_chunk, divided_chunk)
        divided_chunks.append(divided_chunk)
        previous_chunk = divided_chunk
        last_speaker = get_last_speaker(divided_chunk)
    return '\n\n'.join(divided_chunks)


def verify_transcript(original, divided):
    """Set-based sanity check; punctuation and casing differences will show up
    as false positives, so treat this as a rough signal only."""
    original_words = set(original.lower().split())
    divided_words = set(divided.lower().split())
    missing_words = original_words - divided_words
    extra_words = divided_words - original_words
    print("Verification Summary:")
    print(f"Original word count: {len(original.split())}")
    print(f"Divided word count: {len(divided.split())}")
    print(f"Words potentially missing: {len(missing_words)}")
    print(f"Words potentially added: {len(extra_words)}")
    if missing_words:
        print("\nSome missing words:", list(missing_words)[:20])
    if extra_words:
        print("\nSome extra words:", list(extra_words)[:20])
    return not missing_words and not extra_words


# Read the transcript from a file
with open('transcript.txt', 'r') as file:
    transcript = file.read()

divided_transcript = process_transcript(transcript)
is_verified = verify_transcript(transcript, divided_transcript)

with open('divided_transcript.txt', 'w') as file:
    file.write(divided_transcript)

print("Transcript division complete. Results written to 'divided_transcript.txt'.")
if is_verified:
    print("Verification passed: All content accounted for.")
else:
    print("Verification note: Some differences detected. Check the divided transcript for details.")
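Once a transcript is in the Speaker: dialogue format this script emits, use cases from earlier in the thread, like "extract all questions asked by Justice Alito", reduce to simple text processing. A rough, self-contained sketch (the sentence splitting is naive and just for illustration):

```python
import re


def questions_by_speaker(diarized_transcript, speaker):
    """Collect the question sentences a given speaker asks in a
    'Speaker: dialogue' transcript."""
    questions = []
    for line in diarized_transcript.splitlines():
        if ':' not in line:
            continue
        name, dialogue = line.split(':', 1)
        if name.strip() != speaker:
            continue
        # Naive sentence split; anything ending in '?' counts as a question.
        for sentence in re.split(r'(?<=[.?!])\s+', dialogue.strip()):
            if sentence.endswith('?'):
                questions.append(sentence)
    return questions
```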
Attached is a diarization produced by running this script over the transcript I found here.
That's pretty great! I just reached out to AssemblyAI to see if we can forge a partnership there. It's wild times.
I hope it works out. If it does not, maybe re-running Whisper with diarization options like this one or this one could be an option. My offer to use my server with GPU still stands. PS. If you end up using AssemblyAI, think of those future use cases you mentioned and see what's the best output format you'd like Assembly to produce, and prompt accordingly. For example, you may decide a JSON output would be the best for your needs. PPS. If you want, I can work with AssemblyAI on the next steps to get it done.
That's pretty great! I just reached out to AssemblyAI to see if we can forge a partnership there. It's wild times.
You meant Assembly, not Anthropic, right?
I did mean Assembly, and I said Assembly? :)
That gave me a 404?
Sorry, try this one: https://helpmefindcase.com/supreme_court_transcripts
You got there as I was changing a couple of things :-)
That is pretty fun. Impressive how you can crank these out!
Now that we've got about 100k transcripts in our oral argument collection, perhaps a next step would be to add sentiment analysis. I think this is pretty easy stuff these days, either through an AI call or more basic techniques. Adding this to the search engine so it's alertable would be really neat too:
Some potential headlines:
Could be cool. Too much to do!
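For the "more basic techniques" end of the spectrum, a lexicon-based scorer is about the simplest possible sketch; the word lists below are illustrative stand-ins, not a real sentiment lexicon like VADER's:

```python
# Toy lexicon-based sentiment scorer; real lexicons (e.g. VADER's) are far
# larger and weighted, and an LLM call would handle legal language better.
POSITIVE = {"agree", "persuasive", "helpful", "correct", "well"}
NEGATIVE = {"troubled", "concerned", "confusing", "wrong", "skeptical"}


def sentiment_score(text):
    """Return a score in [-1, 1]: positive minus negative word hits,
    normalized by total word count."""
    words = [w.strip('.,?!').lower() for w in text.split()]
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)
```

Run per speaker turn over a diarized transcript and you get exactly the kind of per-Justice signal those headlines would need.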