freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Add sentiment analysis to oral argument transcripts #4135

Open mlissner opened 2 months ago

mlissner commented 2 months ago

Now that we've got about 100k transcripts in our oral argument collection, perhaps a next step would be to add sentiment analysis. I think this is pretty easy stuff these days, either through an AI call or more basic techniques. It could also be really neat to add this to the search engine so it's alertable:

Some potential headlines:

Could be cool. Too much to do!
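
As one illustration of the "more basic techniques" route, here is a minimal sketch using NLTK's VADER sentiment scorer; the snippet and the way the score is used are illustrative only, not part of CourtListener:

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download

sia = SentimentIntensityAnalyzer()
snippet = "Counsel, that argument simply does not hold up."
scores = sia.polarity_scores(snippet)

# `compound` ranges from -1 (most negative) to +1 (most positive),
# which would map naturally onto a sortable/alertable search field.
print(scores["compound"])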

legaltextai commented 1 month ago

I think I could do that for you. Not only sentiment, but also things like "extract all questions asked by Justice Alito" or "what were the main issues discussed in case X". Where can I find the transcripts? Are they split by speakers?

mlissner commented 1 month ago

Cool! If you're taking this on, a few thoughts:

  1. Let's start small. What's the smallest, simplest thing we can do here that would make a useful feature?

  2. We'll want to integrate this into all our services. For example:

    • The API
    • Bulk data
    • Search engine
    • Podcasts (this comes from search engines)
    • Webpages (the /audio/123/slug/ pages)

To do the above, we usually need three pieces at least:

  1. The first thing to do is to update the database model (see the sketch after this list). Do you know enough Django to do a PR for that?
  2. From there, the next thing to do is usually to update the pipeline to start processing new content as it comes in.
  3. Finally, we'll want a script of some kind to process the existing cases.
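
For step 1, a minimal sketch of what the model change could look like, assuming a hypothetical sentiment_score field on the existing Audio model (the real field names and semantics would be decided in the PR):

from django.db import models

class Audio(models.Model):
    # ... existing fields such as stt_transcript ...

    # Hypothetical field; name, range, and nullability are illustrative.
    sentiment_score = models.FloatField(
        null=True,
        blank=True,
        help_text="Overall sentiment of the oral argument, from -1.0 to 1.0.",
    )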

Once that's all done, we'll have the data part in place, and we'll want to update the API, search, etc. to start exposing it, do a blog post, and so on.


Where can I find the transcripts?

They're on the audio API.

Are they split by speakers?

No, unfortunately they're not. I just created #4203 to start that discussion, but we really haven't done much with it yet. Not sure if it should come before or after this feature though.

legaltextai commented 1 month ago

This is a very simple prototype. I just uploaded the transcripts from the SC website for the Argument Session: April 15, 2024 - April 25, 2024 and played with the prompt. It could be implemented and trained in more detail, but this is just to demonstrate feasibility. I normally use Python + Postgres + some libraries for embeddings and LLM interfaces + FastAPI.

legaltextai commented 1 month ago

Sorry, where can I find the transcriptions here? https://www.courtlistener.com/api/rest/v3/audio/

legaltextai commented 1 month ago

An example of transcription and speaker identification by AssemblyAI. This was the mp3: https://archive.org/download/gov.uscourts.ca9.23-55690/gov.uscourts.ca9.23-55690.2024-07-12.mp3
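
For context, a sketch of how that kind of speaker-labeled output can be produced with AssemblyAI's Python SDK, assuming its documented speaker_labels option:

import assemblyai as aai

aai.settings.api_key = "your_assemblyai_api_key"

config = aai.TranscriptionConfig(speaker_labels=True)
transcript = aai.Transcriber().transcribe(
    "https://archive.org/download/gov.uscourts.ca9.23-55690/gov.uscourts.ca9.23-55690.2024-07-12.mp3",
    config,
)

# Each utterance carries a speaker label (A, B, C, ...) and its text.
for utterance in transcript.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.text}")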

legaltextai commented 1 month ago

If you could share one of your transcripts, I could see how to prompt the model to return the text divided by speakers.

ERosendo commented 1 month ago

Sorry, where can I find the transcriptions here? https://www.courtlistener.com/api/rest/v3/audio/

@legaltextai The text transcription you're looking for is stored in the field named stt_transcript. The following URL will show a list of OA files where the transcription process has been completed successfully:

https://www.courtlistener.com/api/rest/v3/audio/?stt_status=1
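
For instance, a minimal sketch of pulling those records with the requests library (this assumes the standard paginated response shape with a results key):

import requests

url = "https://www.courtlistener.com/api/rest/v3/audio/"
resp = requests.get(url, params={"stt_status": 1})
resp.raise_for_status()

# Each result should carry the completed transcript in stt_transcript.
for audio in resp.json()["results"]:
    print(audio["id"], (audio.get("stt_transcript") or "")[:80])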

legaltextai commented 1 month ago

Got it, thanks!

legaltextai commented 1 month ago

From what I have read, diarization is an audio-based task, so a text-only approach may not be optimal, and it might be costly (basically, you are going through the same files twice), unless, again, you get free credits from Anthropic or OpenAI. How many hours of audio do you have in your database?

See if this code works for you. You will need to install the Anthropic SDK (pip install anthropic) and get their API key.

import anthropic
import re

def divide_transcript_chunk(chunk, last_speaker=None):
    """Ask Claude to split one chunk of transcript text by speaker."""
    client = anthropic.Anthropic(
        api_key="your_anthropic_api_key",
    )

    prompt = f"""
    You are an expert at dividing transcripts among speakers. 
    Given a chunk of transcript, please divide it among the speakers. 
    Follow these rules strictly:
    1. Format your response as:
       Speaker Name: Their complete dialogue...
    2. Start a new line for each speaker change.
    3. Do not include speaker labels within the dialogue itself.
    4. If a speaker's dialogue is interrupted by another speaker and then continues, combine their dialogue into a single entry.
    5. Do not include any meta-information, instructions, or explanatory text in your response.
    6. If the chunk starts mid-sentence or mid-dialogue, attribute it to the correct speaker if clear from context.
    7. Use consistent speaker names throughout (e.g., "Judge", "Mr. Roth", "Mr. Zellman").
    8. Begin your response immediately with the first speaker's name and dialogue, without any preamble.
    9. If the speaker is not explicitly named, use descriptive titles like "Judge" or "Attorney" based on context.
    10. It is absolutely crucial to include every single word from the input in your output, even if you're unsure about speaker attribution.
    11. Do not summarize, paraphrase, or omit any part of the input. Include everything verbatim.

    {'The last speaker from the previous chunk was: ' + last_speaker if last_speaker else 'This is the start of the transcript.'}
    If this chunk starts with a continuation of that speaker's dialogue, begin with their name.
    """

    query = chunk

    message = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=4000,
        temperature=0,
        system=prompt,
        messages=[{"role": "user", "content": [{"type": "text", "text": query}]}]
    )

    return message.content[0].text if isinstance(message.content, list) else message.content

def clean_divided_chunk(chunk):
    """Merge each speaker's lines into a single 'Speaker: dialogue' entry."""
    lines = chunk.split('\n')
    cleaned_lines = []
    current_speaker = "Unknown Speaker"
    current_dialogue = []

    for line in lines:
        if ':' in line:
            if current_dialogue:
                cleaned_lines.append(f"{current_speaker}: {' '.join(current_dialogue)}")
            current_speaker, dialogue = line.split(':', 1)
            current_dialogue = [dialogue.strip()]
        else:
            current_dialogue.append(line.strip())

    if current_dialogue:
        cleaned_lines.append(f"{current_speaker}: {' '.join(current_dialogue)}")

    return '\n\n'.join(cleaned_lines)

def remove_overlap(previous_chunk, current_chunk):
    # Chunks are built with overlapping words, so drop the longest
    # prefix of this chunk that already appears in the previous one.
    words = current_chunk.split()
    for k in range(len(words), 0, -1):
        if ' '.join(words[:k]) in previous_chunk:
            return ' '.join(words[k:])
    return current_chunk

def get_last_speaker(chunk):
    matches = re.findall(r'^([A-Za-z\s.]+):', chunk, re.MULTILINE)
    return matches[-1] if matches else None

def process_transcript(transcript):
    chunk_size = 1500  # words per chunk sent to the model
    overlap = 300  # words shared between consecutive chunks for context
    words = transcript.split()
    # Step by chunk_size - overlap and reach back by `overlap` words so
    # each chunk repeats the tail of the previous one.
    chunks = [' '.join(words[max(0, i-overlap):i+chunk_size]) for i in range(0, len(words), chunk_size-overlap)]

    divided_chunks = []
    previous_chunk = ""
    last_speaker = None
    for i, chunk in enumerate(chunks):
        print(f"Processing chunk {i+1} of {len(chunks)}...")
        divided_chunk = divide_transcript_chunk(chunk, last_speaker)
        divided_chunk = clean_divided_chunk(divided_chunk)

        if previous_chunk:
            divided_chunk = remove_overlap(previous_chunk, divided_chunk)

        divided_chunks.append(divided_chunk)
        previous_chunk = divided_chunk
        last_speaker = get_last_speaker(divided_chunk)

    return '\n\n'.join(divided_chunks)

def verify_transcript(original, divided):
    # Compares unique word sets, so it catches dropped or invented
    # vocabulary but not changes in word frequency or order.
    original_words = original.lower().split()
    divided_words = divided.lower().split()

    missing_words = set(original_words) - set(divided_words)
    extra_words = set(divided_words) - set(original_words)

    print(f"Verification Summary:")
    print(f"Original word count: {len(original_words)}")
    print(f"Divided word count: {len(divided_words)}")
    print(f"Words potentially missing: {len(missing_words)}")
    print(f"Words potentially added: {len(extra_words)}")

    if missing_words:
        print("\nSome missing words:", list(missing_words)[:20])
    if extra_words:
        print("\nSome extra words:", list(extra_words)[:20])

    return len(missing_words) == 0 and len(extra_words) == 0

# Read the transcript from a file
with open('transcript.txt', 'r') as file:
    transcript = file.read()

divided_transcript = process_transcript(transcript)

is_verified = verify_transcript(transcript, divided_transcript)

with open('divided_transcript.txt', 'w') as file:
    file.write(divided_transcript)

print("Transcript division complete. Results written to 'divided_transcript_5.txt'.")
if is_verified:
    print("Verification passed: All content accounted for.")
else:
    print("Verification note: Some differences detected. Check the divided transcript for details.")

Attached is a diarization using this script over the transcript I found here

divided_transcript_7.txt

mlissner commented 1 month ago

That's pretty great! I just reached out to AssemblyAI to see if we can forge a partnership there. It's wild times.

legaltextai commented 1 month ago

I hope it works out. If it does not, maybe re-running Whisper with diarization options like this one or this one could be an option. My offer to use my server with GPU still stands. PS. If you end up using AssemblyAI, think of those future use cases you mentioned and see what's the best output format you'd like Assembly to produce, and prompt accordingly. For example, you may decide a JSON output would be best for your needs. PPS. If you want, I can work with AssemblyAI on the next steps to get it done.
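
For reference, a rough sketch of that Whisper-plus-diarization route using the whisperx package, following the flow in its README; the model names, and the Hugging Face token required by the underlying pyannote diarization model, are assumptions:

import whisperx

device = "cuda"  # or "cpu"
audio = whisperx.load_audio("gov.uscourts.ca9.23-55690.2024-07-12.mp3")

# 1. Transcribe with Whisper.
model = whisperx.load_model("large-v2", device)
result = model.transcribe(audio)

# 2. Align words to timestamps for the detected language.
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# 3. Diarize and attach speaker labels to each segment.
diarize_model = whisperx.DiarizationPipeline(use_auth_token="hf_...", device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

for segment in result["segments"]:
    print(segment.get("speaker", "UNKNOWN"), segment["text"])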

legaltextai commented 1 month ago

That's pretty great! I just reached out to AssemblyAI to see if we can forge a partnership there. It's wild times.

You meant Assembly, not Anthropic, right?

mlissner commented 1 month ago

I did mean Assembly, and I said Assembly? :)

legaltextai commented 1 month ago

Here is another prototype that incorporates embedding, vector storage, semantic search, reranking, and analysis by GPT-4o, based on all cases from the 2023 term.
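
A compact skeleton of that retrieval flow (the embedding model, in-memory vector store, and prompt are all stand-ins for the prototype's actual stack, and reranking is reduced to cosine similarity here):

import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(text):
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

# In-memory stand-in for a real vector store such as pgvector.
chunks = ["...transcript chunk 1...", "...transcript chunk 2..."]
vectors = np.array([embed(c) for c in chunks])

def search(query, k=2):
    q = embed(query)
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

question = "What were the main issues discussed?"
context = "\n\n".join(search(question))
answer = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": f"{context}\n\nQuestion: {question}"}],
)
print(answer.choices[0].message.content)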

mlissner commented 1 month ago

That gave me a 404?

legaltextai commented 1 month ago

Sorry, try this one: https://helpmefindcase.com/supreme_court_transcripts

mlissner commented 1 month ago

[screenshot]

legaltextai commented 1 month ago
[screenshot]

You got there as I was changing a couple of things :-)

mlissner commented 1 month ago

That is pretty fun. Impressive how you can crank these out!