codeforboston / maple

MAPLE makes it easy for anyone to view and submit testimony to the Massachusetts Legislature about the bills that will shape our future.
https://mapletestimony.org
MIT License

Speech to text - Hearings #539

Open mvictor55 opened 2 years ago

mvictor55 commented 2 years ago

We should try to get transcripts of the hearings - such as https://malegislature.gov/Events/Hearings/Detail/4292

I see two ways to do this: 1) run our own speech-to-text software on the video files, or 2) somehow scrape the closed captions from the hearing video streams.

I hope we can add a "hearings" section to a bill page that would allow a user to view the hearing and its transcript.

Ideally, we'd be able to break down the hearing into its key parts, as is done here: https://sg001-harmony.sliq.net/00329/Harmony/en/PowerBrowser/PowerBrowserV2/20220607/4/1913#agenda_

There are a lot of legislature websites structured in this format. I think they're leveraging "Sliq Media - Harmony" software: https://www.sliq.com/northDakotaCaseStudy.html

From the Mitre session: see 19:10 in the video, and again at 59:30: https://drive.google.com/file/d/1D3olHFsxxk1PZhQTfus_cja_cUrI9qeS/view?usp=share_link

UPDATE 10/24/23: This ticket was created a year ago. LLM tech now seems to provide much better solutions.

alexjball commented 2 years ago

I looked into Vosk, open-source speech-to-text software, and Tesseract, open-source OCR (image-to-text) software. Using Vosk, we can generate a subtitle track for the hearing videos, and we can OCR individual frames with Tesseract. Both are quite inaccurate though, and I'm not sure they would add value as an accessibility tool.
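For reference, here is a rough sketch of how word-level timings could be turned into a subtitle (SRT) track. It assumes the recognizer emits a list of words with `word`, `start`, and `end` fields in seconds, which is roughly the shape of Vosk's per-word output when word timing is enabled. `words_to_srt`, `format_timestamp`, and `max_words` are hypothetical names, and the sample data is made up.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words=7):
    """Group word timings into numbered SRT cues of at most max_words words."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start = format_timestamp(chunk[0]["start"])
        end = format_timestamp(chunk[-1]["end"])
        text = " ".join(w["word"] for w in chunk)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    # Each cue ends with a newline, so joining with "\n" leaves a blank line between cues.
    return "\n".join(cues)


# Made-up sample in the assumed word-timing format.
sample_words = [
    {"word": "the", "start": 0.0, "end": 0.21},
    {"word": "committee", "start": 0.21, "end": 0.8},
    {"word": "will", "start": 0.8, "end": 1.0},
    {"word": "come", "start": 1.0, "end": 1.25},
    {"word": "to", "start": 1.25, "end": 1.4},
    {"word": "order", "start": 1.4, "end": 1.9},
]

print(words_to_srt(sample_words, max_words=3))
```

Splitting on a fixed word count is the simplest option; splitting on pauses between words would probably read better but needs tuning.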

Google has a Video Intelligence API that we could use to detect text and transcribe audio in video. It provides 1,000 free minutes per month; after that it would cost $0.20/min. I haven't had a chance to test it yet.
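To get a feel for what that pricing might mean, here is a back-of-the-envelope estimate. It uses only the figures quoted above (1,000 free minutes, then $0.20/min) and assumes a single flat rate, which may not match how the API actually bills per feature. The hearing volume and the name `monthly_cost` are made up for illustration.

```python
def monthly_cost(minutes_processed, free_minutes=1000, rate_per_minute=0.20):
    """Estimate the monthly bill in dollars, given a free tier and a flat per-minute rate."""
    billable = max(0, minutes_processed - free_minutes)
    return round(billable * rate_per_minute, 2)


# Hypothetical volume: 25 hearings a month at 2 hours each.
minutes = 25 * 120
print(monthly_cost(minutes))
```

Anything under the free tier comes out to zero, so a small pilot on a handful of hearings would likely cost nothing under these assumptions.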

We could potentially use the auto-transcription for searching hearings. Users would search for keywords and we would return hearings and timecodes where they appear.
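A minimal sketch of that search, assuming the transcription step gives us timestamped text segments per hearing. The `Segment` shape, the hearing IDs, and the sample text are all hypothetical; a real version would probably sit behind a proper search index rather than scan in memory.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    hearing_id: str  # hypothetical identifier for a hearing
    start: float     # seconds from the start of the video
    text: str        # auto-transcribed text for this span


def to_timecode(seconds: float) -> str:
    """Format seconds as HH:MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def search_segments(segments, keyword):
    """Return (hearing_id, timecode) pairs for segments containing keyword, ignoring case."""
    needle = keyword.lower()
    return [
        (seg.hearing_id, to_timecode(seg.start))
        for seg in segments
        if needle in seg.text.lower()
    ]


# Made-up transcript data for illustration.
segments = [
    Segment("hearing-a", 75.0, "Testimony on the housing bill begins"),
    Segment("hearing-a", 410.5, "Questions from the committee"),
    Segment("hearing-b", 1150.0, "Further discussion of housing costs"),
]

print(search_segments(segments, "Housing"))
```

This is plain substring matching, so it tolerates none of the transcription errors mentioned above; fuzzy matching would be worth considering if the transcripts are noisy.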

The ideal solution would be to get the transcription text straight from the state house.

https://alphacephei.com/vosk/
https://github.com/tesseract-ocr/tesseract