CouncilDataProject / cdp-backend

Data storage utilities and processing pipelines used by CDP instances.
https://councildataproject.org/cdp-backend
Mozilla Public License 2.0

feature/whisper #228

Closed evamaxfield closed 1 year ago

evamaxfield commented 1 year ago

Link to Relevant Issue

Not related but resolves: #200 (lol)

Description of Changes

This is a large PR that touches the core of the pipeline. The main goal is implementing Whisper as our transcription engine. I ripped out webvtt, Google Cloud Speech-to-Text, etc. Whisper has a lower word error rate than both (or at least from my glancing at hundreds of transcripts).
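For reviewers who haven't used Whisper before, here is a rough sketch of what the new transcription path looks like. `whisper.load_model` / `model.transcribe` are the real openai-whisper entry points; the `Sentence` dataclass and `segments_to_sentences` helper below are illustrative only, not the actual `cdp_backend` classes.

```python
# Illustrative sketch, not the actual cdp_backend sr_models API.
# Whisper's transcribe() returns a dict with "text" and "segments",
# where each segment is a dict with "start", "end", and "text".
from dataclasses import dataclass


@dataclass
class Sentence:
    start: float  # seconds from the start of the audio
    end: float
    text: str


def segments_to_sentences(segments: list[dict]) -> list[Sentence]:
    """Map Whisper-style segment dicts to sentence objects, dropping empties."""
    return [
        Sentence(start=s["start"], end=s["end"], text=s["text"].strip())
        for s in segments
        if s["text"].strip()
    ]


# With the real model this would look roughly like:
#   import whisper
#   model = whisper.load_model("medium")
#   result = model.transcribe("meeting-audio.wav")
#   sentences = segments_to_sentences(result["segments"])
```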

Note: because the diff in event_gather_pipeline.py is so large, GitHub auto-collapses it; be sure to open it and review those changes.

Things I would love to hear feedback on:

ADDITIONALLY: this PR includes the license change to Apache v2.

codecov[bot] commented 1 year ago

Codecov Report

Merging #228 (b5f1027) into main (92b0484) will decrease coverage by 2.14%. The diff coverage is 95.00%.

@@            Coverage Diff             @@
##             main     #228      +/-   ##
==========================================
- Coverage   72.68%   70.55%   -2.14%     
==========================================
  Files          64       62       -2     
  Lines        3581     3298     -283     
==========================================
- Hits         2603     2327     -276     
+ Misses        978      971       -7     
| Impacted Files | Coverage Δ |
| --- | --- |
| cdp_backend/tests/utils/test_file_utils.py | 100.00% <ø> (ø) |
| cdp_backend/utils/file_utils.py | 88.29% <ø> (+0.34%) :arrow_up: |
| cdp_backend/pipeline/event_gather_pipeline.py | 82.63% <80.00%> (-2.12%) :arrow_down: |
| ...ckend/tests/pipeline/test_event_gather_pipeline.py | 98.31% <80.00%> (-0.86%) :arrow_down: |
| cdp_backend/sr_models/whisper.py | 97.72% <97.72%> (ø) |
| cdp_backend/pipeline/pipeline_config.py | 100.00% <100.00%> (ø) |
| cdp_backend/sr_models/__init__.py | 100.00% <100.00%> (ø) |
| cdp_backend/tests/sr_models/test_whisper.py | 100.00% <100.00%> (ø) |


evamaxfield commented 1 year ago

Note: it looks like this PR increases the code line count, but that is actually just a JSON file bloating the number; this is a lot of deleted code :tada:

dphoria commented 1 year ago

While I have high, high confidence that Eva has done a great job, this one sounds like I need to be especially attentive. So I'm going to wait until Friday/weekend to really look through this, if this is acceptable.

evamaxfield commented 1 year ago

> While I have high, high confidence that Eva has done a great job, this one sounds like I need to be especially attentive. So I'm going to wait until Friday/weekend to really look through this, if this is acceptable.

Totally fine! I have other changes to make to cookie cutter in the meantime and I need to run a full test on my personal staging instance.

isaacna commented 1 year ago

I'll also look at this sometime over the long weekend!

isaacna commented 1 year ago

This is great! From everything you've told me it seems like Whisper is the way to go over Google speech-to-text

I had a few questions for clarification:

  1. Whisper is entirely free (for now), right? If the transcription itself is free, where does the cost in the spreadsheet based on various Whisper model sizes come from? Are the costs just from data storage, or from the compute itself? I'm just not sure where the model-size-dependent cost comes from if Whisper is free and we're running the event gather pipeline on the VM provided by GitHub Actions.
  2. This depends on 1, but I'm curious what other cloud providers' cost would be. If we're running transcription on GCP, are we using Cloud Functions? Could be an interesting cost comparison with AWS Lambda.
  3. Would it be worth leaving the webvtt model in as an option if it's significantly cheaper?

evamaxfield commented 1 year ago

I answered @isaacna's questions in Slack DM but here are the answers publicly

> I had a few questions for clarification:
>
> 1. Whisper is entirely free (for now), right? If the transcription itself is free, where does the cost in the spreadsheet based on various Whisper model sizes come from? Are the costs just from data storage, or from the compute itself? I'm just not sure where the model-size-dependent cost comes from if Whisper is free and we're running the event gather pipeline on the VM provided by GitHub Actions.

There is a really cool new framework from Iterative called "CML" (Continuous Machine Learning): https://github.com/iterative/cml

It lets you use GitHub Actions to spin up a self-hosted GitHub Actions runner on AWS / GCP / Azure, run your stuff like a normal GitHub Action, and then it auto-tears the runner down for you. So the cost is the GCP instance, BUT you get the logs and all the niceties of GH Actions, and we get a GPU, which we will generally want in the future too. Example: https://github.com/evamaxfield/gcloud-whisper-testing/blob/main/.github/workflows/runner.yml#L70

The deploy-runner-gcp-gpu job creates a GCP compute instance, loads the GitHub Actions self-hosted runner software onto it, makes all the VPC connections, and does all the other required setup. Then the test-gcp-t4 job uses that compute instance like a normal GitHub Action running on the instance itself, so we have access to more space, more RAM, more storage, and a GPU. The overall cost is still lower because Google Speech-to-Text has such a high overhead vs running your own infra + model.
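For anyone who hasn't seen CML before, the two-job shape described above roughly looks like the following. This is a hedged sketch, not the exact file from the linked repo: the job names, machine type, region, secret name, and the final run command are all illustrative placeholders.

```yaml
# Illustrative CML-style workflow sketch (placeholders throughout).
name: transcribe
on: workflow_dispatch

jobs:
  deploy-runner:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: iterative/setup-cml@v1
      - name: Launch a self-hosted GPU runner on GCP
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
        run: |
          cml runner launch \
            --cloud=gcp \
            --cloud-region=us-west1 \
            --cloud-type=n1-standard-8 \
            --cloud-gpu=t4 \
            --labels=cml-gpu

  transcribe:
    needs: deploy-runner
    runs-on: [self-hosted, cml-gpu]  # runs on the instance launched above
    steps:
      - uses: actions/checkout@v3
      - run: pip install . && run_cdp_event_gather  # placeholder command
```

When the `transcribe` job finishes, CML tears the instance down automatically, which is what keeps the cost bounded to actual compute time plus spin-up.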

> 2. This depends on 1, but I'm curious what other cloud providers' cost would be. If we're running transcription on GCP, are we using Cloud Functions? Could be an interesting cost comparison with AWS Lambda.

It spins up a full machine, so GCP Compute Engine or AWS EC2. Lambdas / Cloud Functions aren't used: the benefit of a lambda is way less spin-up time, but the benefit of this approach is auto-create, auto-teardown, and more resources. You generally have to pay the cost of ~10 minutes of spin-up time for the instance, but 7 minutes is somewhat negligible imo.

> 3. Would it be worth leaving the webvtt model in as an option if it's significantly cheaper?

I think it is an option we could leave in for cost-saving reasons, but the quality of the Whisper medium and large models is generally better than closed captions imo. What I am planning on doing is writing a script to get the average duration of meetings per day across a few different cities and using those as baselines for cost. I expect to see < $10/month even with Whisper though.
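The cost baseline idea above is simple arithmetic; a sketch of it might look like the following. Every number here (instance rate, transcription speed, meeting hours, run cadence) is a hypothetical placeholder, not a measured value.

```python
# Back-of-the-envelope cost model for self-hosted Whisper transcription.
# All inputs are hypothetical placeholders, not measured rates.

def monthly_transcription_cost(
    meeting_hours_per_month: float,
    realtime_factor: float,          # hours of compute per hour of audio
    instance_rate_per_hour: float,   # $/hr for the GPU instance
    spin_up_minutes_per_run: float,  # instance creation overhead per run
    runs_per_month: int,
) -> float:
    """Estimate monthly cost: (compute time + spin-up overhead) * hourly rate."""
    compute_hours = meeting_hours_per_month * realtime_factor
    overhead_hours = runs_per_month * spin_up_minutes_per_run / 60
    return (compute_hours + overhead_hours) * instance_rate_per_hour


# e.g. ~10 hours of meetings/month, Whisper medium at roughly realtime
# (factor 1.0), a hypothetical $0.60/hr T4 instance, daily runs with
# ~10 minutes of spin-up each:
cost = monthly_transcription_cost(10, 1.0, 0.60, 10, 30)
print(f"${cost:.2f}/month")  # → $9.00/month
```

With these (made-up) inputs the estimate lands under the $10/month figure mentioned above; the planned meeting-duration script would replace the placeholder hours with real per-city numbers.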