Closed: evamaxfield closed this pull request 1 year ago.
Merging #228 (b5f1027) into main (92b0484) will decrease coverage by 2.14%. The diff coverage is 95.00%.
```diff
@@           Coverage Diff            @@
##            main     #228    +/-   ##
==========================================
- Coverage   72.68%   70.55%   -2.14%
==========================================
  Files          64       62       -2
  Lines        3581     3298     -283
==========================================
- Hits         2603     2327     -276
+ Misses        978      971       -7
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| cdp_backend/tests/utils/test_file_utils.py | 100.00% <ø> (ø) | |
| cdp_backend/utils/file_utils.py | 88.29% <ø> (+0.34%) | :arrow_up: |
| cdp_backend/pipeline/event_gather_pipeline.py | 82.63% <80.00%> (-2.12%) | :arrow_down: |
| ...ckend/tests/pipeline/test_event_gather_pipeline.py | 98.31% <80.00%> (-0.86%) | :arrow_down: |
| cdp_backend/sr_models/whisper.py | 97.72% <97.72%> (ø) | |
| cdp_backend/pipeline/pipeline_config.py | 100.00% <100.00%> (ø) | |
| cdp_backend/sr_models/__init__.py | 100.00% <100.00%> (ø) | |
| cdp_backend/tests/sr_models/test_whisper.py | 100.00% <100.00%> (ø) | |
Note: it looks like this PR increases the code line count, but that is really just a JSON file bloating the number; there is a lot of deleted code in here :tada:
While I have high, high confidence that Eva has done a great job, this one sounds like I need to be especially attentive. So I'm going to wait until Friday/weekend to really look through this, if this is acceptable.
Totally fine! I have other changes to make to the cookiecutter in the meantime, and I need to run a full test on my personal staging instance.
I'll also look at this sometime over the long weekend!
This is great! From everything you've told me, it seems like Whisper is the way to go over Google Speech-to-Text.
I had a few questions for clarification:
I answered @isaacna's questions in Slack DM but here are the answers publicly
I had a few questions for clarification:
1. Whisper is entirely free (for now), right? If the transcription itself is free, where does the cost in the spreadsheet for the various Whisper model sizes come from? Is it just data storage, or the cost of the compute itself? I'm just not sure where the model-size-dependent cost comes from if Whisper is free and we're running the event gather pipeline on the VM provided by GitHub Actions.
There is a really cool new framework called CML (Continuous Machine Learning) from Iterative: https://github.com/iterative/cml
It lets you use GitHub Actions to spin up a self-hosted GitHub Actions runner on AWS / GCP / Azure, run your job like a normal GitHub Action, and then automatically tear the instance down for you. So the cost is just the GCP instance, BUT you get the logs and all the niceties of GH Actions, and we get a GPU, which we will generally want in the future too. Example: https://github.com/evamaxfield/gcloud-whisper-testing/blob/main/.github/workflows/runner.yml#L70
The deploy-runner-gcp-gpu job creates a GCP compute instance, loads the GitHub Actions self-hosted runner software onto it, makes all the VPC connections, and handles everything else required. Then the test-gcp-t4 job runs on that instance like a normal GitHub Action, so we get access to more compute, more RAM, more storage, and a GPU, while the overall cost is lower because Google Speech-to-Text has such a high overhead vs running your own infra + model.
2. This depends on 1, but yeah, I'm curious what other cloud providers' costs would be. If we're running transcription on GCP, are we using Cloud Functions? Could be an interesting cost comparison with AWS Lambda.
It spins up a full machine, so GCP Compute Engine or AWS EC2. Lambdas / Cloud Functions aren't used: the benefit of a lambda is much less spin-up time, but the benefit of this approach is auto-create, auto-teardown, and more resources. We generally have to pay the cost of roughly 7–10 minutes of spin-up time for the instance, which is somewhat negligible imo.
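To make the "spin-up time is negligible" point concrete, here is a back-of-the-envelope sketch. The hourly rate, spin-up time, and job length are all placeholder numbers I picked for illustration, not quoted GCP prices or measured timings:

```python
# All numbers below are hypothetical placeholders, not real GCP pricing.
HOURLY_RATE_USD = 0.40   # GPU instance price per hour (placeholder)
SPINUP_MINUTES = 10      # approximate runner provisioning time (placeholder)
JOB_MINUTES = 120        # e.g. transcribing a two-hour meeting (placeholder)

# The instance is billed for spin-up plus the actual job.
billed_hours = (SPINUP_MINUTES + JOB_MINUTES) / 60
total_cost = billed_hours * HOURLY_RATE_USD
# Fraction of the bill that is pure provisioning overhead.
spinup_share = SPINUP_MINUTES / (SPINUP_MINUTES + JOB_MINUTES)
print(f"total: ${total_cost:.2f}, spin-up share: {spinup_share:.0%}")
```

With these placeholders the spin-up adds under a dime and well under 10% of the bill, which is why the slower cold start vs a lambda doesn't really matter here.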
3. Would it be worth leaving the WebVTT model in as an option if it's significantly cheaper?
I think it's an option we could leave in for cost-saving reasons, but the quality of the Whisper medium and large models is generally better than closed captions imo. What I'm planning to do is write a script to get the average duration of meetings per day across a few different cities and use those as baselines for cost. I expect to see < $10/month even with Whisper, though.
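The planned baseline script might look roughly like this sketch. The city names, per-day meeting minutes, realtime factor, and hourly rate are all made-up placeholders, not measured data; the real script would pull actual meeting durations:

```python
# Invented placeholder data: average meeting minutes per day, per city.
meeting_minutes_per_day = {"seattle": 40, "portland": 25, "oakland": 15}

HOURLY_RATE_USD = 0.40    # placeholder GPU instance price per hour
REALTIME_FACTOR = 1.0     # assumption: Whisper on a GPU runs ~1x realtime
DAYS_PER_MONTH = 30

# Monthly cost per city: GPU-hours needed times the instance rate.
monthly_cost = {
    city: minutes / 60 * REALTIME_FACTOR * DAYS_PER_MONTH * HOURLY_RATE_USD
    for city, minutes in meeting_minutes_per_day.items()
}
for city, cost in monthly_cost.items():
    print(f"{city}: ~${cost:.2f}/month")
```

With these placeholder inputs every city lands under the $10/month expectation, but the point of the real script is to replace the guesses with measured durations.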
Link to Relevant Issue
Not related but resolves: #200 (lol)
Description of Changes
This is a large PR that touches the core of the pipeline. The main goal is implementing Whisper as our transcription engine. I ripped out WebVTT, Google Cloud Speech-to-Text, etc. Whisper has a lower word error rate than both (or at least from my glancing at hundreds of transcripts).
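For reviewers unfamiliar with openai-whisper, the core call that the new `cdp_backend/sr_models/whisper.py` is built around looks roughly like this. This is a hedged sketch only: the function name, default model size, and deferred import are my illustration, not the PR's actual code:

```python
from typing import Any


def transcribe_file(path: str, model_size: str = "medium") -> dict[str, Any]:
    """Transcribe an audio/video file with openai-whisper (sketch only).

    The import is deferred so this module can be loaded without the
    heavy `openai-whisper` dependency installed.
    """
    import whisper  # requires `pip install openai-whisper`

    model = whisper.load_model(model_size)  # downloads weights on first use
    # Returns a dict with "text" (full transcript), "segments"
    # (timestamped chunks), and detected "language".
    return model.transcribe(path)
```

Model size trades accuracy for compute, which is exactly the cost-vs-quality knob discussed in the spreadsheet conversation above.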
Note: because the diff in `event_gather_pipeline.py` is so large, GitHub auto-collapses it; be sure to open it and review those changes.

Things I would love to hear feedback on:
ADDITIONALLY: this PR includes the license change to Apache v2.