CouncilDataProject / cdp-backend

Data storage utilities and processing pipelines used by CDP instances.
https://councildataproject.org/cdp-backend
Mozilla Public License 2.0

feature/improved-gsr #150

Closed evamaxfield closed 2 years ago

evamaxfield commented 2 years ago

Description of Changes


While this is a general improvement, I will credit the push for this work to @ArthurSmid for noticing that our transcription in King County was quite poor. Specifically, the transcription on the land acknowledgement was atrocious.

Unlike Seattle, King County doesn't publish closed caption files for us to convert to our transcript format, so that instance was using Google Speech-to-Text (Google Speech Recognition, or GSR) for transcription.

Our original GSR configuration had served us decently well, but with this push I figured it was time to look at ways to improve it.
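
For a rough sense of what the upgraded configuration involves (the discussion below centers on speech adaptation plus the enhanced phone_call model, applied in cdp_backend/sr_models/google_cloud_sr_model.py), here is a minimal sketch using the google-cloud-speech client. The phrases, boost value, bucket URI, and other settings are illustrative assumptions, not the exact values merged in this PR:

```python
# Sketch only: roughly what enabling speech adaptation and the enhanced
# "phone_call" model looks like with the google-cloud-speech client.
# Phrases, boost value, and other settings are illustrative assumptions,
# not the exact configuration merged in this PR.
from google.cloud import speech_v1p1beta1 as speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    language_code="en-US",
    # Enhanced models require data logging to be enabled on the project
    # (see "Further Changes Needed" below).
    use_enhanced=True,
    model="phone_call",
    enable_automatic_punctuation=True,
    enable_word_time_offsets=True,
    # Speech adaptation: bias recognition toward domain phrases.
    speech_contexts=[
        speech.SpeechContext(
            phrases=[
                "land acknowledgement",
                "King County Council",
                "councilmember",
            ],
            boost=15.0,  # illustrative boost value
        )
    ],
)

# Hypothetical audio location for illustration only.
audio = speech.RecognitionAudio(uri="gs://example-bucket/meeting-audio.wav")
operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=10800)
```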

PR Changes

Results

I made a dev deployment for myself that I will likely use for storing experiments like this in the future. I chose a meeting from King County that had noticeably bad transcription as the baseline. Full details here: https://github.com/JacksonMaxfield/cdp-dev/tree/main/speech-recognition-config-tests

Summary

Further Changes Needed

The only change that needs to happen outside of this repo is in the cookiecutter setup processes (both the GitHub bot and the manual deployment steps): I need to make a PR that informs people they need to turn on data logging.

On the King County side, I won't be reprocessing the November-to-today data with these new additions; it's just too much money. But moving forward we should see an immediate benefit.


Also included in this PR is a very minor change to dev infrastructure management that makes it "safer" by requiring a key for which infrastructure to clean rather than simply defaulting to the last created infrastructure.
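
As a hypothetical sketch of that guard (the repo's actual dev-infrastructure tooling and argument names may differ), the idea is simply that "clean" now takes an explicit key:

```python
# Hypothetical sketch: require an explicit infrastructure key for "clean"
# rather than defaulting to the most recently created deployment.
# Argument names and behavior here are illustrative, not the repo's actual CLI.
import argparse


def main() -> None:
    parser = argparse.ArgumentParser(description="Manage dev infrastructure.")
    subparsers = parser.add_subparsers(dest="command", required=True)

    clean = subparsers.add_parser("clean", help="Tear down a dev deployment.")
    # Required positional argument: no silent default to "last created".
    clean.add_argument(
        "infrastructure_key",
        help="The key of the specific infrastructure to clean.",
    )

    args = parser.parse_args()
    if args.command == "clean":
        print(f"Cleaning infrastructure: {args.infrastructure_key}")
        # ... call the actual teardown routine here ...


if __name__ == "__main__":
    main()
```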

codecov[bot] commented 2 years ago

Codecov Report

Merging #150 (7c76417) into main (7b1efc3) will increase coverage by 0.00%. The diff coverage is 100.00%.


```diff
@@           Coverage Diff           @@
##             main     #150   +/-   ##
=======================================
  Coverage   94.82%   94.83%
=======================================
  Files          50       50
  Lines        2532     2534    +2
=======================================
+ Hits         2401     2403    +2
  Misses        131      131
```
| Impacted Files | Coverage Δ |
| --- | --- |
| cdp_backend/sr_models/google_cloud_sr_model.py | 98.66% <100.00%> (+0.03%) ⬆️ |
| ...kend/tests/sr_models/test_google_cloud_sr_model.py | 100.00% <100.00%> (ø) |


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data.

nniiicc commented 2 years ago

Big if true

nniiicc commented 2 years ago

Yo. Yoooo. YOOOOOOO...

the speech adaptation + phone call configuration is a massive upgrade. This is great @JacksonMaxfield !!!! We should come up with a baseline quality metric and then work from that each time (can discuss more off thread).

evamaxfield commented 2 years ago

> Also, do you think it's worth adding the class tokens $OOV_CLASS_ALPHANUMERIC_SEQUENCE and $OOV_CLASS_TEMPERATURE?

The second test I ran included ALPHANUMERIC but it actually made certain parts worse. The temperature... eh, I don't know. I think we leave it out because it's very rare, even in a bill about climate, for anyone to be discussing exact temperatures.
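
For reference, class tokens go in the same speech adaptation phrase list as ordinary phrase hints; a minimal sketch of the mechanics (whether to include them at all is exactly the open question above):

```python
# Sketch: class tokens are passed alongside regular adaptation phrases.
# Including them is the open question here; this only shows the mechanics.
from google.cloud import speech_v1p1beta1 as speech

speech_context = speech.SpeechContext(
    phrases=[
        "King County Council",                # ordinary phrase hint
        "$OOV_CLASS_ALPHANUMERIC_SEQUENCE",   # e.g. ordinance/bill identifiers
        # "$OOV_CLASS_TEMPERATURE",           # likely left out per the note above
    ]
)
```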

dphoria commented 2 years ago

Very neat / awesome to physically see the changes / improvements between the results you listed in the OP. 🙌

evamaxfield commented 2 years ago

> the speech adaptation + phone call configuration is a massive upgrade. This is great @JacksonMaxfield !!!! We should come up with a baseline quality metric and then work from that each time (can discuss more off thread).

@nniiicc A part of me wants to say we should simply run this upgraded model against the Seattle closed-caption-generated transcripts and do a text diff? Basically, "how close does the speech-to-text model get to mirroring the 'gov-created closed captions'?"

Edit: err, clarification: we could simply use the closed caption files as "ground truth" and compute word error rate.
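
A minimal sketch of that word error rate computation against the closed-caption "ground truth" (plain Levenshtein distance over word tokens; the lowercase/whitespace normalization here is an assumption, and in practice a package like jiwer offers the same with more normalization options):

```python
# Sketch: word error rate (WER) of a generated transcript against a
# closed-caption "ground truth" transcript. Plain edit distance over words;
# the simple lowercase/whitespace normalization is an assumption.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # Levenshtein distance over word tokens (rolling two-row DP).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution
            )
        prev = curr

    return prev[len(hyp)] / max(len(ref), 1)


# Example usage: compare a GSR transcript to the closed caption file's text.
# wer = word_error_rate(closed_caption_text, gsr_transcript_text)
```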

kristopher-smith commented 2 years ago

Great stuff here @JacksonMaxfield. The King County text looks much cleaner! Those "Custom Classes" they mention alongside the class tokens look interesting to me. We may be able to utilize those for this funny legislation language at some point in the model.