Closed · evamaxfield closed this pull request 2 years ago
Merging #150 (7c76417) into main (7b1efc3) will increase coverage by 0.00%. The diff coverage is 100.00%.
```diff
@@           Coverage Diff           @@
##             main     #150   +/-   ##
=======================================
  Coverage   94.82%   94.83%
=======================================
  Files          50       50
  Lines        2532     2534    +2
=======================================
+ Hits         2401     2403    +2
  Misses        131      131
```
Impacted Files | Coverage Δ
---|---
cdp_backend/sr_models/google_cloud_sr_model.py | 98.66% <100.00%> (+0.03%) :arrow_up:
...kend/tests/sr_models/test_google_cloud_sr_model.py | 100.00% <100.00%> (ø)
Continue to review the full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data.
Powered by Codecov. Last update 7b1efc3...7c76417.
Big if true
Yo. Yoooo. YOOOOOOO...
The speech adaptation + phone call configuration is a massive upgrade. This is great @JacksonMaxfield !!!! We should come up with a baseline quality metric and then work from that each time (can discuss more off thread).
Also, do you think it's worth adding the class tokens `$OOV_CLASS_ALPHANUMERIC_SEQUENCE` and `$OOV_CLASS_TEMPERATURE`?
The second test I ran included ALPHANUMERIC, but it actually made certain parts worse. The temperature... eh, I don't know. I think we leave it out because it's very rare, even in a bill about climate, for anyone to be discussing exact temperatures.
Very neat / awesome to physically see the changes / improvements between the results you listed in the OP. :raised_hands:
> The speech adaptation + phone call configuration is a massive upgrade. This is great @JacksonMaxfield !!!! We should come up with a baseline quality metric and then work from that each time (can discuss more off thread).
@nniiicc A part of me wants to say we should simply run this upgraded model against the Seattle closed-caption-generated transcripts and do a text diff? Basically, "how close does the speech-to-text model get to mirroring the gov-created closed captions?"
Edit: errr, clarification: we could simply use the closed caption files as "ground truth" and compute word error rate.
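A minimal sketch of that idea, assuming plain-text inputs: treat the closed caption file as ground truth and score the generated transcript with word-level edit distance. The whitespace tokenization and variable names below are illustrative, not anything from cdp-backend; a real comparison would also normalize punctuation, casing, etc.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance normalized by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("Reference text is empty.")

    # Standard Levenshtein DP over words
    # (substitutions, insertions, deletions).
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution
            )

    return dist[len(ref)][len(hyp)] / len(ref)


# e.g. word_error_rate(closed_caption_text, gsr_transcript_text)
```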
Great stuff here @JacksonMaxfield. The King County text looks much cleaner! Those "Custom Classes" they mention under the tokens look curious to me. May be able to utilize those for this funny legislation language at some point in the model.
Description of Changes
While this is a general improvement, I will credit the push for this work to @ArthurSmid for noticing that our transcription in King County was quite poor. Specifically, the transcription on the land acknowledgement was atrocious.
Unlike Seattle, King County doesn't publish closed caption files for us to convert to our transcript format, and as such that instance was using Google Speech-to-Text (Google Speech Recognition or GSR) for transcription.
Our original configuration for GSR had served us decently well, but with this push I figured it was time to look at ways to improve it.
PR Changes
The most basic change is to the model selection itself. We now use the enhanced (`"video"`) model for speech-to-text. Generally this costs more, but if we turn on data logging (where Google gets to keep the audio file for their own datasets) the cost is nullified and returns to our normal amount. So for us, this means we basically get a free upgrade since our data is already public. More info on the upgraded model here.

The next, finer-detail change is the improvement to our speech adaptation / model adaptation. We currently provide event metadata to the model object, such as people's names, bill abstracts, and more, which definitely helps, but one of the things I have been noticing is that our transcripts fail at place names (street addresses, etc.), dollar amounts, reporting ordinals (percents), and more. This adds class tokens that specifically attempt to solve those problems! More info on class tokens here.
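For reference, a rough sketch of what this sort of configuration might look like with the google-cloud-speech client. The encoding/sample-rate settings, the specific class tokens, and the example phrases are illustrative assumptions, not the exact values in this PR's diff.

```python
# Hedged sketch of enhanced-model + speech adaptation configuration.
# The exact tokens and phrases used by cdp-backend may differ.
from google.cloud import speech

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    # Enhanced "video" model; with data logging enabled on the project,
    # this is billed at the standard rate.
    use_enhanced=True,
    model="video",
    speech_contexts=[
        speech.SpeechContext(
            phrases=[
                # Class tokens targeting the failure modes called out above.
                "$ADDRESSNUM",  # street addresses
                "$MONEY",       # dollar amounts
                "$PERCENT",     # percents / reporting ordinals
                # ...plus event metadata such as people's names and bill
                # abstracts (the phrase below is a made-up example).
                "Council Bill 2020-1038",
            ],
        ),
    ],
)
```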
Finally, I am simply improving the model metadata: changing the interaction type from discussion to phone call. Google specifically cites that "videos of discussions" or "conference calls" should use "phone call" instead of "discussion." Basically, we should never have been using discussion, even when meetings were in-person. "Discussion" means everyone is in the same room, recorded by the same mic; it would be like the two of us having a meeting at a coffee shop and me simply recording us talking.
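And a sketch of that interaction type change, under the same assumptions as the snippet above:

```python
from google.cloud import speech

# Tag the recording as a phone-call-style interaction (conference calls,
# videos of discussions) rather than an in-person discussion.
metadata = speech.RecognitionMetadata(
    interaction_type=speech.RecognitionMetadata.InteractionType.PHONE_CALL,
)

# Attached to the RecognitionConfig from the previous sketch:
# config.metadata = metadata
```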
Results
I made a dev deployment for myself that I will likely use for storing experiments like this in the future. I chose a meeting from King County that had noticeably bad transcription as the baseline. Full details here: https://github.com/JacksonMaxfield/cdp-dev/tree/main/speech-recognition-config-tests
Baseline transcript: https://jacksonmaxfield.github.io/cdp-dev/#/events/1126b685f94d (note: the minutes item on that event is incorrect; I had a bug in my event details generator that overwrote the minutes of that event the next time I ran it. It is truthfully the baseline, `cdp-backend==3.0.2`.)
Comments: noticeably bad transcription on the land acknowledgement and, further down when they start getting into the discussion on bills and such, bad transcription on things like "pages X to Y." But overall it is just missing some words and has some oddities throughout.
Basic Upgrades: https://jacksonmaxfield.github.io/cdp-dev/#/events/6f15f3db0b19 (note: this minutes item has the correct commit for this test; you can see how it overwrote the prior one because the minutes item name is the same)
Comments: This includes a massive upgrade. The model, the adaptation, and the metadata interaction type were all upgraded in this test. It was hard (impossible, really) to test them all independently because apparently certain class tokens only work with the enhanced models anyway. There are drastic improvements over the base, but there are also now weird alphanumeric sequences introduced into the transcript, likely because I ran this test with the alphanumeric sequence class token enabled; I didn't expect it to take over that much.
Same Massive Upgrades - Remove Alphanumeric Class: https://jacksonmaxfield.github.io/cdp-dev/#/events/38fa2d6e0603 (note: this commit link is correct; I still had a bug, but at least it created a new minutes item to track :joy:)
Comments: This is, imo, the best version of the transcript. There are still problems with people's names and with numeric sequences such as "bill 2020-1038," but, even after the next test, I still think this is the best.
Same Massive Upgrades - Replace $YEAR and $POSTCODE with $NUMERIC_SEQUENCE: https://jacksonmaxfield.github.io/cdp-dev/#/events/7d4212911c66 (note: yayyy, I finally figured out how to keep the commits / minutes items intact)
Comments: This is basically a test to see if we can fix the above bill reference / number problems, but unfortunately I now see slightly more errors, because any time someone says "{number} to {number}" (this could be "item 5 to 9" or "pages 10 to 20" or similar), it makes them a single sequence like "items 529" or "pages 10220". And regardless, the bill reference / number problem still isn't fixed entirely, because the TRUE bill number includes a hyphen in the middle and this doesn't capture that. So I rolled this commit back.
Summary
Further Changes Needed
The only change that needs to happen outside of this repo is in the cookiecutter setup processes (both the GitHub bot and the manual deployment steps): I need to make a PR that informs people they need to turn on data logging.
On the King County side, I won't be reprocessing the November-to-today data with these new additions; it's just too much money. But moving forward we should see an immediate benefit.
Also included in this PR is a very minor change to dev infrastructure management that makes it "safer" by requiring a key for which infrastructure to clean rather than simply defaulting to the last created infrastructure.
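Purely as a hypothetical illustration of the "require a key" behavior (the script, argument, and function names below are invented, not from the repo):

```python
# Hypothetical sketch: the clean-up entry point now requires an explicit
# infrastructure key instead of defaulting to the most recently created
# deployment. All names here are illustrative.
import argparse


def clean_infrastructure(infrastructure_slug: str) -> None:
    # Tear down only the named deployment; never guess.
    print(f"Cleaning infrastructure: {infrastructure_slug}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Tear down a dev deployment.")
    parser.add_argument(
        "infrastructure_slug",
        help="Required key identifying which infrastructure to clean.",
    )
    args = parser.parse_args()
    clean_infrastructure(args.infrastructure_slug)
```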