CouncilDataProject / cdp-backend

Data storage utilities and processing pipelines used by CDP instances.
https://councildataproject.org/cdp-backend
Mozilla Public License 2.0
22 stars 26 forks source link

bugfix/webvtt-parsing-for-oakland #206

Closed Shak2000 closed 1 year ago

Shak2000 commented 2 years ago

Link to Relevant Issue

This pull request resolves #203

Description of Changes

Removing >> from the transcripts

codecov[bot] commented 2 years ago

Codecov Report

Merging #206 (da1c0a1) into main (1e5b807) will increase coverage by 0.21%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #206      +/-   ##
==========================================
+ Coverage   72.29%   72.51%   +0.21%     
==========================================
  Files          64       64              
  Lines        3404     3416      +12     
==========================================
+ Hits         2461     2477      +16     
+ Misses        943      939       -4     
Impacted Files Coverage Δ
cdp_backend/sr_models/webvtt_sr_model.py 100.00% <100.00%> (ø)
...dp_backend/tests/sr_models/test_webvtt_sr_model.py 100.00% <100.00%> (ø)
cdp_backend/tests/conftest.py 100.00% <0.00%> (ø)
cdp_backend/tests/utils/test_file_utils.py 100.00% <0.00%> (ø)
cdp_backend/utils/file_utils.py 94.82% <0.00%> (+2.51%) :arrow_up:

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

dphoria commented 2 years ago

@Shak2000 , given this comment from To, I think this PR can change. i.e., It's possible the change should be to just update the regex pattern defined in WebVTTSRModel.__init__().

evamaxfield commented 2 years ago

Agree with @dphoria, can you try adding the optional regex and see if it works?

tohuynh commented 2 years ago

I think to fix the mentioned issue, you'd need to do more than patch the code by removing the >> in the normalization step.

Currently, the code is supposed to remove any >> (new speaker turn pattern) from the final transcript. See https://github.com/CouncilDataProject/cdp-backend/blob/main/cdp_backend/sr_models/webvtt_sr_model.py#L87.

There is a clear bug that the code is not finding all the >> in the vtt file because some are present in the Oakland transcript.

You'd need to investigate what the captions look like for an Oakland vtt file

# Read the caption file
captions = webvtt.read(file_uri)

And then investigate how the captions are processed in _get_speaker_turns.

Shak2000 commented 1 year ago

@tohuynh Good catch! After doing some more investigating, it seems that there is already a configuration for each CDP instance to define caption_new_speaker_turn_pattern. I tested it locally, and all we need seems to be to configure it as >. I will drop this pull request and create a pull request for Oregon.