Closed Shak2000 closed 2 years ago
Merging #206 (da1c0a1) into main (1e5b807) will increase coverage by
0.21%
. The diff coverage is100.00%
.
@@ Coverage Diff @@
## main #206 +/- ##
==========================================
+ Coverage 72.29% 72.51% +0.21%
==========================================
Files 64 64
Lines 3404 3416 +12
==========================================
+ Hits 2461 2477 +16
+ Misses 943 939 -4
Impacted Files | Coverage Δ | |
---|---|---|
cdp_backend/sr_models/webvtt_sr_model.py | 100.00% <100.00%> (ø) |
|
...dp_backend/tests/sr_models/test_webvtt_sr_model.py | 100.00% <100.00%> (ø) |
|
cdp_backend/tests/conftest.py | 100.00% <0.00%> (ø) |
|
cdp_backend/tests/utils/test_file_utils.py | 100.00% <0.00%> (ø) |
|
cdp_backend/utils/file_utils.py | 94.82% <0.00%> (+2.51%) |
:arrow_up: |
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.
@Shak2000 , given this comment from To, I think this PR can change. i.e., It's possible the change should be to just update the regex pattern defined in WebVTTSRModel.__init__()
.
Agree with @dphoria, can you try adding the optional regex and see if it works?
I think to fix the mentioned issue, you'd need to do more than patch the code by removing the >>
in the normalization step.
Currently, the code is supposed to remove any >>
(new speaker turn pattern) from the final transcript. See https://github.com/CouncilDataProject/cdp-backend/blob/main/cdp_backend/sr_models/webvtt_sr_model.py#L87.
There is a clear bug that the code is not finding all the >>
in the vtt file because some are present in the Oakland transcript.
You'd need to investigate what the captions
look like for an Oakland vtt file
# Read the caption file
captions = webvtt.read(file_uri)
And then investigate how the captions
are processed in _get_speaker_turns
.
@tohuynh Good catch! After doing some more investigating, it seems that there is already a configuration for each CDP instance to define caption_new_speaker_turn_pattern
. I tested it locally, and all we need seems to be to configure it as >
. I will drop this pull request and create a pull request for Oregon.
Link to Relevant Issue
This pull request resolves #203
Description of Changes
Removing
>>
from the transcripts