CouncilDataProject / cdp-backend

Data storage utilities and processing pipelines used by CDP instances.
https://councildataproject.org/cdp-backend
Mozilla Public License 2.0

bugfix/clean-pictograms-from-transcripts-before-indexing #165

Closed evamaxfield closed 2 years ago

evamaxfield commented 2 years ago

Link to Relevant Issue

This pull request resolves #151

Description of Changes

Finally found time to fix this one. Can't believe the bug either.

Looks like a transcript from Seattle has a pictogram / emoticon in it... link -- start at 30:09 -- or search for

This fixes the pipeline by adding a function to clean all common pictograms / emojis from the sentence before stemming and fuzzy matching for context spans.
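For readers who want a concrete picture of the cleaning step described above, here is a minimal sketch. It assumes a regex over a handful of common emoji and pictograph Unicode blocks; the function name `remove_pictograms`, the pattern name, and the exact ranges are hypothetical illustrations, not necessarily what this PR added to `cdp_backend/utils/string_utils.py`.

```python
import re

# Hypothetical pattern covering common emoji / pictograph blocks.
# The ranges are illustrative and not exhaustive.
_PICTOGRAM_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # miscellaneous symbols and pictographs
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flags)
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\U00002600-\U000026FF"  # miscellaneous symbols
    "\U00002700-\U000027BF"  # dingbats
    "\U0000FE0F"  # variation selector-16
    "\U0000200D"  # zero width joiner
    "]+"
)


def remove_pictograms(text: str) -> str:
    """Strip common pictograms / emojis from a sentence (hypothetical helper)."""
    # Replace each run of pictograms with a space so neighboring words stay apart.
    stripped = _PICTOGRAM_PATTERN.sub(" ", text)
    # Collapse the leftover whitespace into single spaces.
    return " ".join(stripped.split())
```

Under these assumptions, a sentence would pass through a helper like this before stemming and fuzzy matching, so the stemmed tokens and the context-span search both operate on text with no pictograms in it.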

Tested by running the pipeline and storing the index locally: run_cdp_event_index -n 1 --store_local --parallel ../configs-and-special-events/seattle.json

codecov[bot] commented 2 years ago

Codecov Report

Merging #165 (b5babdc) into main (ede007f) will decrease coverage by 0.07%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #165      +/-   ##
==========================================
- Coverage   94.56%   94.49%   -0.08%     
==========================================
  Files          50       50              
  Lines        2558     2560       +2     
==========================================
  Hits         2419     2419              
- Misses        139      141       +2     
Impacted Files                                   Coverage Δ
cdp_backend/pipeline/event_index_pipeline.py     85.71% <ø> (ø)
cdp_backend/tests/utils/test_string_utils.py     100.00% <100.00%> (ø)
cdp_backend/utils/string_utils.py                81.39% <100.00%> (-3.98%)

evamaxfield commented 2 years ago

Looks good to me! Honestly pretty surprised that Google speech-to-text generated an emoji haha

I think this transcript is from a converted closed caption. Which makes a bit more sense 😂