CouncilDataProject / cdp-backend

Data storage utilities and processing pipelines used by CDP instances.
https://councildataproject.org/cdp-backend
Mozilla Public License 2.0
22 stars, 26 forks

feature/matter-text-extraction #231

Closed (sagarrat7 closed 1 year ago)

sagarrat7 commented 1 year ago

Link to Relevant Issue

Related to #81

Description of Changes

Adds a utility function, parse_document(), that extracts text from docx, doc, pdf, and pptx matters so that matters can be indexed in addition to transcripts. Note: the text extracted from pptx files contains an extra "Title" and "/docProps/thumbnail.jpeg". These can be removed if needed.
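
For readers without the diff at hand, here is a minimal sketch of how extension-based dispatch for such a function might look. It is not the code from this PR: the helper names (_local_name, _xml_text) and constants are hypothetical, only the standard library is used, and doc and pdf are left unimplemented because they need a dedicated parser. It relies on docx and pptx files being zip archives of XML parts, and reads only the slide parts of a pptx, which is one way to avoid picking up entries such as "/docProps/thumbnail.jpeg".

```python
import re
import zipfile
from pathlib import Path
from xml.etree import ElementTree

# Hypothetical constants for this sketch; the PR's parse_document() may differ.
_DOCX_BODY = "word/document.xml"
_SLIDE_PATTERN = re.compile(r"ppt/slides/slide(\d+)\.xml")


def _local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def _xml_text(xml_bytes: bytes) -> str:
    """Join the text runs of each paragraph; one output line per paragraph."""
    root = ElementTree.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter():
        # Paragraphs are w:p in docx and a:p in pptx; runs of text are w:t / a:t.
        if _local_name(para.tag) != "p":
            continue
        text = "".join(
            el.text or "" for el in para.iter() if _local_name(el.tag) == "t"
        ).strip()
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)


def parse_document(path) -> str:
    """Return plain text for a matter file, dispatching on its extension."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".docx":
        with zipfile.ZipFile(path) as archive:
            return _xml_text(archive.read(_DOCX_BODY))
    if suffix == ".pptx":
        with zipfile.ZipFile(path) as archive:
            # Only slide parts are read, so metadata and thumbnails are skipped.
            slides = [n for n in archive.namelist() if _SLIDE_PATTERN.fullmatch(n)]
            slides.sort(key=lambda n: int(_SLIDE_PATTERN.fullmatch(n).group(1)))
            texts = [_xml_text(archive.read(n)) for n in slides]
        return "\n".join(t for t in texts if t)
    if suffix in {".doc", ".pdf"}:
        # Assumption: these formats need a third-party parser, omitted here.
        raise NotImplementedError(f"no parser wired up for {suffix} in this sketch")
    raise ValueError(f"unsupported matter file type: {suffix!r}")
```

Calling parse_document("agenda.docx") (a hypothetical file name) would return the document's paragraphs separated by newlines. Sorting slides by their numeric suffix keeps slide10 after slide2, which plain string sorting would not.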

codecov[bot] commented 1 year ago

Codecov Report

Merging #231 (757e3f8) into main (703d7f1) will increase coverage by 0.65%. The diff coverage is 97.75%.

@@            Coverage Diff             @@
##             main     #231      +/-   ##
==========================================
+ Coverage   72.12%   72.78%   +0.65%     
==========================================
  Files          50       50              
  Lines        3376     3465      +89     
==========================================
+ Hits         2435     2522      +87     
- Misses        941      943       +2     
Impacted Files                                Coverage   Diff coverage   Δ
cdp_backend/tests/utils/test_file_utils.py    98.88%     92.85%          -1.12%
cdp_backend/tests/conftest.py                 100.00%    100.00%         ø
cdp_backend/utils/file_utils.py               92.36%     100.00%         +1.88%
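
As a sanity check, the headline percentages in the report can be recomputed from its Hits and Lines rows. The sketch below does so under two assumptions inferred from the numbers rather than taken from Codecov's documentation: percentages are truncated to two decimals instead of rounded, and the 97.75% diff coverage corresponds to the +87 hits over the +89 added lines. The function name coverage_pct is hypothetical.

```python
from decimal import Decimal, ROUND_DOWN

TWO_PLACES = Decimal("0.01")


def coverage_pct(hits: int, lines: int) -> Decimal:
    """Unrounded coverage percentage: hits / lines * 100."""
    return Decimal(hits) * 100 / Decimal(lines)


def truncate(value: Decimal) -> Decimal:
    """Cut to two decimals without rounding up (assumed reporting behaviour)."""
    return value.quantize(TWO_PLACES, rounding=ROUND_DOWN)


base = coverage_pct(2435, 3376)  # main (703d7f1)
head = coverage_pct(2522, 3465)  # #231 (757e3f8)
diff = coverage_pct(2522 - 2435, 3465 - 3376)  # 87 new hits over 89 new lines

print(truncate(base))         # 72.12
print(truncate(head))         # 72.78
print(truncate(head - base))  # 0.65
print(truncate(diff))         # 97.75
```

Note that subtracting the two truncated figures would give 0.66, so the reported +0.65% only matches when the change is computed from the unrounded values first.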