CouncilDataProject / cdp-backend

Data storage utilities and processing pipelines used by CDP instances.
https://councildataproject.org/cdp-backend
Mozilla Public License 2.0

Filter out bad caption files #213

Closed dphoria closed 2 years ago

dphoria commented 2 years ago

Feature Description

{"generator": "CDP WebVTT Conversion -- CDP v3.2.3", "confidence": 0.97, "session_datetime": "2022-09-14T10:00:00-04:00", "created_datetime": "2022-09-17T15:51:32.537666", "sentences": [{"index": 0, "confidence": 0.97, "start_time": 0.0, "end_time": 0.0, "words": [], "text": "", "speaker_index": 0, "speaker_name": null, "annotations": null}], "annotations": null}

Note the end_time of 0.0 in the above output, which comes from caption processing for a Pittsburgh, PA event. It turns out that the event does have a caption file in the Legistar data structure, but the file is in fact empty.

We want to filter these out and just attempt speech-to-text instead.

Use Case

Allow for proper transcript generation in generate_transcript() by appropriately filtering out empty caption files.

Solution

Compare the lengths of the video and the caption file. If they differ by more than some threshold, e.g. 20%, throw away the caption file. This means the scraper can no longer just hand off the caption file URL as-is.
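A rough sketch of that length comparison follows. The function names, the way the video duration is obtained, and the default threshold are all assumptions made for illustration, not existing cdp-backend API. The transcript dict is assumed to have the same shape as the JSON output shown above.

```python
from typing import Any, Dict


def transcript_end_time(transcript: Dict[str, Any]) -> float:
    """Return the latest sentence end_time in seconds, or 0.0 if there are none.

    Hypothetical helper: assumes the transcript dict has a "sentences" list
    whose items carry an "end_time" key, as in the sample output above.
    """
    sentences = transcript.get("sentences") or []
    return max((s.get("end_time", 0.0) for s in sentences), default=0.0)


def caption_matches_video(
    caption_end_seconds: float,
    video_duration_seconds: float,
    max_relative_diff: float = 0.2,  # the "e.g. 20%" threshold; an assumed default
) -> bool:
    """Return True if the caption length is within the threshold of the video length."""
    if video_duration_seconds <= 0:
        raise ValueError("video_duration_seconds must be positive")
    relative_diff = (
        abs(video_duration_seconds - caption_end_seconds) / video_duration_seconds
    )
    return relative_diff <= max_relative_diff


# The empty Pittsburgh caption file ends at 0.0 seconds, so against an
# (assumed) one-hour video it is rejected and speech-to-text would be used.
empty_transcript = {"sentences": [{"start_time": 0.0, "end_time": 0.0, "text": ""}]}
use_captions = caption_matches_video(transcript_end_time(empty_transcript), 3600.0)
```

Measuring the difference relative to the video duration keeps the threshold meaningful for both short and long meetings. How the video duration is read is left out here, since it depends on where in the pipeline the check would run.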

Alternatives