Filter out bad caption files

Feature Description

{"generator": "CDP WebVTT Conversion -- CDP v3.2.3", "confidence": 0.97, "session_datetime": "2022-09-14T10:00:00-04:00", "created_datetime": "2022-09-17T15:51:32.537666", "sentences": [{"index": 0, "confidence": 0.97, "start_time": 0.0, "end_time": 0.0, "words": [], "text": "", "speaker_index": 0, "speaker_name": null, "annotations": null}], "annotations": null}

Note the end_time of 0.0 in the above output, which comes from processing captions for a Pittsburgh, PA event. It turns out that the event does have a caption file in its Legistar data structure, but the file is in fact empty.

We want to filter these out and just attempt speech-to-text instead.

Use Case

Allow for proper transcript generation in generate_transcript() by appropriately filtering out empty caption files.

Solution

Compare the duration of the video with the duration covered by the caption file. If they differ by more than some threshold, e.g. 20%, throw away the caption file. This means the scraper can no longer just hand off the caption file URL as-is.
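A minimal sketch of that comparison, assuming the caption file is WebVTT and that the video duration in seconds is already known to the caller. The function names and the max_relative_diff parameter are hypothetical and not part of cdp-backend; the caption duration is approximated by the latest cue end time.

```python
import re

# Matches WebVTT cue timing lines such as "00:01:02.500 --> 00:01:05.000".
# The hours component is optional in WebVTT timestamps.
_CUE_TIMING = re.compile(
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})\s+-->\s+"
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})"
)


def _to_seconds(hours, minutes, seconds, millis):
    """Convert regex timestamp groups (hours may be None) to seconds."""
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def caption_duration_seconds(vtt_text):
    """Return the latest cue end time in the text, or 0.0 if it has no cues."""
    end_times = [_to_seconds(*m.groups()[4:]) for m in _CUE_TIMING.finditer(vtt_text)]
    return max(end_times, default=0.0)


def captions_cover_video(vtt_text, video_duration_seconds, max_relative_diff=0.2):
    """Return True if the caption duration is within the threshold of the video duration.

    max_relative_diff=0.2 mirrors the 20% example threshold; it is an
    illustrative default, not a tuned value.
    """
    if video_duration_seconds <= 0:
        return False
    caption_seconds = caption_duration_seconds(vtt_text)
    diff = abs(video_duration_seconds - caption_seconds) / video_duration_seconds
    return diff <= max_relative_diff
```

With this shape, an empty caption file has a caption duration of 0.0, so it fails the check for any video of positive length and the pipeline can fall back to speech-to-text.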

Alternatives

Throw away caption files whose file size is below some threshold, e.g. 100 bytes.
Throw away caption files that cover less than ~1 minute.
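
Both alternatives could be sketched roughly as follows. The helper names and default thresholds are hypothetical; the transcript dictionary is assumed to have the same shape as the JSON output shown in the Feature Description (a "sentences" list whose items carry an "end_time" in seconds).

```python
def is_caption_file_too_small(caption_bytes, min_bytes=100):
    """Alternative 1: reject caption files smaller than min_bytes."""
    return len(caption_bytes) < min_bytes


def transcript_end_seconds(transcript):
    """Return the latest sentence end_time in a converted transcript, or 0.0."""
    sentences = transcript.get("sentences") or []
    return max((s["end_time"] for s in sentences), default=0.0)


def is_transcript_too_short(transcript, min_seconds=60.0):
    """Alternative 2: reject transcripts that cover less than min_seconds."""
    return transcript_end_seconds(transcript) < min_seconds


# Abbreviated version of the empty transcript from the Feature Description.
empty_transcript = {
    "sentences": [
        {"index": 0, "start_time": 0.0, "end_time": 0.0, "words": [], "text": ""}
    ]
}
should_fall_back_to_speech_to_text = is_transcript_too_short(empty_transcript)
```

The size check is cheap because it needs only the raw file, while the duration check runs after conversion and also catches files that are large but contain no usable cues.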

CouncilDataProject / cdp-backend