Note end_time in the above output from a Pittsburgh, PA captions processing run. It turns out that the event does have a caption file in its Legistar data structure, but the file is in fact empty.
We want to filter these out and just attempt speech-to-text instead.
Use Case
Allow for proper transcript generation in generate_transcript() by appropriately filtering out empty caption files.
Solution
Compare the duration of the video with the duration covered by the caption file. If they differ by more than some threshold, e.g. 20%, throw away the caption file.
This means the scraper can no longer just hand off the caption file URL as-is.
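A minimal sketch of the duration comparison, assuming the caption file is WebVTT and the video duration in seconds is already known from elsewhere. The names `caption_end_seconds` and `captions_cover_video` are hypothetical and not part of the existing code base; the 20% default mirrors the example threshold above.

```python
import re

# Matches WebVTT cue timings such as "00:01:02.500 --> 00:01:05.000".
# The hours field is optional in WebVTT, so it is optional here too.
_CUE_TIMING = re.compile(
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})\s+-->\s+(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})"
)


def _to_seconds(hours, minutes, seconds, millis):
    """Convert the captured timestamp fields to seconds."""
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def caption_end_seconds(vtt_text):
    """Return the latest cue end time in seconds, or 0.0 if there are no cues."""
    ends = [_to_seconds(*m.groups()[4:]) for m in _CUE_TIMING.finditer(vtt_text)]
    return max(ends, default=0.0)


def captions_cover_video(vtt_text, video_duration_seconds, threshold=0.2):
    """Return True if the caption end time is within `threshold` of the video duration."""
    if video_duration_seconds <= 0:
        return False
    caption_end = caption_end_seconds(vtt_text)
    relative_gap = abs(video_duration_seconds - caption_end) / video_duration_seconds
    return relative_gap <= threshold
```

With a check like this, an empty caption file yields an end time of 0.0, so the relative gap is 100% and the file is rejected in favor of speech-to-text.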
Alternatives
Throw away caption files with file size less than some threshold, e.g. 100 bytes.
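As a rough sketch, the size-based alternative could look like the following. The function name `caption_file_is_too_small` is hypothetical, and the 100-byte default is only the example threshold mentioned above; it assumes the caption file contents have already been downloaded as bytes.

```python
def caption_file_is_too_small(caption_bytes, min_size_bytes=100):
    """Return True if the caption payload is smaller than the size threshold."""
    return len(caption_bytes) < min_size_bytes
```

This is cheaper than comparing durations, but it would not catch a caption file that has content yet covers only a small part of the video.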