CouncilDataProject / cdp-backend

Data storage utilities and processing pipelines used by CDP instances.
https://councildataproject.org/cdp-backend
Mozilla Public License 2.0

Allow ability to flag Events as "try to scrape again next time" and do so on the next CRON run #212

Open smai-f opened 2 years ago

smai-f commented 2 years ago

Feature Description

A way to flag single events as always needing to be scraped again the next time the event gather CRON action runs, regardless of the datetime parameters passed in.

Use Case

This builds on our need to process only part of a video. We know that someone manually adds the timestamps that mark the relevant range of the video a few days after the hearings are uploaded to the legislature website. It's quite likely that when we scrape on our CRON schedule, the timestamps won't be there yet.

If I understand correctly, if the timestamps don't exist our options would be to:

Solution

Since the scraper runs over a datetime range and doesn't really work at the level of a single event, the MVP could be: if any event within a datetime range is flagged as needing a retry, get_events is called again for that same datetime range on the next run.
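To make the MVP concrete, here is a rough sketch of how the range-level retry might work. Everything in it is an assumption for illustration, not existing cdp-backend behavior: `run_gather`, `needs_retry`, and the JSON state file are hypothetical names, and `get_events` is passed in as a plain callable taking a start and end datetime.

```python
import json
from datetime import datetime
from pathlib import Path
from typing import Any, Callable, List, Tuple

DateRange = Tuple[datetime, datetime]


def load_pending(state_path: Path) -> List[DateRange]:
    """Read the datetime ranges flagged for retry on a previous run."""
    if not state_path.exists():
        return []
    raw = json.loads(state_path.read_text())
    return [(datetime.fromisoformat(a), datetime.fromisoformat(b)) for a, b in raw]


def save_pending(state_path: Path, ranges: List[DateRange]) -> None:
    """Persist the ranges that still need a retry (hypothetical state file)."""
    state_path.write_text(json.dumps([[a.isoformat(), b.isoformat()] for a, b in ranges]))


def run_gather(
    get_events: Callable[[datetime, datetime], List[Any]],
    needs_retry: Callable[[Any], bool],
    from_dt: datetime,
    to_dt: datetime,
    state_path: Path,
) -> List[Any]:
    """Scrape the requested range plus any ranges flagged on earlier runs."""
    # Previously flagged ranges first, then the current one, without duplicates.
    ranges = list(dict.fromkeys(load_pending(state_path) + [(from_dt, to_dt)]))
    ready: List[Any] = []
    still_pending: List[DateRange] = []
    for start, end in ranges:
        events = get_events(start, end)
        waiting = [e for e in events if needs_retry(e)]
        ready.extend(e for e in events if not needs_retry(e))
        if waiting:
            # At least one event is not ready yet, so revisit the whole range.
            still_pending.append((start, end))
    save_pending(state_path, still_pending)
    return ready
```

One consequence of retrying a whole range is that events that were already ready get returned again on the retry, so whatever consumes the result would need to tolerate seeing the same event twice.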

Ideally, though, it wouldn't have to re-scrape a whole datetime range. Instead, we could provide a function/lambda/callback to run for just the event that needs to be revisited on the next run, and it would keep running on each subsequent run until the conditions for ingesting the video are met.
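A per-event version might look something like the sketch below. Since each CRON run is a separate process, a lambda itself can't be carried over between runs, so this sketch stores only an event key and hands it to a callback supplied at run time. The names `flag_for_retry`, `revisit_pending`, and the `revisit` callback are all hypothetical, as is the JSON state file.

```python
import json
from pathlib import Path
from typing import Any, Callable, List, Optional


def _load_keys(state_path: Path) -> List[str]:
    """Read the keys of events flagged for a revisit, if any."""
    return json.loads(state_path.read_text()) if state_path.exists() else []


def flag_for_retry(state_path: Path, event_key: str) -> None:
    """Record that a single event should be revisited on the next run."""
    keys = _load_keys(state_path)
    if event_key not in keys:
        keys.append(event_key)
    state_path.write_text(json.dumps(keys))


def revisit_pending(
    state_path: Path,
    revisit: Callable[[str], Optional[Any]],
) -> List[Any]:
    """Run the callback for each flagged event.

    The callback returns the event once it is ready to ingest,
    or None if it should stay flagged for the following run.
    """
    ready: List[Any] = []
    still_flagged: List[str] = []
    for key in _load_keys(state_path):
        event = revisit(key)
        if event is None:
            still_flagged.append(key)
        else:
            ready.append(event)
    state_path.write_text(json.dumps(still_flagged))
    return ready
```

In this shape, the scraper would call `flag_for_retry` when it finds a video without timestamps, and the next run would call `revisit_pending` before (or instead of) scraping the regular datetime range.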

Alternatives

As mentioned above, we can work around this by ignoring the datetime parameters passed to get_events, scraping everything every time, and skipping videos that don't have timestamps yet. Scraping everything every time probably won't scale as the number of bills and hearings grows, particularly because our legislature site was built in the 90s and frequently hits random errors.

Another idea: programmatically scrape only the videos within the datetime range that already have timestamps, then manually re-run event gather for that datetime range a few days/weeks after the videos were added to try again for the timestamps. Also not ideal, but it could be a backup.