Feature Description
A way to flag single events as always needing to be scraped again the next time the event gather CRON action runs, regardless of the datetime parameters passed in.
Use Case
We have a further need on top of processing only part of a video. Someone manually adds the timestamps that mark the relevant part of each video a few days after the hearings are uploaded to the legislature website. It's quite conceivable that when we scrape on our CRON schedule, the timestamps won't be there yet.
If I understand correctly, if the timestamps don't exist, our options would be to:
fall back on scraping the whole video, which we don't want because the videos can be several hours long (dealbreaker)
skip the video and don't ingest it, but then the video would not get picked up again on the next CRON run (dealbreaker)
ignore the datetime range passed in, and scrape everything every time
Solution
Since the scraper runs on a datetime range and doesn't really work in units of a single event, the MVP could be this: if any event within a datetime range is flagged as needing a retry, get_events is called for that same datetime range again on the next run.
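To make the MVP concrete, here is a rough sketch of range-level retries. Everything in it is an assumption for illustration: `run_gather`, the `pending` set, and a scraper that returns a `(events, needs_retry)` pair are hypothetical, not the current event gather or get_events API, and storing `pending` between CRON runs is left out.

```python
from datetime import datetime
from typing import Callable, List, Set, Tuple

DateRange = Tuple[datetime, datetime]

# Hypothetical scraper shape: returns the ingestable events for a range,
# plus a flag saying whether any event in that range asked to be retried.
Scraper = Callable[[datetime, datetime], Tuple[List[dict], bool]]


def run_gather(
    scrape: Scraper,
    from_dt: datetime,
    to_dt: datetime,
    pending: Set[DateRange],
) -> Tuple[List[dict], Set[DateRange]]:
    """Scrape the scheduled range plus any ranges carried over from earlier runs."""
    events: List[dict] = []
    still_pending: Set[DateRange] = set()
    for start, end in sorted(pending | {(from_dt, to_dt)}):
        found, needs_retry = scrape(start, end)
        events.extend(found)
        if needs_retry:
            # Keep the whole range so it is scraped again on the next run.
            still_pending.add((start, end))
    return events, still_pending
```

The returned set would have to be saved somewhere that survives between runs (a file, a database record, etc.) and passed back in as `pending` the next time.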
Ideally, though, it wouldn't have to re-scrape a whole date range. Instead, we could provide a function/lambda/callback that runs for just the event that needs to be revisited on the next run, and it would keep running on each subsequent run until the conditions were met for the video to be ingested.
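A minimal sketch of the per-event idea, assuming a hypothetical `PendingEvent` record and `process_pending` helper (neither exists in the pipeline today). Each pending event carries a callback that re-scrapes only that event and returns `None` until the timestamps are available.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class PendingEvent:
    """Hypothetical record for an event that could not be ingested yet."""

    event_id: str
    # Re-scrapes only this event; returns the event once it is ready, else None.
    retry: Callable[[], Optional[dict]]


def process_pending(pending: Dict[str, PendingEvent]) -> List[dict]:
    """Run every pending callback and drop the events that are now ready."""
    ready: List[dict] = []
    for event_id in list(pending):
        event = pending[event_id].retry()
        if event is not None:
            ready.append(event)
            del pending[event_id]
    return ready
```

Since separate CRON runs don't share memory, a real version would probably need to persist something like an event identifier or URL and rebuild the callback from it, rather than storing the callable itself.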
Alternatives
As mentioned above, we can work around this by ignoring the datetime parameters passed to get_events, scraping everything every time, and skipping videos that do not have timestamps yet. Scraping everything every time will probably not be ideal as the number of bills and hearings grows, particularly because our legislature site was literally built in the 90s and hits random errors a lot.
Another idea is to programmatically scrape only the videos within the datetime range that already have timestamps, then come back and manually run event gather for the same datetime range a few days/weeks after the videos were added, to try again for the timestamps. Also not ideal, but it could be a backup.