Closed josh-chamberlain closed 9 months ago
@josh-chamberlain how frequently should new urls be cached as a default? Should all urls be added that aren't currently being cached?
@mbodeantor default of `weekly` seems right based on my experience, or `monthly` if we want to be more conservative. Some sources have a lower retention rate than that, but it's an OK default.
Yes, unless there's some reason we should specifically avoid caching them.
After going through the code in depth, I'm pretty sure we're already caching everything in the data_sources table that has a source_url. @kalenluciano do you agree?
Since it seems like that is the case, I'm going to just fill in any blank update_frequency fields with "weekly".
@mbodeantor you'll notice there's no standardization to that field, because we didn't know what we were gonna get and some of the update_frequency selections would be, like, "every time there's an officer-involved use of force". LMK if you think we could make a set of options which would cover it; maybe for those we could use `incident-based` or `event-driven`.
@josh-chamberlain Yeah I think some standardization would be a great idea. Obviously there is some variation in capitalization but would love your thoughts on the rest of those. Would be great to standardize around one label for each number of days or yeah "incident-based"
```
{
    "As new shootings occur": 30,
    "quarterly": 91,
    "Quarterly": 45,
    "<5 Minutes": 1,
    "Monthly": 30,
    "annually": 365,
    "daily": 1,
    "Nightly": 1,
    "BiAnnually": 182,
    "About weekly at least": 7,
    "<2 Weeks": 14,
    "Hourly": 1,
    "Daily": 1,
    "At least once per week": 7,
    "semi-annually": 365,
    "Weekly": 7,
    "weekly or more often": 7,
    "Annually": 365,
    "weekly": 7,
    "Irregularly every few months upon complaint or request.": 121,
    "monthly": 30,
    "Live": 1
}
```
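For discussion, here is a rough sketch of how the free-text values above might be collapsed into one label per interval. The label set, the keyword rules, and the `normalize_frequency` name are all assumptions for illustration, not the project's actual dropdown options or code. Blank values fall back to "weekly", matching the default discussed earlier.

```python
import re

# Hypothetical keyword rules, checked in order; the first match wins.
# The labels on the right are illustrative, not the real dropdown options.
RULES = [
    (r"occur|incident|complaint|request|irregular", "incident-based"),
    (r"minute|hour|live", "hourly"),
    (r"daily|nightly|day", "daily"),
    (r"week", "weekly"),
    (r"month", "monthly"),
    (r"quarter", "quarterly"),
    (r"semi|bi-?annual", "semi-annually"),  # must come before the "annual" rule
    (r"annual|year", "annually"),
]


def normalize_frequency(raw, default="weekly"):
    """Map a free-text update_frequency value to a single standardized label."""
    text = (raw or "").strip().lower()
    if not text:
        # Blank or missing values get the default.
        return default
    for pattern, label in RULES:
        if re.search(pattern, text):
            return label
    # Anything unrecognized also falls back to the default.
    return default
```

With rules like these, "Quarterly" and "quarterly" land on the same label, so the differing day counts in the dict above would need to be reconciled to one value per label. A value like "<2 Weeks" would collapse to "weekly" here, which archives more often than strictly needed.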
OK @mbodeantor, `update_frequency` and `retention_schedule` are both selects. I tightened up the options and updated the docs. We could use both of these to sort of infer an `archive_schedule`, so maybe we should just have users/the data source ID pipeline populate that field directly and ditch these two.
For incident-based, we could use something like klaxon to monitor.
Cool, I'll update the update_frequency column to reflect the new dropdown
Oh you already did it lol
@josh-chamberlain So currently this script is set to run monthly. I think it makes sense to run it hourly and then just ignore the majority until it actually needs to be updated. Looks like we only have one source labelled < Hourly right now.
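To make the "run hourly, skip most sources" idea concrete, here is a minimal sketch of the per-source check. The `is_due` function, the `INTERVALS` table, and the labels in it are hypothetical; the real script and column values may differ.

```python
from datetime import datetime, timedelta

# Hypothetical mapping from a standardized update_frequency label to an interval.
INTERVALS = {
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(days=7),
    "monthly": timedelta(days=30),
    "quarterly": timedelta(days=91),
    "annually": timedelta(days=365),
}

# Assumed fallback for labels that have no entry in the table above.
DEFAULT_INTERVAL = timedelta(days=7)


def is_due(last_cached, update_frequency, now):
    """Return True if a source should be archived again on this run."""
    if last_cached is None:
        # Never archived before, so archive it now.
        return True
    interval = INTERVALS.get(update_frequency, DEFAULT_INTERVAL)
    return now - last_cached >= interval
```

An hourly job could call `is_due` for each row and only archive the sources that return True, so most rows are skipped on most runs.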
@josh-chamberlain Only seeing a couple incident-based ones rn. Maybe we just have these archive weekly for now?
totally fine for now @mbodeantor, but let's keep klaxon in mind for the future (I have it saved here)
Requires closing #3