Police-Data-Accessibility-Project / automatic-archives

MIT License

alter workflow to automatically add URLs to the active cache list when new entries appear in the database #4

Closed josh-chamberlain closed 9 months ago

josh-chamberlain commented 1 year ago

Requires closing #3

mbodeantor commented 9 months ago

@josh-chamberlain how frequently should new URLs be cached by default? Should we add every URL that isn't currently being cached?

josh-chamberlain commented 9 months ago

@mbodeantor a default of weekly seems right based on my experience, or monthly if we want to be more conservative. Some sources have a shorter retention period than that, but it's an OK default.

Yes, unless there's some reason we should specifically avoid caching them.

mbodeantor commented 9 months ago

After going through the code in depth, I'm pretty sure we're already caching everything in the data_sources table that has a source_url. @kalenluciano do you agree?

Since it seems like that is the case, I'm going to just fill in any blank update_frequency fields with "weekly".

josh-chamberlain commented 9 months ago

@mbodeantor you'll notice there's no standardization to that field, because we didn't know what we were gonna get, and some of the update_frequency selections would be, like, "every time there's an officer-involved use of force". LMK if you think we could make a set of options which would cover it. Maybe for those we could use incident-based or event-driven.

mbodeantor commented 9 months ago

@josh-chamberlain Yeah I think some standardization would be a great idea. Obviously there is some variation in capitalization but would love your thoughts on the rest of those. Would be great to standardize around one label for each number of days or yeah "incident-based"

```
{
    "As new shootings occur": 30,
    "quarterly": 91,
    "Quarterly": 45,
    "<5 Minutes": 1,
    "Monthly": 30,
    "annually": 365,
    "daily": 1,
    "Nightly": 1,
    "BiAnnually": 182,
    "About weekly at least": 7,
    "<2 Weeks": 14,
    "Hourly": 1,
    "Daily": 1,
    "At least once per week": 7,
    "semi-annually": 365,
    "Weekly": 7,
    "weekly or more often": 7,
    "Annually": 365,
    "weekly": 7,
    "Irregularly every few months upon complaint or request.": 121,
    "monthly": 30,
    "Live": 1
}
```

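As a minimal sketch of how the capitalization variants could be surfaced before picking one label per interval: the helper below groups labels that differ only by case and reports the groups whose day counts disagree. `find_case_conflicts` and the `sample` dict are hypothetical and for illustration only; they are not taken from the repo.

```python
from collections import defaultdict
from typing import Dict


def find_case_conflicts(freq_days: Dict[str, int]) -> Dict[str, Dict[str, int]]:
    """Return case-insensitive label groups whose day counts disagree."""
    groups: Dict[str, Dict[str, int]] = defaultdict(dict)
    for label, days in freq_days.items():
        # Labels that differ only by capitalization or padding share a key.
        groups[label.strip().casefold()][label] = days
    # Keep only groups where the variants map to different day counts.
    return {
        key: variants
        for key, variants in groups.items()
        if len(set(variants.values())) > 1
    }


# A small subset of the mapping above, used purely as sample input.
sample = {
    "quarterly": 91,
    "Quarterly": 45,
    "Monthly": 30,
    "monthly": 30,
    "Weekly": 7,
    "weekly": 7,
}
conflicts = find_case_conflicts(sample)
```

On that sample, only the "quarterly" group is flagged, since its two spellings map to 91 and 45 days. Labels that mean the same thing but are worded differently (for example "Nightly" and "Daily") would still need a manual pass.
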
josh-chamberlain commented 9 months ago

OK @mbodeantor, update_frequency and retention_schedule are both selects. I tightened up the options and updated the docs. We could use both of these to sort of infer an archive_schedule, so maybe we should just have users/the data source ID pipeline populate that field directly and ditch these two.

For incident-based, we could use something like klaxon to monitor.

mbodeantor commented 9 months ago

Cool, I'll update the update_frequency column to reflect the new dropdown

mbodeantor commented 9 months ago

Oh you already did it lol

mbodeantor commented 9 months ago

@josh-chamberlain So currently this script is set to run monthly. I think it makes sense to run it hourly and then just ignore the majority until it actually needs to be updated. Looks like we only have one source labelled < Hourly right now.
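
A rough sketch of the hourly-run idea: the job wakes up every hour and skips any source whose interval has not yet elapsed. `is_due`, `INTERVAL_DAYS`, and the `last_cached` timestamp are assumptions for illustration, not names from the repo, and the weekly fallback follows the default discussed earlier in the thread.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical label-to-interval table in days; the real dropdown options may differ.
INTERVAL_DAYS = {"daily": 1, "weekly": 7, "monthly": 30, "quarterly": 91, "annually": 365}

# Assumption: blank or unrecognized labels fall back to weekly.
DEFAULT_INTERVAL_DAYS = 7


def is_due(update_frequency: Optional[str],
           last_cached: Optional[datetime],
           now: datetime) -> bool:
    """Return True when a source should be archived on this run."""
    if last_cached is None:
        # Never archived before, so archive it now.
        return True
    key = (update_frequency or "").strip().lower()
    days = INTERVAL_DAYS.get(key, DEFAULT_INTERVAL_DAYS)
    return now - last_cached >= timedelta(days=days)


# Example check at a fixed point in time.
now = datetime(2024, 1, 8)
weekly_due = is_due("Weekly", datetime(2024, 1, 1), now)
monthly_due = is_due("monthly", datetime(2024, 1, 1), now)
```

With this shape, the hourly schedule only controls how often the check runs; each source's own interval decides whether anything is actually archived.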

mbodeantor commented 9 months ago

@josh-chamberlain Only seeing a couple incident-based ones rn. Maybe we just have these archive weekly for now?

josh-chamberlain commented 9 months ago

totally fine for now @mbodeantor, but let's keep klaxon in mind for the future (I have it saved here)