Police-Data-Accessibility-Project / automatic-archives


Use the Internet Archive API instead of savepagenow #5

Closed: josh-chamberlain closed this issue 9 months ago

josh-chamberlain commented 1 year ago

@drowninginflowers can you say a little more about why you put this in the original TODOs / why it would be beneficial to use the internet archive API instead of savepagenow?

drowninginflowers commented 1 year ago

savepagenow is pretty time intensive (sometimes as long as 3-5 minutes per HTTP request) and runs locally. This is a problem because, to my knowledge, GitHub Actions only allows a 3 hour activity window before it auto-kills the process, and there are far more than 60 sources (the maximum that could be executed in one action) in the database. savepagenow also offers little other functionality that helps with checking and archiving sources.

The Archive API, on the other hand, has useful features such as comparing a current source against its historical versions, which lets us check whether a source needs to be re-archived and, if people would find it useful, see how it has changed over time. Additionally, my basic understanding is that the API has an internal task scheduler, which lets us make all the archive requests in one GitHub Action, wait, then run a second action to check the status of all the archive requests. This splits the archiving process into two parts, most of which runs externally, so we can handle a much larger archive run without overrunning the GitHub Actions time limits.
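For reference, the two-phase flow described above would look roughly like this with the Internet Archive's SavePageNow 2 (SPN2) API. This is a minimal sketch rather than actual project code; the endpoint paths, response fields, and the `IA_ACCESS_KEY`/`IA_SECRET_KEY` environment variables are assumptions based on my reading of the public SPN2 documentation.

```python
import os

import requests

# Assumed credentials: SPN2 uses an archive.org S3-style access/secret key pair.
IA_ACCESS_KEY = os.environ["IA_ACCESS_KEY"]
IA_SECRET_KEY = os.environ["IA_SECRET_KEY"]

HEADERS = {
    "Accept": "application/json",
    "Authorization": f"LOW {IA_ACCESS_KEY}:{IA_SECRET_KEY}",
}


def submit_capture(url: str) -> str:
    """Phase 1: ask SPN2 to archive the URL; returns a job id to poll later."""
    response = requests.post(
        "https://web.archive.org/save",
        headers=HEADERS,
        data={"url": url},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["job_id"]


def check_capture(job_id: str) -> dict:
    """Phase 2 (run in a later action): check a previously submitted capture job."""
    response = requests.get(
        f"https://web.archive.org/save/status/{job_id}",
        headers=HEADERS,
        timeout=30,
    )
    response.raise_for_status()
    return response.json()  # e.g. {"status": "pending" | "success" | "error", ...}
```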

kalenluciano commented 9 months ago

@mbodeantor for this issue, I had to make changes to both the automatic-archives repo and the data-sources-app repo.

For the data-sources-app repo, I added two "archives" routes. The first returns all data sources that have a last_cached value of NULL or an update_frequency value that isn't NULL, and whose source_url isn't broken (tracked by a broken_source_url_as_of column that stays NULL unless the URL has proven to be broken). The second route updates the table after automatic-archives runs through and archives each data source, setting the last_cached value (and any broken_source_url_as_of values, if applicable). Both of these point to a test database (test_data_sources), since we'd need to add columns for last_cached and broken_source_url_as_of and I didn't want to mess with the original data_sources table until given the green light.
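For illustration, the selection logic behind the first route might look something like the sketch below. The table and column names come from the description above, but the exact SQL and the psycopg2-based access are assumptions, not the actual data-sources-app code.

```python
import psycopg2

# Illustrative query only: selects sources that still need (re-)archiving and
# whose URL hasn't been flagged as broken. The real route's SQL may differ.
ARCHIVES_QUERY = """
    SELECT source_url, update_frequency, last_cached
    FROM test_data_sources
    WHERE (last_cached IS NULL OR update_frequency IS NOT NULL)
      AND broken_source_url_as_of IS NULL;
"""


def get_sources_to_archive(connection_string: str) -> list[tuple]:
    """Return rows for data sources the archiver should process."""
    with psycopg2.connect(connection_string) as conn:
        with conn.cursor() as cursor:
            cursor.execute(ARCHIVES_QUERY)
            return cursor.fetchall()
```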

For automatic-archives, I consolidated all of the code into the cache_url.py script (removing url_cache_builder.py, since that was for extracting the data sources from the JSON file and we're trying to move toward using the Data Sources API). After consolidating, the main changes were pointing the script to the Data Sources API for the sources to archive and then using the Internet Archive API to do what we were already doing with savepagenow.
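To make the overall shape concrete, a consolidated archive pass could look roughly like this. The endpoint paths (`/archives`), payload fields, and environment variable names are hypothetical; only the general fetch, archive, and report-back flow reflects what's described above.

```python
import datetime
import os

import requests

# Hypothetical configuration for illustration; the real Data Sources API
# routes and auth scheme may be named differently.
DATA_SOURCES_API = os.environ.get("DATA_SOURCES_URL", "https://data-sources.pdap.io")
AUTH = {"Authorization": f"Bearer {os.environ['PDAP_API_KEY']}"}


def run_archive_pass(archive_url) -> None:
    """Fetch sources needing a cache, archive each one, then report the results.

    `archive_url` is any callable that submits a URL to the Internet Archive,
    e.g. the SPN2 submit function sketched earlier.
    """
    # 1. Pull the list of data sources that still need (re-)archiving.
    sources = requests.get(f"{DATA_SOURCES_API}/archives", headers=AUTH, timeout=30).json()

    updates = []
    today = datetime.date.today().isoformat()
    for source in sources:
        entry = {"id": source["id"], "last_cached": None, "broken_source_url_as_of": None}
        try:
            # 2. Hand the URL to the Internet Archive instead of savepagenow.
            archive_url(source["source_url"])
            entry["last_cached"] = today
        except requests.RequestException:
            # 3. Flag URLs that could not be reached or archived as broken.
            entry["broken_source_url_as_of"] = today
        updates.append(entry)

    # 4. Send results back so the second "archives" route can update
    #    last_cached and broken_source_url_as_of in the table.
    requests.put(f"{DATA_SOURCES_API}/archives", headers=AUTH, json=updates, timeout=30)
```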

One last note: I'm not sure what the future plans for the automatic-archives repo are, but given that it's basically all running in one script, I'm wondering if it would make sense to integrate it into the Data Sources API instead of keeping a separate repo. We'd just need to set up a workflow to run this script on a regular basis. I'm not super familiar with setting up workflows, but I imagine it wouldn't be too difficult since we already have the update.yml file in the automatic-archives repo.

kalenluciano commented 9 months ago

Re-opening this issue for the second half of the work (the automatic-archives PR).