MicheleTobias closed this issue 2 years ago
I'm looking for some resources to help with the Internet Archive API. Here's one thing that might be helpful: the official documentation. The API should have a way to get individual archived pages, because that's what the website is built on.
The first two links in the "Scrape Internet Content" section of this page look useful: https://www.inmotionhosting.com/support/website/backup-and-restore/how-to-recover-your-content-from-wayback-machine-internet-archive/ They are Python tools, so if we can figure out together what parameters to use, I can run them (unless you want to work in Python, which is totally cool if you do).
Here's an idea: build the URLs and use wildcards to handle the timestamp. For example, in https://web.archive.org/web/20210422231440/https://campusready.ucdavis.edu/potential-exposure the string 20210422231440 is the date-time stamp of the snapshot. If you replace it with 20210422?, the API fills in the most recent timestamp for that day. There's generally only one timestamp per day, so we could just walk through the calendar days and scrape what's there each day, if anything.
Also, the Wayback Machine Scraper looks really promising for getting the page IDs, but I haven't gotten it to work yet.
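As a fallback for enumerating snapshots, the Wayback Machine also exposes a CDX API that lists every capture of a URL. A minimal sketch of building such a query follows; it only constructs the query string (fetching and parsing the JSON response are left out), and the date range shown is just an illustration.

```python
from typing import Optional
from urllib.parse import urlencode

def cdx_query(target: str,
              from_ts: Optional[str] = None,
              to_ts: Optional[str] = None) -> str:
    """Build a Wayback Machine CDX API query URL listing captures of `target`.

    `from_ts` / `to_ts` are optional timestamp prefixes (e.g. "20210401")
    bounding the capture range.
    """
    params = {"url": target, "output": "json"}
    if from_ts:
        params["from"] = from_ts
    if to_ts:
        params["to"] = to_ts
    return "https://web.archive.org/cdx/search/cdx?" + urlencode(params)

# Example: list April 2021 captures of the page discussed in this thread.
print(cdx_query("campusready.ucdavis.edu/potential-exposure",
                "20210401", "20210430"))
```

Requesting that URL returns a JSON array of captures, each including the full timestamp needed to reconstruct a snapshot URL.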
I will take a look at all of this, thank you for the help!
Success! I got the Wayback Scraper working in the command line to get the URLs:
wayback-scraper -u http://campusready.ucdavis.edu/potential-exposure -o json
Amazing! I will use that then.
It worked amazingly well. Let me know if you have questions.
Handled with PR #22
From Tyler Shoemaker:
Once we scrape the current data (and set up an automated system going forward), we can consider scraping the data the Wayback Machine has.