Scrape Historical Data - Githubissues

datalab-dev / covid_worksite_exposure

Scraping and visualizing the UC Davis Potential Worksite Exposure Reporting (AB 685) data

MIT License


Scrape Historical Data #3

Closed MicheleTobias closed 2 years ago

MicheleTobias commented 3 years ago

From Tyler Shoemaker:

Wayback has been taking snapshots of the worksite exposure site since January: https://web.archive.org/web/*/https://campusready.ucdavis.edu/potential-exposure

Once we scrape the current data (and set up an automated system going forward), we can consider scraping the data the Wayback Machine has.

MicheleTobias commented 3 years ago

I'm looking for some resources to help with the Internet Archive API. Here's one thing that might be helpful: the official documentation. The API should have a way to get individual archived pages because that's what the website is built on.
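One way to list individual archived pages is the Wayback Machine's CDX Server API, which returns one row per capture. The sketch below is only an illustration and not what this project ended up using: the function names are made up, and the parameter choices (`collapse=timestamp:8` to keep one capture per day, `filter=statuscode:200` to skip redirects and errors) are assumptions about what would be useful here. It builds the query URL and converts the JSON rows into snapshot URLs without making any network request itself.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"
PAGE_URL = "campusready.ucdavis.edu/potential-exposure"


def build_cdx_query(page_url, start="2021", end=None):
    """Return a CDX query URL that lists at most one capture per day."""
    params = {
        "url": page_url,
        "output": "json",
        "from": start,                # partial timestamps such as "2021" are accepted
        "collapse": "timestamp:8",    # first 8 digits are YYYYMMDD, so one row per day
        "filter": "statuscode:200",   # assumption: we only want successful captures
    }
    if end:
        params["to"] = end
    return CDX_ENDPOINT + "?" + urlencode(params)


def snapshot_urls(rows):
    """Turn CDX JSON rows (header row first) into Wayback snapshot URLs."""
    if not rows:
        return []
    header, *records = rows
    ts = header.index("timestamp")
    orig = header.index("original")
    return [f"https://web.archive.org/web/{r[ts]}/{r[orig]}" for r in records]
```

To use it, fetch the URL from `build_cdx_query(PAGE_URL)` with any HTTP client, parse the body as JSON, and pass the resulting list to `snapshot_urls`.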

The first two links in the "Scrape Internet Content" part of this page look useful: https://www.inmotionhosting.com/support/website/backup-and-restore/how-to-recover-your-content-from-wayback-machine-internet-archive/ They are Python tools, so if we can figure out together what parameters to use, I can run them (unless you want to work in Python, which is totally cool if you do).

MicheleTobias commented 3 years ago

Here's an idea: build the URLs and use wildcards to handle the timestamp.

For example, in https://web.archive.org/web/20210422231440/https://campusready.ucdavis.edu/potential-exposure the string 20210422231440 is the date-time stamp of the snapshot (YYYYMMDDhhmmss). If you replace it with 20210422?, the API fills in the most recent timestamp. There's generally only one snapshot per day, so we could just walk through the calendar days and scrape what's there each day, if anything.
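The calendar walk could look something like the sketch below. It only builds the candidate URLs and fetches nothing. The function name is made up, and it assumes that a date-only stamp (YYYYMMDD, with no wildcard character) is enough for the Wayback Machine to redirect to a nearby snapshot, which would need to be checked against the real site.

```python
from datetime import date, timedelta

WAYBACK_PREFIX = "https://web.archive.org/web"
PAGE_URL = "https://campusready.ucdavis.edu/potential-exposure"


def daily_wayback_urls(start, end, page_url=PAGE_URL):
    """Yield (day, url) pairs, one per calendar day from start to end inclusive."""
    day = start
    while day <= end:
        stamp = day.strftime("%Y%m%d")  # date-only stamp; the time part is left off
        yield day, f"{WAYBACK_PREFIX}/{stamp}/{page_url}"
        day += timedelta(days=1)


if __name__ == "__main__":
    for day, url in daily_wayback_urls(date(2021, 4, 22), date(2021, 4, 24)):
        print(day.isoformat(), url)
```

If a day has no snapshot, the redirect may land on a different day's capture, so a scraper built on this would probably want to compare the timestamp in the final URL against the requested day and skip any mismatches.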

Also, the Wayback Machine Scraper looks really promising to get the page IDs, but I haven't gotten it to work yet.

erklopez commented 3 years ago

I will take a look at all of this, thank you for the help!

MicheleTobias commented 3 years ago

Success! I got the Wayback Scraper to work in the command line to get the URLs!

wayback-scraper -u http://campusready.ucdavis.edu/potential-exposure -o json

erklopez commented 3 years ago

Amazing! I will use that then.

MicheleTobias commented 3 years ago

It worked amazingly well. Let me know if you have questions.

MicheleTobias commented 2 years ago

Handled with PR #22