Open nickrsan opened 7 years ago
I tried to archive this site with brozzler, which didn't work (the JavaScript widgets didn't render). I then ran wget --mirror and tested the result on a VM, where the pages did render with the default filter settings.
However, I discovered at least three URLs that are backed by server-side filtering code.
I could probably implement a simple Django site that fairly faithfully reproduces the site, but would it be better to go look for something else to archive?
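For anyone repeating the wget step, here is a rough sketch of how the mirror command might be assembled. Only `--mirror` comes from the description above; the other flags are real wget options that are commonly paired with it, but they are an assumption about what a browsable local copy would need, not the exact command that was run. The helper name `build_wget_mirror_command` is hypothetical.

```python
import shlex


def build_wget_mirror_command(url):
    """Return an argv list for mirroring `url` with wget.

    Only --mirror is taken from the issue text; the remaining flags are
    assumptions chosen to make the copy browsable offline.
    """
    return [
        "wget",
        "--mirror",            # recursive download with timestamping
        "--page-requisites",   # also fetch the CSS, JS, and images each page needs
        "--convert-links",     # rewrite links so they work in the local copy
        "--adjust-extension",  # save HTML/CSS with matching file extensions
        "--no-parent",         # stay below the starting directory
        url,
    ]


# Print the command for inspection; pass the list to subprocess.run to execute it.
print(shlex.join(build_wget_mirror_command("https://www.ncdc.noaa.gov/billions/")))
```

Note that a mirror built this way only captures the responses wget happened to request, so the server-side filter URLs would still return whatever was saved for the default view.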
Is there a bulk/raw data download available anywhere on the www.ncdc.noaa.gov/billions site? If so, that would be great to have. If not, it's probably best to move on to the next issue. These sorts of dynamic sites are too difficult to archive in the short time that we have. Thanks!
Yes, there were some raw data links, and I grabbed them. I extracted my mirror into an Apache document root and browsed it. With the default settings the pages load the same as on the official site, but obviously the filters don't work because it's just a static mirror.
I wrote a quick README describing the extra steps I took above and beyond running wget, copied in the brozzler-produced warc.gz, and zipped the whole thing up. Even with the duplication, the zip file is only 15MB.
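If Apache isn't handy, a similar check can be sketched with Python's built-in static file server. This swaps in `http.server` for the Apache document root described above, and `serve_mirror` is a hypothetical helper name. Like any static server, it ignores the query string and returns the saved file, which is why the filters have no effect on a plain mirror.

```python
import functools
import http.server
import threading


def serve_mirror(mirror_dir, port=0):
    """Serve the extracted mirror directory on localhost in a background thread.

    port=0 lets the OS pick a free port; read it from server.server_address.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(mirror_dir)
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    # Daemon thread: the server stops when the calling process exits.
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


# Example use (the directory name is an assumption about where wget put the files):
#   server = serve_mirror("www.ncdc.noaa.gov")
#   print("browse at http://%s:%d/billions/" % server.server_address)
#   ...
#   server.shutdown()
```

Requesting the same page with and without filter parameters should return identical bytes, which is a quick way to confirm which URLs depend on server-side code.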
zip file: https://drive.google.com/open?id=0B76qh7pWLKB3TF9xTXBuTktBOU0
zip.gpg file: https://drive.google.com/open?id=0B76qh7pWLKB3UWNtUGJYMnhUTnc
I made some manual librarian-ing progress constructing some metadata for this over at https://github.com/daniellecrobinson/Data-Rescue-PDX/issues/20
Name: Billion-Dollar Weather and Climate Disasters
Organization: NOAA
Description URL: https://www.ncdc.noaa.gov/billions/
Download URL:
File Types:
Size:
Status: