Open max-mapper opened 7 years ago
Early stats from first 500 files (~5gb)
max@burrito:/media/hd4$ cat results.json | grep errMsg | wc -l
17
max@burrito:/media/hd4$ wc -l results.json
606 results.json
max@burrito:/media/hd4$ du -sh data
5.4G data
The 17 so far that seem to be bad links (the errors counted above):
{"url":"http://inpredwgis2.nps.doi.net/documents/0005603.pdf","date":"2017-02-15T04:40:24.075Z","id":559841,"error":{},"errMsg":"Max retries exceeded: getaddrinfo ENOTFOUND inpredwgis2.nps.doi.net inpredwgis2.nps.doi.net:80"}
{"url":"http://inpredwgis2/documents/0005597.pdf","date":"2017-02-15T04:40:24.095Z","id":559824,"error":{},"errMsg":"Max retries exceeded: getaddrinfo ENOTFOUND inpredwgis2 inpredwgis2:80"}
{"url":"http://ia701209.us.archive.org/21/items/biogeochemistryo00stot/biogeochemistryo00stot.pdf","date":"2017-02-15T04:40:24.118Z","id":466134,"error":{},"errMsg":"Max retries exceeded: getaddrinfo ENOTFOUND ia701209.us.archive.org ia701209.us.archive.org:80"}
{"url":"http://ia700702.us.archive.org/2/items/ecologyofsaguaro00sagu/ecologyofsaguaro00sagu.pdf","date":"2017-02-15T04:40:24.148Z","id":466164,"error":{},"errMsg":"Max retries exceeded: getaddrinfo ENOTFOUND ia700702.us.archive.org ia700702.us.archive.org:80"}
{"url":"http://inpredwgis2.nps.doi.net/documents/0005720.pdf","date":"2017-02-15T04:40:24.173Z","id":560167,"error":{},"errMsg":"Max retries exceeded: getaddrinfo ENOTFOUND inpredwgis2.nps.doi.net inpredwgis2.nps.doi.net:80"}
{"url":"http://inpredwgis2.nps.doi.net/documents/0005771.pdf","date":"2017-02-15T04:40:24.174Z","id":560283,"error":{},"errMsg":"Max retries exceeded: getaddrinfo ENOTFOUND inpredwgis2.nps.doi.net inpredwgis2.nps.doi.net:80"}
{"url":"http://nrpcsharepoint/climatechange/communication/Bioregional%20Talking%20Points/Forms/AllItems.aspx","date":"2017-02-15T04:40:24.179Z","id":356400,"error":{},"errMsg":"Max retries exceeded: getaddrinfo ENOTFOUND nrpcsharepoint nrpcsharepoint:80"}
{"url":"http://inpredwgis2.nps.doi.net/documents/0006195.pdf","date":"2017-02-15T04:40:24.188Z","id":562153,"error":{},"errMsg":"Max retries exceeded: getaddrinfo ENOTFOUND inpredwgis2.nps.doi.net inpredwgis2.nps.doi.net:80"}
{"url":"https://science1.nature.nps.gov/naturebib/biodiversity/2010-3-30/UCBN_SageVeg_2009AnnualReport_20100315.pdf","date":"2017-02-15T04:40:24.234Z","id":356992,"error":{},"errMsg":"Max retries exceeded: getaddrinfo ENOTFOUND science1.nature.nps.gov science1.nature.nps.gov:443"}
{"url":"http://ia600607.us.archive.org/8/items/impactofhumanuse00chiso/impactofhumanuse00chiso.pdf","date":"2017-02-15T04:40:24.383Z","id":466141,"error":{},"errMsg":"Max retries exceeded: connect ECONNREFUSED 207.241.227.197:80"}
{"url":"http://www.nrmsc.usgs.gov/research/glacier_retreat.htm","date":"2017-02-15T04:40:29.708Z","id":491780,"error":{},"errMsg":"Max retries exceeded: getaddrinfo ENOTFOUND www.nrmsc.usgs.gov www.nrmsc.usgs.gov:80"}
{"url":"http://inpredwgis2.nps.doi.net/documents/0005973.pdf","date":"2017-02-15T04:40:38.311Z","id":560924,"error":{},"errMsg":"Max retries exceeded: getaddrinfo ENOTFOUND inpredwgis2.nps.doi.net inpredwgis2.nps.doi.net:80"}
{"url":"http://inpredwgis2.nps.doi.net/documents/0006214.pdf","date":"2017-02-15T04:40:38.554Z","id":562262,"error":{},"errMsg":"Max retries exceeded: getaddrinfo ENOTFOUND inpredwgis2.nps.doi.net inpredwgis2.nps.doi.net:80"}
{"url":"http://inpredwgis2.nps.doi.net/documents/0005631.pdf","date":"2017-02-15T04:46:05.938Z","id":559914,"error":{},"errMsg":"Max retries exceeded: getaddrinfo ENOTFOUND inpredwgis2.nps.doi.net inpredwgis2.nps.doi.net:80"}
{"url":"http://inpredwgis2.nps.doi.net/documents/0006202.pdf","date":"2017-02-15T04:47:40.119Z","id":562168,"error":{},"errMsg":"Max retries exceeded: getaddrinfo ENOTFOUND inpredwgis2.nps.doi.net inpredwgis2.nps.doi.net:80"}
{"url":"http://www1.nrintra.nps.gov/ard/research/docs/UCBN_2012_AnnualReport.pdf","date":"2017-02-15T04:52:19.498Z","id":481549,"error":{},"errMsg":"Max retries exceeded: getaddrinfo ENOTFOUND www1.nrintra.nps.gov www1.nrintra.nps.gov:80"}
{"url":"http://ipcc-wg2.gov/SREX/","date":"2017-02-15T04:52:39.218Z","id":499833,"error":{},"errMsg":"Max retries exceeded: getaddrinfo ENOTFOUND ipcc-wg2.gov ipcc-wg2.gov:80"}
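To see which hosts account for the failures, the error lines can be grouped by hostname and error code. This is only a minimal sketch: it assumes results.json holds one JSON object per line (as the entries above suggest) and that successful downloads carry no `errMsg` field. The function name `summarize_errors` is made up for illustration and is not part of the downloader.

```python
import json
from collections import Counter
from urllib.parse import urlsplit


def summarize_errors(lines):
    """Count failed downloads per (hostname, error code).

    Assumes one JSON object per line and that only failures have "errMsg".
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        msg = record.get("errMsg")
        if msg is None:
            continue  # no error recorded for this entry
        host = urlsplit(record["url"]).hostname
        # Pick the first all-caps word starting with "E" (ENOTFOUND, ECONNREFUSED, ...).
        code = next(
            (w for w in msg.split() if w.startswith("E") and w.isupper()),
            "UNKNOWN",
        )
        counts[(host, code)] += 1
    return counts


# Two entries copied from the log above, trimmed to the fields used here.
sample = [
    '{"url":"http://inpredwgis2/documents/0005597.pdf","errMsg":"Max retries exceeded: getaddrinfo ENOTFOUND inpredwgis2 inpredwgis2:80"}',
    '{"url":"http://ia600607.us.archive.org/8/items/impactofhumanuse00chiso/impactofhumanuse00chiso.pdf","errMsg":"Max retries exceeded: connect ECONNREFUSED 207.241.227.197:80"}',
]
for (host, code), n in summarize_errors(sample).most_common():
    print(n, code, host)
```

Run against the full file, this would make it easier to tell internal-only hostnames (like `inpredwgis2`) apart from hosts that were just temporarily unreachable.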
Seems like the archive.org URLs should probably be on a different subdomain somewhere on archive.org...
For the doi.net ones, the IRMA page has 2 copies of what looks like the same file.
https://irma.nps.gov/DataStore/Reference/Profile/2236980
One link still resolves: https://irma.nps.gov/DataStore/DownloadFile/560613 but that one doesn't seem to be in the metadata.
max@burrito:/media/hd4$ cat data-rescue-nps-irma/urls.json | grep 560613
max@burrito:/media/hd4$
Thanks for the progress update! It looks like I greatly over-guesstimated the storage requirements for this, which is a relief.
Let me know if there's anything I can do to troubleshoot. I did a smaller-scale attempt earlier and noted some of the same problems you're finding, but I don't know if there's much we can do on our end to resolve them without some technical help from the staff maintaining IRMA.
Seems like the error rate is pretty low. Here's a status update:
max@burrito:/media/hd4$ du -sh data
239G data
max@burrito:/media/hd4$ wc -l results.json
11425 results.json
max@burrito:/media/hd4$ cat results.json | grep errMsg | wc -l
74
So assuming some uniformity in file sizes, it's looking like around 1TB total
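The back-of-the-envelope arithmetic behind that estimate can be written out as below. It is a rough sketch only: it treats the results.json line count as the number of items finished, takes the 40600 URLs mentioned in this thread as the total, and assumes the average file size stays constant, none of which is guaranteed.

```python
def projected_total_gb(done_gb, done_items, total_items):
    """Linear extrapolation: average size so far times the total item count."""
    return done_gb / done_items * total_items


# 239G on disk after 11425 result lines; 40600 URLs queued in total (assumed).
estimate = projected_total_gb(239, 11425, 40600)
print(f"{estimate:.0f}G projected")
```

That lands at roughly 850G, so "around 1TB" is a reasonable round figure once logs and larger-than-average files are allowed for.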
Seems to have slowed a little, but still progressing:
max@burrito:/media/hd4$ du -sh data
271G data
max@burrito:/media/hd4$ wc -l results.json
12501 results.json
I need to add a per-host download-speed-over-time feature to my downloader so I can better understand the speed changes (I had similar issues when downloading 40TB with this toolchain from data.gov)
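One possible shape for such a feature, sketched in Python purely for illustration: bytes received are bucketed per host into fixed time windows, and each window is turned into an average rate. The class name `HostSpeedTracker` and its methods are hypothetical and not part of the actual downloader.

```python
import time
from collections import defaultdict
from urllib.parse import urlsplit


class HostSpeedTracker:
    """Accumulate received bytes per host into fixed-width time buckets."""

    def __init__(self, bucket_seconds=60):
        self.bucket_seconds = bucket_seconds
        # host -> bucket start (epoch seconds) -> bytes received in that bucket
        self.buckets = defaultdict(lambda: defaultdict(int))

    def record(self, url, nbytes, now=None):
        """Call whenever a chunk (or a whole file) arrives for `url`."""
        now = time.time() if now is None else now
        host = urlsplit(url).hostname
        bucket = int(now // self.bucket_seconds) * self.bucket_seconds
        self.buckets[host][bucket] += nbytes

    def speeds(self, host):
        """Return [(bucket_start, average_bytes_per_second), ...] in time order."""
        per_bucket = self.buckets.get(host, {})
        return [
            (start, total / self.bucket_seconds)
            for start, total in sorted(per_bucket.items())
        ]


tracker = HostSpeedTracker(bucket_seconds=10)
tracker.record("https://irma.nps.gov/DataStore/DownloadFile/560613", 1000, now=100)
tracker.record("https://irma.nps.gov/DataStore/DownloadFile/560613", 1000, now=105)
tracker.record("https://irma.nps.gov/DataStore/DownloadFile/560613", 500, now=112)
print(tracker.speeds("irma.nps.gov"))
```

Plotting the per-bucket rates for each host would show whether a slowdown comes from one creaky server or from the whole pipeline.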
Interesting. The IRMA site goes down from time to time also. Hopefully they can keep it going until your process is completed.
Fingers crossed!
max@burrito:/media/hd4$ du -sh data
384G data
max@burrito:/media/hd4$ wc -l results.json
14862 results.json
Slowing down a lot, isn't it? I think IRMA is pretty creaky.
@maxogden
At the request of an NPS employee, I've made an additional compressed directory, with more IRMA items that require safeguarding. The new directory is "data-2.zip" and is otherwise just like "data.zip" that you are currently processing.
@ekansa great thanks!
500GB :D
797G, 33,000 items done
Alright! We're getting close to 1 TB! Thanks for monitoring this!
OK, it's been a busy week and my download apparently stalled at 797GB, but I just restarted it from where it left off and it's progressing again.
Awesome and thanks! I've asked for more insight into other threatened data stores or if this is a comprehensive enough harvest. No word yet, I think things are slow because everyone feels the need to go around via back-channels.
It's done! I'll do the 2nd zip next
Just getting going on data-2 now.
Only issue was
Error: ENOENT: no such file or directory, open './data-2/2224545/2224545-files.json'
But I just ignored that folder (was empty), no big deal
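Skipping such folders can also be done automatically rather than by hand. The sketch below assumes the layout implied by the error path (`<root>/<id>/<id>-files.json`) and that the manifest is plain JSON. The function name `iter_file_lists` is hypothetical, and this is Python rather than whatever the real script uses.

```python
import json
from pathlib import Path


def iter_file_lists(root):
    """Yield (item_id, parsed manifest) per item folder, skipping incomplete ones."""
    for folder in sorted(Path(root).iterdir()):
        if not folder.is_dir():
            continue
        manifest = folder / f"{folder.name}-files.json"
        try:
            with manifest.open() as fh:
                yield folder.name, json.load(fh)
        except FileNotFoundError:
            # Python's counterpart of ENOENT: empty folder or missing manifest.
            print(f"skipping {folder.name}: no {manifest.name}")
```

Logging the skipped IDs instead of silently dropping them keeps a record of which items may need another look later.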
Downloads going now!
How's this latest batch going?
Sorry, forgot to respond! The second batch was small and finished pretty quickly. I have about a terabyte of data including all my download logs. Do you have a plan for hosting these long term or should I look into different options? P.S. I'll try to distribute them as a Dat repository soon
I think it is time to accession into the California Digital Library along with the data.gov captures. We should also ask the Internet Archive for secondary archiving.
Hi, I'm working on an initial download and will post my progress here.
I have a script that grabs the URLs and makes a file I can feed into my parallel downloader
Interestingly there are some duplicates:
I'm starting the download of those 40600 now
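For anyone repeating this, duplicates in a URL list can be spotted and dropped before feeding it to a downloader with something like the following. It is a minimal sketch that assumes the URLs have already been read into a list of strings; the format of the generated URL file is not shown in this thread, and `split_duplicates` is a made-up name.

```python
from collections import Counter


def split_duplicates(urls):
    """Return (unique URLs in first-seen order, {url: count} for repeated URLs)."""
    urls = list(urls)
    counts = Counter(urls)
    unique = list(dict.fromkeys(urls))  # dicts preserve insertion order
    dupes = {url: n for url, n in counts.items() if n > 1}
    return unique, dupes


unique, dupes = split_duplicates([
    "https://irma.nps.gov/DataStore/DownloadFile/560613",
    "http://inpredwgis2.nps.doi.net/documents/0005603.pdf",
    "https://irma.nps.gov/DataStore/DownloadFile/560613",
])
print(len(unique), "unique;", len(dupes), "repeated")
```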