ekansa / data-rescue-nps-irma

Data Rescue for US NPS IRMA Database, Metadata and File Manifests

Download Progress #1

Open max-mapper opened 7 years ago

max-mapper commented 7 years ago

Hi, I'm working on an initial download and will post my progress here.

I have a script that grabs the URLs and makes a file I can feed into my parallel downloader:

// Emit one newline-delimited JSON record per file URL listed in the item manifests
var ids = require('./data/item_ids.json')
var fs = require('fs')
var ndjson = require('ndjson')

var stream = ndjson.serialize()

stream.pipe(process.stdout)

ids.forEach(function (id) {
  // Each item folder holds a manifest named <id>-files.json
  var file = fs.readFileSync('./data/' + id + '/' + id + '-files.json').toString()
  var json = JSON.parse(file)
  json.forEach(function (j) {
    stream.write({folder: id, url: j.Url, id: j.Id})
  })
})

stream.end()

Interestingly there are some duplicates:

max@burrito:/media/hd4/data-rescue-nps-irma$ node ndjson.js > urls.json
max@burrito:/media/hd4/data-rescue-nps-irma$ wc -l urls.json 
42144 urls.json
max@burrito:/media/hd4/data-rescue-nps-irma$ cat urls.json | jsonfilter url | sort | uniq | wc -l
40600

I'm starting the download of those 40600 now.
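The `sort | uniq` count above can also be reproduced in Node, which makes it easy to keep one full record per URL rather than just the URL strings. This is a minimal sketch, assuming each line of urls.json has the `{folder, url, id}` shape written by the script above; the function name `dedupeByUrl` is made up for illustration.

```javascript
// Sketch: keep the first record seen for each URL in a newline-delimited JSON dump.
// Assumes each line looks like {"folder": ..., "url": ..., "id": ...}.
function dedupeByUrl (ndjsonText) {
  var seen = new Map()
  ndjsonText.split('\n').forEach(function (line) {
    if (!line.trim()) return // skip blank lines, e.g. after the trailing newline
    var record = JSON.parse(line)
    if (!seen.has(record.url)) seen.set(record.url, record)
  })
  return Array.from(seen.values())
}
```

Calling it on the contents of urls.json (read with `fs.readFileSync('urls.json', 'utf8')`) would give an array whose length should match the `uniq` count, as long as duplicate URLs are byte-for-byte identical.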

max-mapper commented 7 years ago

Early stats from first 500 files (~5gb)

max@burrito:/media/hd4$ cat results.json | grep errMsg | wc -l
17
max@burrito:/media/hd4$ wc -l results.json 
606 results.json
max@burrito:/media/hd4$ du -sh data
5.4G    data

The 17 so far that seem to be bad links (the errors counted above):

{"url":"http://inpredwgis2.nps.doi.net/documents/0005603.pdf","date":"2017-02-15T04:40:24.075Z","id":559841,"error":{},"errMsg":"Max retries exceeded: getaddrinfo ENOTFOUND inpredwgis2.nps.doi.net inpredwgis2.nps.doi.net:80"}
{"url":"http://inpredwgis2/documents/0005597.pdf","date":"2017-02-15T04:40:24.095Z","id":559824,"error":{},"errMsg":"Max retries exceeded: getaddrinfo ENOTFOUND inpredwgis2 inpredwgis2:80"}
{"url":"http://ia701209.us.archive.org/21/items/biogeochemistryo00stot/biogeochemistryo00stot.pdf","date":"2017-02-15T04:40:24.118Z","id":466134,"error":{},"errMsg":"Max retries exceeded: getaddrinfo ENOTFOUND ia701209.us.archive.org ia701209.us.archive.org:80"}
{"url":"http://ia700702.us.archive.org/2/items/ecologyofsaguaro00sagu/ecologyofsaguaro00sagu.pdf","date":"2017-02-15T04:40:24.148Z","id":466164,"error":{},"errMsg":"Max retries exceeded: getaddrinfo ENOTFOUND ia700702.us.archive.org ia700702.us.archive.org:80"}
{"url":"http://inpredwgis2.nps.doi.net/documents/0005720.pdf","date":"2017-02-15T04:40:24.173Z","id":560167,"error":{},"errMsg":"Max retries exceeded: getaddrinfo ENOTFOUND inpredwgis2.nps.doi.net inpredwgis2.nps.doi.net:80"}
{"url":"http://inpredwgis2.nps.doi.net/documents/0005771.pdf","date":"2017-02-15T04:40:24.174Z","id":560283,"error":{},"errMsg":"Max retries exceeded: getaddrinfo ENOTFOUND inpredwgis2.nps.doi.net inpredwgis2.nps.doi.net:80"}
{"url":"http://nrpcsharepoint/climatechange/communication/Bioregional%20Talking%20Points/Forms/AllItems.aspx","date":"2017-02-15T04:40:24.179Z","id":356400,"error":{},"errMsg":"Max retries exceeded: getaddrinfo ENOTFOUND nrpcsharepoint nrpcsharepoint:80"}
{"url":"http://inpredwgis2.nps.doi.net/documents/0006195.pdf","date":"2017-02-15T04:40:24.188Z","id":562153,"error":{},"errMsg":"Max retries exceeded: getaddrinfo ENOTFOUND inpredwgis2.nps.doi.net inpredwgis2.nps.doi.net:80"}
{"url":"https://science1.nature.nps.gov/naturebib/biodiversity/2010-3-30/UCBN_SageVeg_2009AnnualReport_20100315.pdf","date":"2017-02-15T04:40:24.234Z","id":356992,"error":{},"errMsg":"Max retries exceeded: getaddrinfo ENOTFOUND science1.nature.nps.gov science1.nature.nps.gov:443"}
{"url":"http://ia600607.us.archive.org/8/items/impactofhumanuse00chiso/impactofhumanuse00chiso.pdf","date":"2017-02-15T04:40:24.383Z","id":466141,"error":{},"errMsg":"Max retries exceeded: connect ECONNREFUSED 207.241.227.197:80"}
{"url":"http://www.nrmsc.usgs.gov/research/glacier_retreat.htm","date":"2017-02-15T04:40:29.708Z","id":491780,"error":{},"errMsg":"Max retries exceeded: getaddrinfo ENOTFOUND www.nrmsc.usgs.gov www.nrmsc.usgs.gov:80"}
{"url":"http://inpredwgis2.nps.doi.net/documents/0005973.pdf","date":"2017-02-15T04:40:38.311Z","id":560924,"error":{},"errMsg":"Max retries exceeded: getaddrinfo ENOTFOUND inpredwgis2.nps.doi.net inpredwgis2.nps.doi.net:80"}
{"url":"http://inpredwgis2.nps.doi.net/documents/0006214.pdf","date":"2017-02-15T04:40:38.554Z","id":562262,"error":{},"errMsg":"Max retries exceeded: getaddrinfo ENOTFOUND inpredwgis2.nps.doi.net inpredwgis2.nps.doi.net:80"}
{"url":"http://inpredwgis2.nps.doi.net/documents/0005631.pdf","date":"2017-02-15T04:46:05.938Z","id":559914,"error":{},"errMsg":"Max retries exceeded: getaddrinfo ENOTFOUND inpredwgis2.nps.doi.net inpredwgis2.nps.doi.net:80"}
{"url":"http://inpredwgis2.nps.doi.net/documents/0006202.pdf","date":"2017-02-15T04:47:40.119Z","id":562168,"error":{},"errMsg":"Max retries exceeded: getaddrinfo ENOTFOUND inpredwgis2.nps.doi.net inpredwgis2.nps.doi.net:80"}
{"url":"http://www1.nrintra.nps.gov/ard/research/docs/UCBN_2012_AnnualReport.pdf","date":"2017-02-15T04:52:19.498Z","id":481549,"error":{},"errMsg":"Max retries exceeded: getaddrinfo ENOTFOUND www1.nrintra.nps.gov www1.nrintra.nps.gov:80"}
{"url":"http://ipcc-wg2.gov/SREX/","date":"2017-02-15T04:52:39.218Z","id":499833,"error":{},"errMsg":"Max retries exceeded: getaddrinfo ENOTFOUND ipcc-wg2.gov ipcc-wg2.gov:80"}

Seems like the archive.org URLs should probably be on a different subdomain somewhere on archive.org...
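To see at a glance which hosts account for the failures listed above, the error records can be grouped by hostname. This is a rough sketch, assuming failed records carry `url` and `errMsg` fields as in the log lines above and that successful records have no `errMsg`; `countErrorsByHost` is an illustrative name.

```javascript
// Sketch: count failed downloads per hostname from newline-delimited results.
function countErrorsByHost (ndjsonText) {
  var counts = {}
  ndjsonText.split('\n').forEach(function (line) {
    if (!line.trim()) return // skip blank lines
    var record = JSON.parse(line)
    if (!record.errMsg) return // assumption: only failures have an errMsg field
    var host = new URL(record.url).hostname
    counts[host] = (counts[host] || 0) + 1
  })
  return counts
}
```

Run over results.json, this would return an object mapping each failing host (for example `inpredwgis2.nps.doi.net`) to its number of failed URLs.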

max-mapper commented 7 years ago

For the doi.net ones, the IRMA page has 2 copies of what looks like the same file.

https://irma.nps.gov/DataStore/Reference/Profile/2236980

One link still resolves: https://irma.nps.gov/DataStore/DownloadFile/560613 but that one doesn't seem to be in the metadata.

max@burrito:/media/hd4$ cat data-rescue-nps-irma/urls.json | grep 560613
max@burrito:/media/hd4$ 

ekansa commented 7 years ago

Thanks for the progress update! It looks like I greatly over-guesstimated the storage requirements for this, which is a relief.

Let me know if there's anything I can do to troubleshoot. I did a smaller-scale attempt earlier and noted some of the same problems you're finding, but I don't know if there's much we can do on our end to resolve them without some technical help from the staff maintaining IRMA.

max-mapper commented 7 years ago

Seems like the error rate is pretty low. Here's a status update:

max@burrito:/media/hd4$ du -sh data
239G    data
max@burrito:/media/hd4$ wc -l results.json 
11425 results.json
max@burrito:/media/hd4$ cat results.json | grep errMsg | wc -l
74

So assuming some uniformity in file sizes, it's looking like around 1TB.
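The size estimate is a straight-line extrapolation from the numbers in this update. A minimal sketch, assuming the remaining files average the same size as those fetched so far and treating every line in results.json as one processed URL (which slightly overcounts, since some lines are errors); `estimateTotalGB` is an illustrative name.

```javascript
// Sketch: linear extrapolation of the total download size from progress so far.
function estimateTotalGB (downloadedGB, resultsSoFar, totalUrls) {
  return downloadedGB * (totalUrls / resultsSoFar)
}

// Figures from this thread: 239G on disk, 11425 result lines, 40600 unique URLs.
console.log(Math.round(estimateTotalGB(239, 11425, 40600)) + ' GB') // prints "849 GB"
```

That lands a little under the rough 1TB figure, which is about as close as the uniform-size assumption allows.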

max-mapper commented 7 years ago

Seems to have slowed a little, but still progressing:

max@burrito:/media/hd4$ du -sh data
271G    data
max@burrito:/media/hd4$ wc -l results.json 
12501 results.json

I need to add a per-host download-speed-over-time feature to my downloader so I can better understand the speed changes (I had similar issues when downloading 40 TB from data.gov with this toolchain).
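One possible shape for such a feature is sketched below: bytes are accumulated per host in fixed time windows and reported as bytes per second. Everything here (`createSpeedTracker`, `record`, `report`, the window size) is hypothetical and not taken from any existing downloader.

```javascript
// Sketch: accumulate downloaded bytes per host in fixed time windows.
function createSpeedTracker (windowMs) {
  var buckets = {} // host -> { windowStartMs -> bytes received in that window }
  return {
    // Call this whenever a chunk of data arrives from a host.
    record: function (host, bytes, timestampMs) {
      var start = Math.floor(timestampMs / windowMs) * windowMs
      if (!buckets[host]) buckets[host] = {}
      buckets[host][start] = (buckets[host][start] || 0) + bytes
    },
    // Returns one row per host per window with the average speed in that window.
    report: function () {
      var rows = []
      Object.keys(buckets).forEach(function (host) {
        Object.keys(buckets[host]).forEach(function (start) {
          rows.push({
            host: host,
            windowStart: new Date(Number(start)).toISOString(),
            bytesPerSecond: buckets[host][start] / (windowMs / 1000)
          })
        })
      })
      return rows
    }
  }
}
```

A downloader could call `record(host, chunk.length, Date.now())` from its data handler and dump `report()` periodically, which would show whether a slowdown comes from one host or from all of them.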

ekansa commented 7 years ago

Interesting. The IRMA site goes down from time to time also. Hopefully they can keep it going until your process is completed.

Fingers crossed!

max-mapper commented 7 years ago

max@burrito:/media/hd4$ du -sh data
384G    data
max@burrito:/media/hd4$ wc -l results.json
14862 results.json

ekansa commented 7 years ago

Slowing down a lot, isn't it? I think IRMA is pretty creaky.

ekansa commented 7 years ago

@maxogden

At the request of an NPS employee, I've made an additional compressed directory, with more IRMA items that require safeguarding. The new directory is "data-2.zip" and is otherwise just like "data.zip" that you are currently processing.

max-mapper commented 7 years ago

@ekansa great thanks!

max-mapper commented 7 years ago

500GB :D

max-mapper commented 7 years ago

797G, 33,000 items done

ekansa commented 7 years ago

Alright! We're getting close to 1 TB! Thanks for monitoring this!

max-mapper commented 7 years ago

OK, it's been a busy week and my download apparently stalled at 797GB, but I just restarted it from where it left off and it's progressing again.

ekansa commented 7 years ago

Awesome and thanks! I've asked for more insight into other threatened data stores or if this is a comprehensive enough harvest. No word yet. I think things are slow because everyone feels the need to go around via back-channels.

max-mapper commented 7 years ago

It's done! I'll do the 2nd zip next

max-mapper commented 7 years ago

Just getting going on data-2 now.

Only issue was:

Error: ENOENT: no such file or directory, open './data-2/2224545/2224545-files.json'

But I just ignored that folder (was empty), no big deal

Downloads going now!
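For reference, the ENOENT above can be handled in the URL-listing script instead of being worked around by hand. A minimal sketch, assuming the same `<id>/<id>-files.json` layout as before; `readManifest` and its `baseDir` parameter are illustrative names.

```javascript
var fs = require('fs')
var path = require('path')

// Sketch: read one item's file manifest, returning an empty list when the
// manifest is missing (as with the empty folder above) instead of throwing.
function readManifest (baseDir, id) {
  var manifestPath = path.join(baseDir, String(id), id + '-files.json')
  try {
    return JSON.parse(fs.readFileSync(manifestPath, 'utf8'))
  } catch (err) {
    if (err.code === 'ENOENT') return [] // no manifest: treat as zero files
    throw err // anything else (bad JSON, permissions) should still surface
  }
}
```

With this, `readManifest('./data-2', 2224545)` would return an empty array, so the loop over item ids can carry on to the next folder.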

ekansa commented 7 years ago

How's this latest batch going?

max-mapper commented 7 years ago

Sorry, forgot to respond! The second batch was small and finished pretty quickly. I have about a terabyte of data including all my download logs. Do you have a plan for hosting these long term or should I look into different options? P.S. I'll try to distribute them as a Dat repository soon.

ekansa commented 7 years ago

I think it is time to accession into the California Digital Library along with the data.gov captures. We should also ask the Internet Archive for secondary archiving.