dat-ecosystem-archive / svalbard

A global metadata vault [ DEPRECATED - More info on active projects and modules at https://dat-ecosystem.org/ ]

Internet Archive Metadata Exporter #11

Open max-mapper opened 7 years ago

max-mapper commented 7 years ago

Need a toolchain for monitoring an IA collection, extracting the .CDX file manifests for each item on an ongoing basis, and converting the URL lists to NDJSON.
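As a rough illustration of the last step (CDX to NDJSON), here is a minimal Python sketch. It assumes the common space-delimited 11-field CDX layout with a header line such as ` CDX N b a m s k r M S V g`. The function name `cdx_to_ndjson` and the readable key names in `FIELD_NAMES` are hypothetical choices for this example and are not taken from the gist or from any IA tool.

```python
import json

# Assumption: readable names for the CDX header letters of the 11-field layout.
FIELD_NAMES = {
    "N": "urlkey", "b": "timestamp", "a": "url", "m": "mimetype",
    "s": "status", "k": "digest", "r": "redirect", "M": "meta",
    "S": "length", "V": "offset", "g": "filename",
}
# Assumption: field order to use when the input has no header line.
DEFAULT_FIELDS = "N b a m s k r M S V g".split()


def cdx_to_ndjson(lines):
    """Yield one JSON string per CDX record (hypothetical helper)."""
    fields = DEFAULT_FIELDS
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("CDX "):
            # Header line: the letters after "CDX" name the columns.
            fields = line.split()[1:]
            continue
        values = line.split()
        if len(values) != len(fields):
            # Skip rows that do not match the declared layout.
            continue
        record = {FIELD_NAMES.get(f, f): v for f, v in zip(fields, values)}
        yield json.dumps(record)


if __name__ == "__main__":
    import sys
    # Reads CDX text on stdin and writes NDJSON on stdout.
    for row in cdx_to_ndjson(sys.stdin):
        print(row)
```

If the CDX manifests are gzip-compressed, they would need to be decompressed first (for example with `gzip -dc`) before being piped into the script.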

Here's my first stab: https://gist.github.com/maxogden/89818ba6f14ab95b9d6051fa14deeb74. It had bugs with the IA API that I didn't know how to fix (I was getting duplicates back from the advancedsearch endpoint).
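One possible workaround for the duplicates, sketched in Python: drop repeated identifiers on the client side while keeping first-seen order. This does not address whatever causes the duplicates in the API, and it assumes the results are complete but repeated. The helper name `unique_identifiers` is hypothetical.

```python
def unique_identifiers(identifiers):
    """Yield each non-empty identifier once, in first-seen order."""
    seen = set()
    for raw in identifiers:
        ident = raw.strip()
        if ident and ident not in seen:
            seen.add(ident)
            yield ident
```

It can be fed any iterable of identifier strings, such as the lines of an item list file.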

Found out the ia tool (written in Python) works much better:

curl -LO https://archive.org/download/ia-pex/ia
chmod +x ia
ia --insecure search 'collection:EndOfTerm2016WebCrawls' --itemlist > eotitems.txt

bnewbold commented 7 years ago

See also: https://github.com/jjjake/iamine https://archive.org/details/ia_census_201604