Open max-mapper opened 7 years ago
Need a toolchain for monitoring an IA collection on an ongoing basis, extracting the .CDX file manifests for each item, and converting the URL lists to NDJSON.
Here's my first stab: https://gist.github.com/maxogden/89818ba6f14ab95b9d6051fa14deeb74. It had bugs with the IA API that I didn't know how to fix (I was getting duplicates back from the advancedsearch endpoint).
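One possible workaround for the duplicates, as a minimal sketch: filter the results client-side so each item is seen once. This assumes the advancedsearch JSON output has already been parsed and that each result is a dict with an `identifier` key; `dedupe_identifiers` is a hypothetical helper name, not something from the gist.

```python
def dedupe_identifiers(docs):
    """Yield each item identifier once, in first-seen order.

    `docs` is assumed to be an iterable of dicts that each carry an
    "identifier" key (an assumption about the parsed search results).
    """
    seen = set()
    for doc in docs:
        identifier = doc["identifier"]
        if identifier not in seen:
            seen.add(identifier)
            yield identifier
```

This only hides the duplicates; it does not explain why the endpoint returns them.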
Found out the ia tool in Python works much better:
    curl -LO https://archive.org/download/ia-pex/ia
    chmod +x ia
    ./ia --insecure search 'collection:EndOfTerm2016WebCrawls' --itemlist > eotitems.txt
See also: https://github.com/jjjake/iamine and https://archive.org/details/ia_census_201604
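For the CDX to NDJSON step, here is a rough sketch of what the conversion could look like. It assumes each CDX file starts with a header line such as ` CDX N b a m s k r M S V g`, where the first character is the field delimiter and the letters name the columns. The readable names in `FIELD_NAMES` are my own labels for the common letters, and `cdx_to_ndjson` is a hypothetical helper, so adjust both to whatever the real files contain.

```python
import json

# Readable names for common CDX field letters (labels are an assumption;
# letters not listed here are kept as-is for the JSON key).
FIELD_NAMES = {
    "N": "urlkey",
    "b": "timestamp",
    "a": "url",
    "m": "mimetype",
    "s": "status",
    "k": "digest",
    "r": "redirect",
    "M": "meta",
    "S": "length",
    "V": "offset",
    "g": "filename",
}


def cdx_to_ndjson(lines):
    """Yield one JSON string per CDX record.

    `lines` is any iterable of text lines whose first line is the CDX header.
    """
    lines = iter(lines)
    header = next(lines, None)
    if header is None:
        return  # empty input, nothing to convert
    header = header.rstrip("\n")
    delimiter = header[0]  # the header's first character is the delimiter
    parts = header[1:].split(delimiter)
    if parts[0] != "CDX":
        raise ValueError("missing CDX header line")
    keys = [FIELD_NAMES.get(letter, letter) for letter in parts[1:]]
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        yield json.dumps(dict(zip(keys, line.split(delimiter))))
```

Since it takes any iterable of lines, it should work on an open file handle (for example one from `gzip.open(path, "rt")` if the manifests are gzipped), writing each yielded string followed by a newline.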