```python
import requests


def save(url):
    # Default to http:// when the caller omits a scheme.
    url = url if '://' in url else 'http://' + url
    r = requests.get('http://web.archive.org/save/%s' % url)
    if 'X-Archive-Wayback-Runtime-Error' in r.headers:
        return {
            'error': r.headers['X-Archive-Wayback-Runtime-Error']
        }
    content_location = r.headers.get('content-location', url)
    if 'x-archive-wayback-liveweb-error' in r.headers:
        # `core` is the pragma app's own module, which provides HTTPException.
        raise core.HTTPException(r.headers['x-archive-wayback-liveweb-error'],
                                 r.status_code)
    protocol = 'https' if 'https://' in content_location else 'http'
    uri = content_location.split('://')[1]
    # str.index() raises ValueError rather than returning None when '/' is
    # absent, so use find() and compare against -1 instead.
    slash = uri.find('/')
    path = uri[slash:] if slash != -1 else '/'
    return {
        'date': r.headers['date'],
        'protocol': protocol,
        'domain': uri.split('/')[0],
        'path': path,
        'id': content_location
    }
```
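For illustration, a hypothetical call and the rough shape of the value it returns. The actual date and content-location come from Wayback's response headers at request time; the timestamp below is a placeholder, not real output:

```python
result = save('example.com')
# Roughly:
# {'date': '<Date response header>',
#  'protocol': 'http',
#  'domain': 'web.archive.org',
#  'path': '/web/<timestamp>/http://example.com',
#  'id': 'http://web.archive.org/web/<timestamp>/http://example.com'}
```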
Scope
A daily cron job that runs @jjjake's derive module to generate genomes for books, then, for each resulting genome.json, test-fetches (curls) any URLs it contains and archives them in the Wayback Machine. This is also related to #51 and https://github.com/internetarchive/openlibrary/issues/8756, since the same job could likely also handle TOC identification and extraction.
The hard part is discovering URLs, but @finnless has a dump of all the book_genome.json files and can answer questions about figuring out which ones contain URLs!
URLs should be preserved as soon as they are found.
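As a rough sketch only: a job along these lines could walk the dump, pull out candidate URLs, test-fetch them, and push live ones through `save()` above. The `GENOME_DIR` path, the file layout, and both helper names here are assumptions for illustration, not part of the derive module's actual interface:

```python
import re
import requests
from pathlib import Path

# Hypothetical location of the dumped book_genome.json files.
GENOME_DIR = Path('/data/book_genomes')

# Loose URL matcher; genome files may bury URLs in arbitrary string fields.
URL_RE = re.compile(r'https?://[^\s"\'<>]+')


def extract_urls(path):
    """Yield every http(s) URL found anywhere in a genome JSON file."""
    yield from URL_RE.findall(path.read_text(errors='ignore'))


def is_alive(url, timeout=10):
    """Test-fetch a URL (the 'curl' step) before archiving it."""
    try:
        r = requests.head(url, allow_redirects=True, timeout=timeout)
        return r.status_code < 400
    except requests.RequestException:
        return False


def run():
    for genome in GENOME_DIR.glob('*/book_genome.json'):
        for url in set(extract_urls(genome)):
            if is_alive(url):
                print(genome, save(url))  # save() as defined above


if __name__ == '__main__':
    run()
```

Scanning the raw JSON text rather than walking a known schema sidesteps the open question of which fields actually carry URLs.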
Reference code:
pragma.archivelab.org/pragma/api/pragmas.py, lines 51 to 72 (the `save()` helper quoted at the top of this issue), at https://github.com/ArchiveLabs/pragma.archivelab.org/commit/7587f6de4cd380bfb10ba56cea376e38d7124ec5