Open-Book-Genome-Project / sequencer

A toolchain of tasks for sequencing and fingerprinting book fulltext
https://bookgenomeproject.org

Save extracted URLs in Wayback Machine #73

Open finnless opened 3 years ago

finnless commented 3 years ago

URLs should be preserved in the Wayback Machine as soon as they are found.

Reference code:

pragma.archivelab.org/pragma/api/pragmas.py Lines 51 to 72 in https://github.com/ArchiveLabs/pragma.archivelab.org/commit/7587f6de4cd380bfb10ba56cea376e38d7124ec5

import requests

# Note: `core` (used for HTTPException below) is defined elsewhere in the
# source repository and is not part of this excerpt.

def save(url):
    # Default to http:// when the URL has no scheme.
    url = url if '://' in url else 'http://' + url
    r = requests.get('http://web.archive.org/save/%s' % url)
    if 'X-Archive-Wayback-Runtime-Error' in r.headers:
        return {
            'error': r.headers['X-Archive-Wayback-Runtime-Error']
        }
    print(r.headers)
    # Fall back to the requested URL when no Content-Location header is returned.
    content_location = r.headers.get('content-location', url)
    if 'x-archive-wayback-liveweb-error' in r.headers:
        raise core.HTTPException(r.headers['x-archive-wayback-liveweb-error'],
                                 r.status_code)
    protocol = 'https' if 'https://' in content_location else 'http'
    uri = content_location.split("://")[1]
    # str.index raises ValueError when '/' is absent, so test membership instead.
    path = uri[uri.index('/'):] if '/' in uri else '/'
    return {
        'date': r.headers['date'],
        'protocol': protocol,
        'domain': uri.split('/')[0],
        'path': path,
        'id': content_location
    }
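
To make "as soon as they are found" concrete, here is a minimal sketch of how the sequencer might hand extracted URLs to a `save`-style function like the one above. The names `normalize_url`, `archive_urls`, `save_fn`, and `delay_seconds` are hypothetical and not part of the sequencer or pragma codebases. The save function is passed in as a parameter so the loop can be tried without network access.

```python
import time
from typing import Callable, Dict, Iterable


def normalize_url(url: str) -> str:
    """Add a default scheme when the URL has none (same rule as the reference code)."""
    url = url.strip()
    return url if "://" in url else "http://" + url


def archive_urls(
    urls: Iterable[str],
    save_fn: Callable[[str], dict],
    delay_seconds: float = 0.0,
) -> Dict[str, dict]:
    """Call save_fn once per distinct URL and collect the results.

    save_fn is assumed to behave like the reference save(): it returns a dict
    or raises on failure. Failures are recorded instead of aborting the run.
    """
    results: Dict[str, dict] = {}
    for raw in urls:
        url = normalize_url(raw)
        if url in results:
            continue  # already handled in this run
        try:
            results[url] = save_fn(url)
        except Exception as exc:  # one bad URL should not stop the whole run
            results[url] = {"error": str(exc)}
        if delay_seconds:
            time.sleep(delay_seconds)  # pause between requests to the archive
    return results
```

Calling `archive_urls(extracted_urls, save, delay_seconds=1.0)` would then submit each distinct URL once. The one-second pause is only an illustrative value, not a documented rate limit.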

Scope

mekarpeles commented 3 years ago

The hard part is discovering URLs, but @finnless has a dump of all the book_genome.json files and can answer questions about figuring out which ones contain URLs!
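
As a starting point for the discovery step, here is a rough sketch that scans a parsed book_genome.json for anything that looks like an http(s) URL. It assumes nothing about the file's schema: it walks every key and string value, so it should work whatever the layout is. The function names and the regular expression are illustrative assumptions, and the pattern is deliberately simple, so expect some false positives and misses.

```python
import json
import re
from typing import Any, Iterator, Set

# Simple heuristic: an http(s) URL runs until whitespace, a quote, or a closing bracket.
URL_RE = re.compile(r"https?://[^\s\"'<>)\]]+")


def iter_strings(node: Any) -> Iterator[str]:
    """Yield every string found in a parsed JSON value, including dict keys."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for key, value in node.items():
            yield from iter_strings(key)
            yield from iter_strings(value)
    elif isinstance(node, (list, tuple)):
        for item in node:
            yield from iter_strings(item)


def extract_urls(genome: Any) -> Set[str]:
    """Return the distinct URL-like substrings found anywhere in the structure."""
    found: Set[str] = set()
    for text in iter_strings(genome):
        for match in URL_RE.findall(text):
            found.add(match.rstrip(".,;:"))  # drop trailing sentence punctuation
    return found


def urls_in_file(path: str) -> Set[str]:
    """Load one JSON file from disk and extract its URLs."""
    with open(path, encoding="utf-8") as handle:
        return extract_urls(json.load(handle))
```

Running `urls_in_file` over each file in the dump and keeping the non-empty results would give a first answer to which genomes contain URLs.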