Open-Book-Genome-Project / sequencer

A toolchain of tasks for sequencing and fingerprinting book fulltext

URLs should be preserved as soon as they are found.

Reference code:

pragma.archivelab.org/pragma/api/pragmas.py Lines 51 to 72 in https://github.com/ArchiveLabs/pragma.archivelab.org/commit/7587f6de4cd380bfb10ba56cea376e38d7124ec5

import requests  # `core` below is the pragma project's own module

def save(url):
    # Default to http when the caller passes a bare domain.
    url = url if '://' in url else 'http://' + url
    r = requests.get('http://web.archive.org/save/%s' % url)
    if 'X-Archive-Wayback-Runtime-Error' in r.headers:
        return {
            'error': r.headers['X-Archive-Wayback-Runtime-Error']
        }
    print(r.headers)
    content_location = r.headers.get('content-location', url)
    if 'x-archive-wayback-liveweb-error' in r.headers:
        raise core.HTTPException(r.headers['x-archive-wayback-liveweb-error'],
                                 r.status_code)
    protocol = 'https' if 'https://' in content_location else 'http'
    uri = content_location.split("://")[1]
    # str.index() raises ValueError when '/' is missing, so use find() instead.
    slash = uri.find('/')
    path = uri[slash:] if slash != -1 else '/'
    return {
        'date': r.headers['date'],
        'protocol': protocol,
        'domain': uri.split('/')[0],
        'path': path,
        'id': content_location
    }
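
The parsing at the end of save() can be pulled into a pure helper so it can be checked without a network call. This is only a sketch: `split_content_location` is a hypothetical name that does not exist in pragma, and it assumes the `content-location` value embeds the archived URL (for example a `/web/<timestamp>/<url>` style path).

```python
def split_content_location(content_location):
    """Split a value that embeds a URL into protocol, domain, and path.

    Hypothetical helper that mirrors the string handling in save().
    """
    protocol = 'https' if 'https://' in content_location else 'http'
    # Everything after the first '://' is "<domain>/<path>".
    _, _, uri = content_location.partition('://')
    domain, slash, rest = uri.partition('/')
    # partition() yields an empty separator when '/' is absent,
    # so a bare domain falls back to the root path.
    path = slash + rest if slash else '/'
    return {'protocol': protocol, 'domain': domain, 'path': path}
```

Using `partition` rather than `index` means a value with no path component yields `/` without raising an exception.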

Scope

A daily cron job that runs @jjjake's derive module to generate genomes for books. For each resulting genome.json, the job should test (curl) any URLs it contains and then archive them in the Wayback Machine. This is also related to #51 and https://github.com/internetarchive/openlibrary/issues/8756, as the same job could likely also handle TOC identification and extraction.
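
The URL discovery step of such a job might look like the sketch below. The layout of genome.json is not described in this issue, so the code makes no assumption about field names: it walks whatever parsed JSON it is given and collects strings that look like URLs. `find_urls` and `URL_PATTERN` are hypothetical names, and the regular expression is deliberately simple.

```python
import re

# Simple pattern: an http(s) scheme followed by characters that are not
# whitespace, quotes, or angle brackets. An assumption, not a full URL grammar.
URL_PATTERN = re.compile(r'https?://[^\s"\'<>]+')


def find_urls(node):
    """Recursively collect URL-looking strings from parsed JSON values."""
    found = set()
    if isinstance(node, str):
        for match in URL_PATTERN.findall(node):
            # Drop punctuation that commonly trails a URL in running text.
            found.add(match.rstrip('.,;:)'))
    elif isinstance(node, dict):
        # Only values are scanned; keys are ignored.
        for value in node.values():
            found |= find_urls(value)
    elif isinstance(node, list):
        for item in node:
            found |= find_urls(item)
    return found
```

Returning a set removes duplicates, so each URL would be tested and archived only once per genome.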

Open Library has ~1M books containing URLs: https://openlibrary.org/search/inside?q=http%3A%2F%2F&mode=everything
We have a project called the Open Book Genome Project (bookgenomeproject.org), which uses a sequencer bot to read books and produce "book_genome" files with insights.
Here are ~13k books that have book_genomes: https://archive.org/search.php?query=format%3Abook_genome
For one of these items, with ID isbn_9791570594938, you can view its genome at: https://archive.org/download/isbn_9791570594938/book_genome.json
Within that book_genome.json, we can see that 2 URLs were discovered in the book.
The IA tool can be used to download book_genome.json files for items: https://archive.org/services/docs/api/internetarchive/internetarchive.html
There is a feature called Save Page Now, which can be used to save these URLs in the Wayback Machine: https://blog.archive.org/2019/10/23/the-wayback-machines-save-page-now-is-new-and-improved
I've also created an API that can be used programmatically to accomplish this: https://github.com/ArchiveLabs/pragma.archivelab.org/blob/master/pragma/api/pragmas.py#L51-L72
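
The two endpoints mentioned above can be wired together with the standard library alone. This is a rough sketch, not a finished job: the function names are hypothetical, the download URL follows the pattern shown for isbn_9791570594938, and the save URL follows the `web.archive.org/save/` form used in the reference code. A real job would likely need retries and throttling.

```python
import json
import urllib.request


def genome_url(identifier):
    """Build the book_genome.json download URL for an archive.org item."""
    return f'https://archive.org/download/{identifier}/book_genome.json'


def save_page_now_url(url):
    """Build the Save Page Now request URL, defaulting to http like save()."""
    url = url if '://' in url else 'http://' + url
    return 'http://web.archive.org/save/' + url


def fetch_genome(identifier, timeout=30):
    """Download and parse one item's book_genome.json (network call)."""
    with urllib.request.urlopen(genome_url(identifier), timeout=timeout) as resp:
        return json.load(resp)


def archive(url, timeout=60):
    """Ask the Wayback Machine to save a URL; returns the HTTP status code."""
    with urllib.request.urlopen(save_page_now_url(url), timeout=timeout) as resp:
        return resp.status
```

For example, `fetch_genome('isbn_9791570594938')` would return the parsed genome, and each URL found in it could then be passed to `archive`.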

The hard part is discovering URLs, but @finnless has a dump of all the book_genome.json files and can answer questions about how to figure out which ones contain URLs.

Save extracted URL's in Wayback Machine #73