
Hathitrust Bookworm

Code for setting up a Hathitrust full-text Bookworm using the HathiTrust Research Center's Extracted Features dataset, the HTRC Metadata API, and the Hathitrust Hathifiles.

This repository is still in development and being documented; for assistance setting it up, contact organis2@illinois.edu.

Process

Bookworm needs the following information for indexing: a word list, per-document token counts, and document metadata.

Because of the scale of the HathiTrust collection, most of these files need custom preparation outside of Bookworm's general-purpose indexing processes.

The general indexing process is outlined below.

Preparing Data

Multiple approaches are possible. The approach currently being taken is #2, but both are worth considering. At present, we save raw token counts in HDF5 (using the PyTables 'table' format through Pandas) while building the tokenlist, then encode the counts by iterating through the HDF5 store in chunks.
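The chunked encoding step can be sketched as follows. This is a minimal pure-Python illustration: the word list, token IDs, and chunk contents are all hypothetical, and the in-memory chunks stand in for iterating an HDF5 store (e.g. pandas `HDFStore.select(..., chunksize=...)`).

```python
# Hypothetical word list built earlier: token -> integer ID.
# In the real pipeline this would come from the completed tokenlist.
wordlist = {"the": 1, "whale": 2, "ship": 3}

def encode_chunk(chunk):
    """Encode one chunk of raw (doc_id, token, count) rows,
    dropping tokens that are not in the word list."""
    return [(doc_id, wordlist[token], count)
            for doc_id, token, count in chunk
            if token in wordlist]

# Stand-in for reading the HDF5 store in chunks.
raw_chunks = [
    [("doc1", "the", 10), ("doc1", "whale", 3), ("doc1", "zzquork", 1)],
    [("doc2", "ship", 4), ("doc2", "the", 7)],
]

encoded = []
for chunk in raw_chunks:
    encoded.extend(encode_chunk(chunk))
```

Encoding chunk-by-chunk keeps memory use bounded even when the full doc-token table is far larger than RAM.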

1. Two-pass: wordlist, then pre-culled doc-token counts

Write speed is a limiting factor when saving doc-token counts, so knowing beforehand which tokens we won't keep can save time overall, even if it means reading through the collection twice.
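The two-pass idea can be sketched as follows. This is a pure-Python illustration with made-up documents and a hypothetical frequency cutoff; the real input would be the HTRC feature files.

```python
from collections import Counter

# Stand-in for the collection: doc_id -> token counts.
docs = {
    "doc1": {"the": 10, "whale": 3, "hapax1": 1},
    "doc2": {"the": 7, "ship": 4, "hapax2": 1},
}

MIN_COUNT = 2  # hypothetical cutoff for the culled word list

# Pass 1: read everything once to build the global word list.
totals = Counter()
for counts in docs.values():
    totals.update(counts)
keep = {tok for tok, n in totals.items() if n >= MIN_COUNT}

# Pass 2: write only the doc-token counts we will keep,
# avoiding the write cost of tokens culled in pass 1.
kept_rows = [(doc, tok, n)
             for doc, counts in docs.items()
             for tok, n in counts.items()
             if tok in keep]
```

The second pass never pays the write cost for rare tokens, which is the point of culling first.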

Steps

Note: this could instead use a global word list (e.g., from the Google Ngrams dataset) and skip step 1.

  1. Single pass against the feature files
    • Read feature files, saving both word-list info and "raw" token counts (all tokens, unencoded)
    • Encode token counts
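The single-pass variant can be sketched as follows. The JSON layout below is a simplified stand-in for the per-page `tokenPosCount` structure in the Extracted Features files, not the exact schema, and the volume contents are made up.

```python
from collections import Counter

# Simplified stand-in for one Extracted Features volume:
# per-page token -> part-of-speech -> count.
volume = {
    "id": "doc1",
    "pages": [
        {"tokenPosCount": {"the": {"DT": 5}, "whale": {"NN": 2}}},
        {"tokenPosCount": {"the": {"DT": 3}}},
    ],
}

def volume_counts(vol):
    """Collapse page-level POS counts into one Counter per volume."""
    counts = Counter()
    for page in vol["pages"]:
        for token, pos_counts in page["tokenPosCount"].items():
            counts[token] += sum(pos_counts.values())
    return counts

# Single pass: update the global word list and save the raw
# (unencoded) token counts at the same time.
wordlist = Counter()
raw_tokencounts = {}
counts = volume_counts(volume)
wordlist.update(counts)
raw_tokencounts[volume["id"]] = counts
```

Everything needed for encoding is accumulated in one read of the collection; the trade-off is that the raw (uncrculled) counts must be stored until the word list is final.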