Bookworm-project / BookwormDB

Tools for text tokenization and encoding
MIT License

Allow resuming of unigram ingest #135

Closed organisciak closed 7 years ago

organisciak commented 7 years ago

In create_unigram_book_counts, there are four steps:

  1. Drop unigram table, if exists
  2. Create unigram table and DISABLE KEYS
  3. LOAD DATA INFILE staged unigrams into database
  4. Finish up with ENABLE KEYS
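
For readers unfamiliar with the pattern, the four steps can be sketched as a list of MySQL statements. This is only an illustration, not the actual body of create_unigram_book_counts: the table name `unigram_counts`, its columns, and the staging path are hypothetical placeholders, and `DISABLE KEYS` / `ENABLE KEYS` are assumed to be applied to a MyISAM table, where they defer non-unique index building.

```python
def unigram_ingest_statements(table="unigram_counts",
                              infile="/tmp/staged_unigrams.txt"):
    """Return (step_number, sql) pairs for the four-step ingest.

    The table name, column layout, and infile path are hypothetical;
    only the statement types mirror the steps listed in the issue.
    """
    return [
        # Step 1: start from a clean slate.
        (1, f"DROP TABLE IF EXISTS {table}"),
        # Step 2: create the table, then defer index maintenance.
        (2, f"CREATE TABLE {table} ("
            "bookid MEDIUMINT UNSIGNED NOT NULL, "
            "wordid MEDIUMINT UNSIGNED NOT NULL, "
            "word_count MEDIUMINT UNSIGNED NOT NULL, "
            "INDEX (bookid, wordid)) ENGINE=MYISAM"),
        (2, f"ALTER TABLE {table} DISABLE KEYS"),
        # Step 3: bulk-load the staged unigram counts.
        (3, f"LOAD DATA INFILE '{infile}' INTO TABLE {table}"),
        # Step 4: build the deferred indexes (the slow part on big tables).
        (4, f"ALTER TABLE {table} ENABLE KEYS"),
    ]


if __name__ == "__main__":
    for step, sql in unigram_ingest_statements():
        print(f"step {step}: {sql}")
```

Because the index build is deferred to step 4, an ingest interrupted during step 3 leaves a partially loaded table with no usable indexes, which is why being able to skip steps 1 and 4 on a rerun matters.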

I've been tinkering with flags that allow a big ingest to be partially completed and picked up later.

In my current edits, all of these steps still run by default, but flags can select parts of the process: --no-delete skips step 1, --no-close skips step 4, and --close-only runs only step 4 (superseding the other flags). Step 2 otherwise always runs, but is updated to create the table "IF NOT EXISTS".
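
A minimal sketch of how that flag logic might be wired up, assuming argparse and hypothetical step names; the real BookwormDB code may structure this differently. It also assumes that --close-only skips step 2 as well, since it is described as doing only step 4.

```python
import argparse


def build_parser():
    """Parser exposing the three flags described in this comment."""
    parser = argparse.ArgumentParser(description="Resumable unigram ingest (sketch)")
    parser.add_argument("--no-delete", action="store_true",
                        help="skip step 1 (keep the existing table)")
    parser.add_argument("--no-close", action="store_true",
                        help="skip step 4 (leave keys disabled)")
    parser.add_argument("--close-only", action="store_true",
                        help="run only step 4; supersedes the other flags")
    return parser


def select_steps(no_delete=False, no_close=False, close_only=False):
    """Return the ordered step names to run (names are hypothetical)."""
    if close_only:
        # Supersedes every other flag: only finish the indexes.
        return ["enable_keys"]
    steps = []
    if not no_delete:
        steps.append("drop_table")
    # Safe to run on every pass because of IF NOT EXISTS.
    steps.append("create_table_if_not_exists")
    steps.append("load_data")
    if not no_close:
        steps.append("enable_keys")
    return steps


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(select_steps(args.no_delete, args.no_close, args.close_only))
```

Under these assumptions, a large ingest could be split into a first batch run with --no-close, further batches run with --no-delete --no-close, and a final pass with --close-only.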

Is that an acceptable approach?

bmschmidt commented 7 years ago

I endorse this approach, thanks for working it out. Instead of calling it "close" I might call it "index" or something similar, to make it clearer why the stage takes so long.

organisciak commented 7 years ago

Renamed to index. Also, I had a typo in my --index-only code, which meant I deleted a 5m book table when I tried to index it - and the code is better for it!