This bundles several changes that were made around the same time:
- Fixing the broken logging, as per #133.
- Adding `--index-only`, `--no-index`, and `--no-delete` flags to `bookworm prep database_wordcounts`, resolving #135 (and fixing one bug that came up).
- Two small improvements: `db.query()` supports `executemany` calls, and there is a backup process for writing CSV files to the DB from Python if LOAD DATA INFILE fails. I'm not sure when this would be useful other than after a permission error; I wrote it for some benchmarking and figured it could be kept in as a failsafe.
- Support for ingest from `h5` files. This looks for a table called `unigrams` inside the file, writes a set of temporary CSVs in parallel, then uses LOAD DATA INFILE. I opted for H5 because it's well supported in Pandas and supports 'blosc', a fast compression algorithm. I tried to keep this code as simple as possible, since it would have been easy to over-engineer.
- I started generalizing `create_unigram_book_counts`, with the eventual goal of converting it into a `create_book_counts_table` method that both `create_unigram_book_counts` and `create_bigram_book_counts` can use. This relates to the discussion in #134. The updates above are currently specific to unigram tables (my use case), so this generalization will let bigram indexes keep pace.
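
For reviewers, here is a rough sketch of the shape of the chunked-CSV ingest and the Python fallback described above. It is not the code in this PR: it uses `sqlite3` as a stand-in for MySQL so it runs anywhere, and the names `write_chunk`, `write_chunks_parallel`, `load_csv`, the `counts` table, and its `bookid`/`wordid`/`count` columns are all made up for illustration. It also assumes the rows have already been read out of the `unigrams` table (for example with `pandas.read_hdf`), and it uses threads for the parallel write, which may not match the actual implementation.

```python
import csv
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_chunk(rows, directory, index):
    """Write one chunk of (bookid, wordid, count) rows to its own CSV file."""
    path = os.path.join(directory, "chunk_%05d.csv" % index)
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return path


def write_chunks_parallel(rows, directory, chunk_size=2, workers=4):
    """Split rows into fixed-size chunks and write the chunks concurrently."""
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(write_chunk, chunk, directory, i)
                   for i, chunk in enumerate(chunks)]
        # Collect results in submission order so paths stay sorted by index.
        return [future.result() for future in futures]


def load_csv(conn, path, bulk_load=None):
    """Try a bulk loader first; if it fails, insert the rows from Python."""
    if bulk_load is not None:
        try:
            bulk_load(conn, path)
            return "bulk"
        except Exception:
            # Real code should catch the driver's specific error class.
            pass
    with open(path, newline="") as handle:
        rows = [(int(b), int(w), int(c)) for b, w, c in csv.reader(handle)]
    conn.executemany("INSERT INTO counts VALUES (?, ?, ?)", rows)
    conn.commit()
    return "fallback"


def failing_bulk_load(conn, path):
    """Hypothetical bulk loader that simulates a permission error."""
    raise PermissionError("LOAD DATA INFILE not permitted")


# Stand-in for rows read from the `unigrams` table.
rows = [(1, 10, 3), (1, 11, 1), (2, 10, 7), (2, 12, 2), (3, 11, 5)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counts (bookid INTEGER, wordid INTEGER, count INTEGER)")

with tempfile.TemporaryDirectory() as tmp:
    paths = write_chunks_parallel(rows, tmp)
    methods = [load_csv(conn, path, failing_bulk_load) for path in paths]

total = conn.execute("SELECT SUM(count) FROM counts").fetchone()[0]
row_count = conn.execute("SELECT COUNT(*) FROM counts").fetchone()[0]
```

Because the simulated bulk loader always fails here, every chunk goes through the `executemany` path, which is the failsafe case mentioned in the third item.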