Open organisciak opened 7 years ago
It would be good for this method to include two variables that specify the number of bytes used to store the wordids and bookids.
Just sketching it out, something like this.
def create_wordcount_table(ngrams, wordid_bytes=3, bookid_bytes=3):
    """
    wordid_bytes: 3 or 4. The number of bytes used to store wordids; 3 reduces file sizes by 25%
    and may speed up queries, but limits the vocabulary to 16 million words.
    bookid_bytes: 3 or 4. The number of bytes used to store bookids; 3 reduces file sizes by 25%
    and may speed up queries, but limits the library to 16 million documents.
    """
    # Map the byte width to the matching MySQL unsigned integer type.
    vartypes = {3: "MEDIUMINT UNSIGNED", 4: "INT UNSIGNED"}
    table_string = "TABLE word1 {}, bookid {}, count MEDIUMINT UNSIGNED".format(vartypes[wordid_bytes], vartypes[bookid_bytes])
I know of one group that has hacked at the code to allow bookid to be an INT UNSIGNED rather than MEDIUMINT UNSIGNED, which is necessary if ingesting more than 16 million volumes. There is a little work that needs to be done in other places before this support is total, but it would be nice to lay the groundwork here.
A two-byte unsigned int goes to 65,535 and a one-byte unsigned int to 255. I can imagine a few cases where these might be useful if you're using a Bookworm to store named entities rather than actual words. But space is unlikely to be as big a deal in those cases as in the base one, so 3 and 4 are the only widths necessary to support.
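To make the limits concrete, here is a rough sketch of the arithmetic. The type names are the standard MySQL unsigned integer types; `max_unsigned` and `smallest_width` are hypothetical helper names for illustration, not anything that exists in the codebase.

```python
# MySQL unsigned integer types by storage size in bytes.
UNSIGNED_TYPES = {
    1: "TINYINT UNSIGNED",
    2: "SMALLINT UNSIGNED",
    3: "MEDIUMINT UNSIGNED",
    4: "INT UNSIGNED",
}


def max_unsigned(n_bytes):
    """Largest value an unsigned integer stored in n_bytes bytes can hold."""
    return 2 ** (8 * n_bytes) - 1


def smallest_width(max_id, supported=(3, 4)):
    """Hypothetical helper: smallest supported byte width that can hold max_id."""
    for width in sorted(supported):
        if max_id <= max_unsigned(width):
            return width
    raise ValueError("no supported width can store id {}".format(max_id))


if __name__ == "__main__":
    for width in sorted(UNSIGNED_TYPES):
        print(width, UNSIGNED_TYPES[width], max_unsigned(width))
```

With only 3 and 4 supported, anything past 16,777,215 ids has to use INT UNSIGNED, which is the case the group mentioned above ran into.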
create_unigram_book_counts and create_bigram_book_counts are redundant. Refactoring may make sense so that the updates made to one don't need to be copy-pasted. Ultimately, the functions are the same; only the arguments and naming differ.
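One possible shape for that refactor, as a sketch only: a single helper parameterized by n builds the column definitions, and the unigram and bigram cases become thin wrappers. The names `ngram_table_columns`, `unigram_columns` and `bigram_columns` are made up for illustration, the `word2` column name is an assumption, and the real functions presumably do more than build this string.

```python
# Same mapping as in the sketch above: byte width -> MySQL type.
VARTYPES = {3: "MEDIUMINT UNSIGNED", 4: "INT UNSIGNED"}


def ngram_table_columns(n, wordid_bytes=3, bookid_bytes=3):
    """Column definitions for an n-gram count table (n=1 unigrams, n=2 bigrams)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # One word column per position in the n-gram: word1, word2, ...
    columns = [
        "word{} {}".format(i, VARTYPES[wordid_bytes]) for i in range(1, n + 1)
    ]
    columns.append("bookid {}".format(VARTYPES[bookid_bytes]))
    columns.append("count MEDIUMINT UNSIGNED")
    return ", ".join(columns)


def unigram_columns(**kwargs):
    """Hypothetical thin wrapper for the unigram case."""
    return ngram_table_columns(1, **kwargs)


def bigram_columns(**kwargs):
    """Hypothetical thin wrapper for the bigram case."""
    return ngram_table_columns(2, **kwargs)


if __name__ == "__main__":
    print(unigram_columns())
    print(bigram_columns(bookid_bytes=4))
```

That way the byte-width arguments only need to be added in one place.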