Bookworm-project / BookwormDB

Tools for text tokenization and encoding
MIT License

Generalize unigram and bigram ingest methods #134

Open organisciak opened 7 years ago

organisciak commented 7 years ago

create_unigram_book_counts and create_bigram_book_counts are redundant. Refactoring them into a single function may make sense, so that updates made to one don't need to be copy-pasted into the other. Ultimately, the functions are the same; only the arguments and naming differ.
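To make the idea concrete, here is a minimal sketch of what a single generalized function could look like. It is not the existing BookwormDB code: the function name `ngram_count_table_sql`, the `table_name` argument, and the `word1..wordN` column layout are all assumptions for illustration. The only point is that the n-gram order becomes a parameter instead of being baked into two near-identical functions.

```python
def ngram_count_table_sql(table_name, n):
    """Build a CREATE TABLE statement for an n-gram count table.

    Hypothetical helper: the schema (bookid, word1..wordN, count) is an
    assumption, not taken from the current ingest code.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # One word column per position in the n-gram: word1, word2, ...
    word_columns = ["word{} MEDIUMINT UNSIGNED NOT NULL".format(i)
                    for i in range(1, n + 1)]
    columns = (["bookid MEDIUMINT UNSIGNED NOT NULL"]
               + word_columns
               + ["count MEDIUMINT UNSIGNED NOT NULL"])
    return "CREATE TABLE IF NOT EXISTS {} ({})".format(
        table_name, ", ".join(columns))


# The unigram and bigram cases differ only in their arguments.
unigram_sql = ngram_count_table_sql("unigram_counts", 1)
bigram_sql = ngram_count_table_sql("bigram_counts", 2)
```

With something along these lines, the two existing functions could become thin wrappers (or disappear), and a fix made once would apply to both.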

bmschmidt commented 7 years ago

It would be good for this method to take two parameters that specify the number of bytes used to store the wordids and bookids.

Just sketching it out, something like this.

    def create_wordcount_table(ngrams, wordid_bytes=3, bookid_bytes=3):
        """
        wordid_bytes: 3 or 4. The number of bytes used to store wordids; 3 reduces file sizes by 25%
                   and may speed up queries, but limits the vocabulary to 16 million words.
        bookid_bytes: 3 or 4. The number of bytes used to store bookids; 3 reduces file sizes by 25%
                   and may speed up queries, but limits the library to 16 million documents.
        """
        # Map each supported byte width to the matching MySQL unsigned integer type.
        vartypes = {3: "MEDIUMINT UNSIGNED", 4: "INT UNSIGNED"}
        table_string = "TABLE word1 {}, bookid {}, count MEDIUMINT UNSIGNED".format(vartypes[wordid_bytes], vartypes[bookid_bytes])
        return table_string

I know of one group that has hacked at the code to allow bookid to be an INT UNSIGNED rather than a MEDIUMINT UNSIGNED, which is necessary when ingesting more than 16 million volumes. There is a little work that needs to be done in other places before this support is total, but it would be nice to lay the groundwork here.

A two-byte int goes up to 65,535 and a one-byte int to 255. I can imagine a few cases where these might be useful if you're using a Bookworm to store named entities rather than actual words. But space is unlikely to be as big a deal in those cases as in the base one, so 3 and 4 are the only widths necessary to support.
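For reference, the limits discussed here follow directly from the byte width: an unsigned integer of n bytes can hold values up to 2^(8n) - 1. A quick sketch, where the helper name `max_unsigned` and the `MYSQL_UNSIGNED_TYPES` table are just illustrative names and not part of the codebase:

```python
# Standard MySQL unsigned integer types, keyed by storage size in bytes.
MYSQL_UNSIGNED_TYPES = {
    1: "TINYINT UNSIGNED",
    2: "SMALLINT UNSIGNED",
    3: "MEDIUMINT UNSIGNED",
    4: "INT UNSIGNED",
}


def max_unsigned(n_bytes):
    """Largest value an unsigned integer of n_bytes bytes can store."""
    return 2 ** (8 * n_bytes) - 1


for n_bytes in sorted(MYSQL_UNSIGNED_TYPES):
    print("{} bytes, {}: up to {:,}".format(
        n_bytes, MYSQL_UNSIGNED_TYPES[n_bytes], max_unsigned(n_bytes)))
```

So the 3-byte option tops out at 16,777,215 ids, which is the "16 million" ceiling mentioned above, and the 4-byte option raises that to roughly 4.29 billion.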