Generalize unigram and bigram ingest methods

It would be good for this method to take two variables that specify the number of bytes used to store the wordids and bookids.

Just sketching it out, something like this.

    def create_wordcount_table(ngrams, wordid_bytes=3, bookid_bytes=3):
        """
        wordid_bytes: 3 or 4. The number of bytes used to store wordids; 3 reduces file sizes by 25%
                   and may speed up queries, but limits the vocabulary to 16 million words.
        bookid_bytes: 3 or 4. The number of bytes used to store bookids; 3 reduces file sizes by 25%
                   and may speed up queries, but limits the library to 16 million documents.
        """
        # Map a byte width to the MySQL column type of that size.
        vartypes = {3: "MEDIUMINT UNSIGNED", 4: "INT UNSIGNED"}
        table_string = "TABLE word1 {}, bookid {}, count MEDIUMINT UNSIGNED".format(
            vartypes[wordid_bytes], vartypes[bookid_bytes])

I know of one group that has hacked the code to allow bookid to be an INT UNSIGNED rather than a MEDIUMINT UNSIGNED, which is necessary when ingesting more than 16 million volumes. A little work remains in other places before this support is complete, but it would be nice to lay the groundwork here.
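
One way the width could be chosen rather than hard-coded is to derive it from the size of the collection. This is only a sketch: `bytes_for_count` is a hypothetical helper, not something in BookwormDB, and it assumes ids are assigned starting at 1, so the largest id equals the number of items.

```python
def bytes_for_count(n_items):
    """Return the smallest supported id width (3 or 4 bytes) for n_items ids.

    Hypothetical helper, not part of BookwormDB. Assumes ids start at 1,
    so the largest id that must fit in the column is n_items itself.
    """
    for n_bytes in (3, 4):
        # An unsigned integer of n_bytes bytes holds values 0 .. 2**(8*n_bytes) - 1.
        if n_items <= 2 ** (8 * n_bytes) - 1:
            return n_bytes
    raise ValueError("{} ids do not fit in a 4-byte unsigned integer".format(n_items))


# A library of 20 million volumes needs the wider bookid column.
bookid_bytes = bytes_for_count(20000000)
```

The result could then be passed as `bookid_bytes` (or `wordid_bytes`) to a function like the one sketched above.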

A two-byte int goes up to 65,535 and a one-byte int to 255. I can imagine a few cases where these might be useful, for example if you're using a Bookworm to store named entities rather than actual words. But space is unlikely to be as big a deal in those cases as in the base one, so 3 and 4 are the only widths that need to be supported.
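
For reference, these limits follow from the fact that an n-byte unsigned integer tops out at 2^(8n) - 1. A quick check, pairing each width with the MySQL unsigned type of that size (the names `MYSQL_UNSIGNED_TYPES` and `max_unsigned` are just for illustration):

```python
# MySQL unsigned integer types, keyed by storage size in bytes.
MYSQL_UNSIGNED_TYPES = {
    1: "TINYINT UNSIGNED",
    2: "SMALLINT UNSIGNED",
    3: "MEDIUMINT UNSIGNED",
    4: "INT UNSIGNED",
}


def max_unsigned(n_bytes):
    """Largest value an unsigned integer of n_bytes bytes can hold."""
    return 2 ** (8 * n_bytes) - 1


for n_bytes in sorted(MYSQL_UNSIGNED_TYPES):
    print("{} byte(s): {:<18} max {:,}".format(
        n_bytes, MYSQL_UNSIGNED_TYPES[n_bytes], max_unsigned(n_bytes)))
```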

Bookworm-project / BookwormDB
