Bookworm-project / BookwormDB

Tools for text tokenization and encoding

Replace current tokenization with nltk.word_tokenize? #96

Closed bmschmidt closed 8 years ago

bmschmidt commented 8 years ago

Bookworm currently uses an elaborate tokenization regex designed to mimic the 2008 Google Ngrams release. Even the 2011 ngrams release no longer uses that particular set. So it might make sense to chuck all the custom backend code, lean more heavily on our existing nltk dependency, and just use its tokenization function nltk.word_tokenize(), which is closer to the 2011 ngrams on certain fronts (most notably, "don't" is tokenized to ["do", "n't"] and "Homer's" to ["Homer", "'s"]). I personally don't like this tokenization, but it seems standard enough at this point.
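
For reference, the behavior described above looks roughly like this (assuming nltk and its punkt tokenizer data are installed; the sample sentence is made up):

import nltk
nltk.word_tokenize("Homer's dog don't hunt.")
# ['Homer', "'s", 'dog', 'do', "n't", 'hunt', '.']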

If regex-based tokenization is still desired, that could be accomplished through nltk's RegexpTokenizer.
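
A minimal sketch of what that might look like (the pattern here is just a stand-in; Bookworm's actual regex is far more elaborate):

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"[\w']+")
tokenizer.tokenize("Homer's dog don't hunt.")
# ["Homer's", 'dog', "don't", 'hunt']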

Out of the box, this seems to work slightly worse on Chinese than the current regex-based system, but both are basically a disaster. Bookworm treats each sentence as a token, but at least splits on the U+3002 ideographic full stop character; nltk doesn't even extend that courtesy. But there's a possible nltk solution.
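
To make "at least splits on U+3002" concrete, here's a rough sketch of that fallback (the Chinese sentences are invented for illustration):

# -*- coding: utf-8 -*-
import re
text = u"第一句。第二句。"
[s for s in re.split(u"\u3002", text) if s]
# [u'第一句', u'第二句']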

Both work well enough on Russian with Cyrillic characters.

bmschmidt commented 8 years ago

Nope, this is a non-starter. I'm remembering why I have always shied away from nltk in the past: it's extraordinarily slow, about 5.5 times slower than Bookworm's tokenization.

Under the hood, the nltk tokenizer is a literal port into Python of the original sed Penn Treebank tokenizer script. It seems to carry crazy amounts of overhead: it does a bunch of replacements on the original string, and then splits on whitespace.
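
Schematically, the replace-then-split pattern looks like this (not nltk's actual code, just an illustration of why each rule means another full pass over the string):

import re

def treebank_style_tokenize(text):
    # pad punctuation with spaces, one pass over the string per rule
    text = re.sub(r"([.,!?;])", r" \1 ", text)
    # peel off contractions and possessives the same way
    text = re.sub(r"(n't)\b", r" \1", text)
    text = re.sub(r"('s)\b", r" \1", text)
    # the real tokenizer applies dozens of these substitutions, then splits
    return text.split()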

Here's a test with three runs of the old tokenizer and the nltk one. The thing being tokenized is "Huckleberry Finn," rammed against itself 3 times to make it longer.

That usually takes about 0.7 seconds for the Bookworm tokenizer and 3.9 seconds for the nltk one. We can't afford to give up that much time.

def test_1():
    # current Bookworm regex tokenizer
    import bookwormDB.tokenizer
    text = open("/Users/bschmidt/twain.txt").read() * 3
    bookwormDB.tokenizer.tokenizer(text).tokenize()

def test_2():
    # nltk.word_tokenize-based tokenizer
    import bookwormDB.tokenizer
    text = open("/Users/bschmidt/twain.txt").read() * 3
    bookwormDB.tokenizer.tokenizer(text).tokenize_with_nltk()

In [62]: start = time.time(); test_1(); print time.time() - start;
0.721720933914

In [63]: start = time.time(); test_1(); print time.time() - start;
0.708900928497

In [64]: start = time.time(); test_1(); print time.time() - start;
0.712187051773

In [65]: start = time.time(); test_2(); print time.time() - start;
3.91569685936

In [66]: start = time.time(); test_2(); print time.time() - start;
3.9383020401

In [67]: start = time.time(); test_2(); print time.time() - start;
3.96205592155