Closed: dginev closed this pull request 8 years ago
I'll leave the PR open for a day, if there are no objections I'll merge it in master.
Ok, finally passed travis. An unreasonable amount of work was needed to adapt to libxml2's inconsistencies in preserving spaces on older distros.
That said, massaging out the differences also led to catching another rare edge case in the sentence tokenization.
The "bag of tokens" extractor is currently running on mercury and looks quite stable 200,000 documents in, so I will merge here. Hope it won't introduce any conflicts with the other project branches!
One improvement for the next time the token model is generated: move to a parallel design with one reader per thread, a mutex-protected read queue, and a mutex-protected write buffer.
The current implementation running on mercury shows a light IO load, so there is room for improvement.
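The parallel design above could look roughly like the following sketch in Rust. This is only an illustration, not the actual implementation; the function name, the synthetic "documents", and the per-document workload are all hypothetical stand-ins:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Drain `n_docs` synthetic documents with `n_threads` workers that share
/// a mutex-protected read queue and a mutex-protected write buffer.
/// (Names and workload are illustrative, not from the PR.)
fn run_extractor(n_docs: usize, n_threads: usize) -> usize {
    let read_queue = Arc::new(Mutex::new(
        (0..n_docs).map(|i| format!("doc_{}", i)).collect::<Vec<_>>(),
    ));
    let write_buffer: Arc<Mutex<Vec<(String, usize)>>> =
        Arc::new(Mutex::new(Vec::new()));

    let mut handles = Vec::new();
    for _ in 0..n_threads {
        let queue = Arc::clone(&read_queue);
        let buffer = Arc::clone(&write_buffer);
        handles.push(thread::spawn(move || loop {
            // Take the next document under the queue mutex; the guard is
            // dropped before the (potentially slow) per-document work.
            let doc = match queue.lock().unwrap().pop() {
                Some(d) => d,
                None => break, // queue drained; worker exits
            };
            // Stand-in for tokenization / bag-of-words extraction.
            let token_count = doc.len();
            // Append the result under the write-buffer mutex.
            buffer.lock().unwrap().push((doc, token_count));
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    let results = write_buffer.lock().unwrap();
    results.len()
}

fn main() {
    println!("processed {} documents", run_extractor(20, 4));
}
```

Locking only around the pop and the push keeps the workers out of each other's way while the actual per-document processing runs in parallel.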
I started this branch intending to reimplement the GloVe model, but settled on a much humbler first step: adding an example capable of generating a bag-of-words token model for a corpus.
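For reference, the core of such a bag-of-words pass is just a token-count map over the corpus. A minimal sketch, with a hypothetical `token_bag` helper and whitespace tokenization standing in for the real pipeline:

```rust
use std::collections::HashMap;

/// Count token occurrences over a slice of documents.
/// (Hypothetical helper; the PR's actual example likely reads a corpus
/// from disk and uses a proper tokenizer rather than whitespace splits.)
fn token_bag<'a>(corpus: &[&'a str]) -> HashMap<&'a str, usize> {
    let mut bag = HashMap::new();
    for doc in corpus {
        for token in doc.split_whitespace() {
            *bag.entry(token).or_insert(0) += 1;
        }
    }
    bag
}

fn main() {
    let bag = token_bag(&["the cat sat", "the dog ran"]);
    println!("{:?}", bag);
}
```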
I also added a first stab at an ngrams module, and turned on warnings for missing documentation, so that we have an incentive to add more comments. Once the warnings are gone I intend to switch them to hard errors.
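In Rust terms, that warning-then-error plan corresponds to the built-in `missing_docs` lint level in the crate root (a sketch of the mechanism, not a quote from the branch):

```rust
// In lib.rs: warn on any public item that lacks a doc comment.
#![warn(missing_docs)]

// Once the warnings are cleared, tighten the same lint to a hard error:
// #![deny(missing_docs)]
```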