KWARC / llamapun

common language and mathematics processing algorithms, in Rust
https://kwarc.info/systems/llamapun/
GNU General Public License v3.0

Corpus bag of tokens example; docs and cleanup #2

Closed dginev closed 8 years ago

dginev commented 8 years ago

I started this branch intending to reimplement the GloVe model, but settled for a much humbler first step: an example that generates a bag of tokens for a corpus.
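For readers unfamiliar with the idea, a bag of tokens is just a frequency table over every token in the corpus. The following is a minimal sketch, not the example added in this PR: the function name `bag_of_tokens` is hypothetical, and the lowercase whitespace split is an assumed stand-in for the library's real tokenization.

```rust
use std::collections::HashMap;

/// Counts how often each token occurs across all documents of a corpus.
///
/// Assumption for illustration: tokens are produced by a naive lowercase
/// whitespace split, not by llamapun's actual tokenizer.
pub fn bag_of_tokens<'a, I>(documents: I) -> HashMap<String, usize>
where
    I: IntoIterator<Item = &'a str>,
{
    let mut counts = HashMap::new();
    for document in documents {
        for token in document.split_whitespace() {
            // Insert the token with a count of 0 if unseen, then increment.
            *counts.entry(token.to_lowercase()).or_insert(0) += 1;
        }
    }
    counts
}
```

Calling `bag_of_tokens(vec!["the cat sat", "The cat ran"])` yields one entry per distinct token, with "the" and "cat" each counted twice.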

I also added a first stab at an ngrams module, and turned on warnings for missing documentation, so that we get a first incentive to add more comments. When the warnings are gone I intend to switch this to hard errors.
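As a rough illustration of both points, the sketch below shows a documented n-gram helper. The function name `ngrams` and its signature are hypothetical and may not match the module in this branch; the comments show the standard `missing_docs` lint attributes that would sit at the crate root.

```rust
// At the crate root (lib.rs), missing documentation is reported as a warning with:
//     #![warn(missing_docs)]
// Once the warnings are gone, the same lint can be made a hard error with:
//     #![deny(missing_docs)]

/// Returns every contiguous window of `n` tokens, joined by single spaces.
///
/// Yields an empty vector when `n` is zero or exceeds the number of tokens.
pub fn ngrams(tokens: &[&str], n: usize) -> Vec<String> {
    // `windows` panics on a size of zero, so handle that case explicitly.
    if n == 0 {
        return Vec::new();
    }
    tokens.windows(n).map(|window| window.join(" ")).collect()
}
```

With the lint enabled, removing the `///` comment from a public item such as `ngrams` would trigger a warning (or a build failure under `deny`).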

dginev commented 8 years ago

I'll leave the PR open for a day; if there are no objections, I'll merge it into master.

dginev commented 8 years ago

Ok, finally passed Travis. An unreasonable amount of work was needed to adapt to libxml2's inconsistencies in preserving spaces on older distros.

That being said, I think massaging the differences also led to catching another rare edge case in the sentence tokenization.

The "bag of tokens" extractor is currently running on mercury, and looks quite stable 200,000 documents in, so I will merge here. Hope it won't introduce any conflicts with the other project branches!

dginev commented 8 years ago

One thing that should be improved before the next time the token model is generated: move to a parallel model with one reader per thread, a mutex-guarded read queue, and a mutex-guarded write buffer.

The current implementation running on mercury puts only a light load on I/O, so there is room for improvement.
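A minimal sketch of what that parallel model could look like, using only the standard library. Everything here is an assumption for illustration: the function name `parallel_token_counts` is hypothetical, in-memory strings stand in for corpus documents, and a whitespace split stands in for real tokenization.

```rust
use std::collections::{HashMap, VecDeque};
use std::sync::{Arc, Mutex};
use std::thread;

/// Counts tokens across `documents` using `workers` threads (at least one).
pub fn parallel_token_counts(documents: Vec<String>, workers: usize) -> HashMap<String, usize> {
    // Shared read queue: each worker pops its next document under the lock.
    let queue = Arc::new(Mutex::new(VecDeque::from(documents)));
    // Shared write buffer: workers merge their local counts into it.
    let totals: Arc<Mutex<HashMap<String, usize>>> = Arc::new(Mutex::new(HashMap::new()));

    let handles: Vec<_> = (0..workers.max(1))
        .map(|_| {
            let queue = Arc::clone(&queue);
            let totals = Arc::clone(&totals);
            thread::spawn(move || {
                let mut local: HashMap<String, usize> = HashMap::new();
                loop {
                    // The guard is a temporary, so the queue lock is released
                    // at the end of this statement, before any processing.
                    let next = queue.lock().unwrap().pop_front();
                    let document = match next {
                        Some(document) => document,
                        None => break,
                    };
                    for token in document.split_whitespace() {
                        *local.entry(token.to_string()).or_insert(0) += 1;
                    }
                }
                // Merge once per worker to keep contention on the buffer low.
                let mut totals = totals.lock().unwrap();
                for (token, count) in local {
                    *totals.entry(token).or_insert(0) += count;
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().expect("worker thread panicked");
    }

    // Bind to a local so the lock guard is dropped before `totals` is.
    let result = totals.lock().unwrap().clone();
    result
}
```

Each worker accumulates counts locally and touches the shared buffer only once, so the two mutexes are held briefly. In a real run the queue would more likely hold file paths, with each worker doing its own reading and parsing.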