KWARC / llamapun

common language and mathematics processing algorithms, in Rust
https://kwarc.info/systems/llamapun/
GNU General Public License v3.0

Senna integration #1

Closed by jfschaefer 8 years ago

jfschaefer commented 8 years ago

Integrated Senna into the iterators in data.rs.

dginev commented 8 years ago

Actually, a high-level comment: the difference between iterating over the sentences inside paragraph nodes, as opposed to all XML nodes in the document, is non-trivial.

I worry about including things like bibliographies, heading titles, etc. in the tokenized content. Whether they should be used or omitted, and why, is a linguistics question, so ideally our library will let the user make that choice explicitly. You could do that via the DNM options, but for that we should maybe add a corpus-level entry that stores them globally for all iterated content.
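
To make the suggestion concrete, here is a rough sketch of what a corpus-level entry could look like. Everything in it is an assumption for illustration: `Node`, `ContentOptions`, `Corpus`, `paragraphs` and `sample_document` are hypothetical names on a toy tree, not the actual types in data.rs or the real DNM options. The only point is that the skip list is stored once on the corpus and applied to every document that is iterated.

```rust
use std::collections::HashSet;

/// Toy stand-in for an XML node (hypothetical; not the libxml node type).
#[derive(Debug, Clone)]
pub struct Node {
    pub name: String,
    pub text: String,
    pub children: Vec<Node>,
}

impl Node {
    pub fn new(name: &str, text: &str, children: Vec<Node>) -> Self {
        Node { name: name.to_string(), text: text.to_string(), children }
    }
}

/// Hypothetical corpus-level options: element names whose whole subtree
/// is left out of the iterated content.
#[derive(Debug, Clone, Default)]
pub struct ContentOptions {
    pub skipped_names: HashSet<String>,
}

/// Hypothetical corpus that stores the options once for all documents.
pub struct Corpus {
    pub options: ContentOptions,
    pub documents: Vec<Node>,
}

impl Corpus {
    /// Collects the text of every `p` node, in document order,
    /// ignoring subtrees rooted at a skipped element name.
    pub fn paragraphs(&self) -> Vec<&str> {
        let mut out = Vec::new();
        for doc in &self.documents {
            collect(doc, &self.options, &mut out);
        }
        out
    }
}

fn collect<'a>(node: &'a Node, options: &ContentOptions, out: &mut Vec<&'a str>) {
    if options.skipped_names.contains(&node.name) {
        return; // the whole subtree is excluded by the corpus-level choice
    }
    if node.name == "p" {
        out.push(node.text.as_str());
        return;
    }
    for child in &node.children {
        collect(child, options, out);
    }
}

/// A tiny made-up document: one section with a title and a paragraph,
/// followed by a bibliography that also contains a paragraph.
pub fn sample_document() -> Node {
    Node::new("document", "", vec![
        Node::new("section", "", vec![
            Node::new("title", "Introduction", vec![]),
            Node::new("p", "Main text.", vec![]),
        ]),
        Node::new("bibliography", "", vec![
            Node::new("p", "A cited work.", vec![]),
        ]),
    ])
}
```

With default options, `paragraphs()` on a corpus holding `sample_document()` yields both paragraphs; after inserting `"bibliography"` into `skipped_names`, only the section paragraph remains. The heading title never shows up because only `p` nodes are collected, which is the paragraph-only iteration mentioned above.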

dginev commented 8 years ago

OK, merging here so that you can continue using master. I'll make a PR myself when the GloVe work is done.

dginev commented 8 years ago

Thanks for the PR!