Senna integration

jfschaefer commented 8 years ago

Integrated Senna into the iterators in data.rs.

dginev commented 8 years ago

Actually, a high-level comment: the difference between iterating over the sentences inside paragraph nodes and iterating over all XML nodes in the document is non-trivial.

I worry about including things like bibliographies, heading titles, etc. in the tokenized content. Whether they should be used or omitted, and why, is a linguistics question. So ideally our library will allow users to make that choice in a clear way. You could do that via the DNM options, but for that we should maybe allow a corpus-level entry that stores them globally for all iterated content.
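
To make the idea concrete, here is a minimal sketch of what a corpus-level setting could look like. Every name in it (`NodeKind`, `Node`, `CorpusOptions`, `Corpus::texts`) is hypothetical and invented for illustration. It is not llamapun's API and it does not touch the real DNM options. It only shows the options being stored once on the corpus and applied to all iterated content.

```rust
use std::collections::HashSet;

/// Hypothetical node kinds; a real document has far more structure.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum NodeKind {
    Paragraph,
    Heading,
    Bibliography,
}

/// Simplified stand-in for an XML node together with its text content.
#[derive(Debug, Clone)]
pub struct Node {
    pub kind: NodeKind,
    pub text: String,
}

/// Hypothetical corpus-level options: set once, used by every iteration.
#[derive(Debug, Clone)]
pub struct CorpusOptions {
    /// Node kinds whose text is left out of the iterated content.
    pub skip: HashSet<NodeKind>,
}

impl CorpusOptions {
    /// Keep paragraph text only; drop headings and bibliographies.
    pub fn paragraphs_only() -> Self {
        CorpusOptions {
            skip: vec![NodeKind::Heading, NodeKind::Bibliography]
                .into_iter()
                .collect(),
        }
    }

    /// Keep the text of every node kind.
    pub fn everything() -> Self {
        CorpusOptions { skip: HashSet::new() }
    }
}

/// A toy corpus: each document is a flat list of nodes.
pub struct Corpus {
    pub options: CorpusOptions,
    pub documents: Vec<Vec<Node>>,
}

impl Corpus {
    /// Iterate over the text of all nodes that the options do not skip.
    pub fn texts(&self) -> impl Iterator<Item = &str> + '_ {
        self.documents
            .iter()
            .flatten()
            .filter(move |node| !self.options.skip.contains(&node.kind))
            .map(|node| node.text.as_str())
    }
}
```

With this shape, the include-or-omit decision is made in one visible place (`Corpus::options`) rather than at each call site, so a sentence iterator and a word iterator built on top of `texts` would see the same content.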

dginev commented 8 years ago

OK, merging here so that you can continue using master. I'll make a PR myself when the GloVe work is done.

dginev commented 8 years ago

Thanks for the PR!

KWARC / llamapun

Senna integration #1