jfschaefer closed 8 years ago
Actually, a high-level comment: iterating over the sentences inside paragraph nodes is non-trivially different from iterating over all XML nodes in the document.
I worry about including things like bibliographies, heading titles, etc. in the tokenized content. Whether they should be used or omitted, and why, is a linguistics question. So ideally our library will allow the user to make that choice in a clear way. You could do that via the DNM options, but for that we should maybe allow a corpus-level entry that stores them globally for all iterated content.
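To make the idea concrete, here is a rough sketch of what a corpus-level entry could look like. Everything in it is hypothetical: `NodeKind`, `CorpusOptions`, `Corpus`, `included_text` and `sample_corpus` are made-up names for illustration and are not the library's actual DNM options or iterator API. The point is only that the include/omit choice is stored once on the corpus and applied to everything that gets iterated.

```rust
use std::collections::HashSet;

/// Hypothetical node kinds, standing in for the XML elements of a document.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum NodeKind {
    Paragraph,
    Heading,
    Bibliography,
}

/// Hypothetical corpus-level options: the node kinds to leave out of the
/// tokenized content. An empty set means "iterate over everything".
#[derive(Debug, Clone, Default)]
struct CorpusOptions {
    skip: HashSet<NodeKind>,
}

/// Hypothetical corpus that stores the options once, globally, instead of
/// passing them to every iterator separately.
struct Corpus {
    options: CorpusOptions,
    nodes: Vec<(NodeKind, String)>,
}

impl Corpus {
    /// Yield the text of every node that the corpus options do not exclude.
    fn included_text(&self) -> impl Iterator<Item = &str> + '_ {
        self.nodes
            .iter()
            .filter(move |(kind, _)| !self.options.skip.contains(kind))
            .map(|(_, text)| text.as_str())
    }
}

/// Build a tiny in-memory corpus (made-up content) with the given skip list.
fn sample_corpus(skip: &[NodeKind]) -> Corpus {
    Corpus {
        options: CorpusOptions {
            skip: skip.iter().copied().collect(),
        },
        nodes: vec![
            (NodeKind::Heading, "1 Introduction".to_string()),
            (NodeKind::Paragraph, "Sentences live here.".to_string()),
            (NodeKind::Bibliography, "[1] Some reference".to_string()),
        ],
    }
}
```

With `sample_corpus(&[])` all three nodes are iterated; with `sample_corpus(&[NodeKind::Heading, NodeKind::Bibliography])` only the paragraph text is. Either way the linguistic decision is stated explicitly in one place.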
OK, merging here so that you can continue using master. I'll make a PR myself when the GloVe work is done.
Thanks for the PR!
integrated senna into the iterators in data.rs