Corpus.findTokens and Corpus.findTokensWithin need ignorePunctuation

cite-architecture / ohco2

A cross-platform library for working with collections of texts in the OHCO2 model

https://cite-architecture.github.io/ohco2/

GNU General Public License v3.0


Corpus.findTokens and Corpus.findTokensWithin need ignorePunctuation #48

Closed: Eumaeus closed this issue 7 years ago

Eumaeus commented 7 years ago

Testing on Vector("Gyges","Ardys") fails because the text has "Ardys the son of Gyges," so the token in the text is "Gyges," (with the trailing comma), not "Gyges".

This is absolutely correct according to the documentation, which specifies white space as a delimiter, but it will confuse people.
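To make the failure concrete, here is a minimal sketch in plain Scala. It does not call the ohco2 API; the object name and the exact-match check are assumptions made for illustration, standing in for whatever Corpus.findTokens does internally.

```scala
// Hypothetical illustration only; this is not the ohco2 implementation.
object WhitespaceTokenDemo {
  val passage = "Ardys the son of Gyges,"

  // Tokenize strictly on white space, as the documentation specifies.
  val tokens: Vector[String] = passage.split("\\s+").toVector

  // Assumed matching rule: every search term must equal some token exactly.
  val searchTerms = Vector("Gyges", "Ardys")
  val allFound: Boolean = searchTerms.forall(t => tokens.contains(t))

  def main(args: Array[String]): Unit = {
    println(tokens)   // the last token is "Gyges," with the comma attached
    println(allFound) // false: "Gyges" does not equal "Gyges,"
  }
}
```

The comma stays attached to the last token, so an exact comparison against "Gyges" never succeeds.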

neelsmith commented 7 years ago

This needs a more serious design review to consider how best to organize a Corpus DSL.

neelsmith commented 7 years ago

Moving this to new milestone for DSL redesign

neelsmith commented 7 years ago

Implemented, including two new shorthand functions: findWSTokens, which matches on "white-space delimited" tokenization, and findWordTokens, which matches on "word" tokens (white-space delimited, ignoring punctuation).
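The difference between the two tokenizations can be sketched in plain Scala as follows. The helper names wsTokens, wordTokens, and containsAll are hypothetical and are not the signatures of findWSTokens or findWordTokens; the sketch also assumes that "ignoring punctuation" means dropping ASCII punctuation characters, which may not match what the library does.

```scala
// Hypothetical sketch of the two tokenizations; not the ohco2 API.
object TokenMatchSketch {
  // "White-space delimited" tokens: punctuation stays attached to the word.
  def wsTokens(text: String): Vector[String] =
    text.split("\\s+").filter(_.nonEmpty).toVector

  // "Word" tokens: white-space delimited, then punctuation removed.
  // Assumption: \p{Punct} covers ASCII punctuation only, so other
  // punctuation marks (e.g. the Greek ano teleia) would need extra handling.
  def wordTokens(text: String): Vector[String] =
    wsTokens(text).map(_.replaceAll("\\p{Punct}", "")).filter(_.nonEmpty)

  // True when every search term equals some token exactly.
  def containsAll(tokens: Vector[String], terms: Vector[String]): Boolean =
    terms.forall(t => tokens.contains(t))

  def main(args: Array[String]): Unit = {
    val passage = "Ardys the son of Gyges,"
    val terms = Vector("Gyges", "Ardys")
    println(containsAll(wsTokens(passage), terms))   // false
    println(containsAll(wordTokens(passage), terms)) // true
  }
}
```

Under these assumptions, the search from the original report fails on white-space tokens and succeeds on word tokens.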