Closed jtauber closed 3 years ago
Oh, that's a great catch! I was wondering why some punctuation was tagged wrong in the confusion matrix. It'll be fixed in the next commit.
Maybe allow the use of a custom tokenizer?
I found …
in Shepherd of Hermas (27.3.1). There are also colons in there but they're in the Latin sections.
Personally, I would just preprocess. In many cases the text will be in XML or some other format anyway, so it will require preprocessing. My run on John's Gospel involved preprocessing (although in that case it was just concatenating an existing tokenization into a single string for the book).
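For what it's worth, that kind of preprocessing step can be as small as this (a hypothetical sketch; the `preprocess` helper and the sample tokens are illustrative, not the actual code or data from the run):

```python
def preprocess(tokens):
    """Concatenate a pre-tokenized book back into a single string.

    Hypothetical helper: joins tokens with spaces so the result can be
    fed to a tool that expects raw text rather than a token list.
    """
    return " ".join(tokens)


# Illustrative tokens, e.g. as loaded from a token-per-line file
tokens = ["Ἐν", "ἀρχῇ", "ἦν", "ὁ", "λόγος"]
text = preprocess(tokens)
print(text)  # Ἐν ἀρχῇ ἦν ὁ λόγος
```

Obviously real preprocessing (XML extraction, handling punctuation spacing) would be more involved, but the point is it stays outside the tagger itself.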
That's fair, although using custom tokenizers seems pretty common practice in ML.
Yes, but that's in large part because they're dealing with much more text and aren't as interested in spending a lot of time on any one text (unlike us :-)).
e.g. from John 1.38 in MorphGNT SBLGNT I get: