Open danyaljj opened 7 years ago
I'm asking for this because @Slash0BZ is trying to apply some fixes, and I want to make sure we're not breaking anything.
The last version of the tokenizer that had tests (for which we used the MASC corpus) is here: https://gitlab-beta.engr.illinois.edu/cogcomp/illinois-tokenizer/tree/master
@mssammon Is this data public, or proprietary?
@Slash0BZ could you monitor the progress on this data, while you're fixing the tokenizer issues you had?
My current approach is to add more exceptions in TokenizerStateMachine, at the point where it checks whether a "." character marks the end of a sentence. I will monitor the progress as mentioned above.
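For illustration, here is a minimal sketch of this kind of exception check, written in Python for brevity. The actual TokenizerStateMachine rules are not shown in this thread, so the abbreviation list and the names `ABBREVIATIONS`, `is_sentence_end`, and `sentence_end_positions` are all hypothetical.

```python
# Hypothetical exception list; the real tokenizer's rules may differ.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "inc", "etc", "e.g", "i.e"}


def is_sentence_end(text: str, i: int) -> bool:
    """Return True if the '.' at text[i] plausibly ends a sentence."""
    if text[i] != ".":
        raise ValueError("position i must point at a '.' character")

    # Find the word immediately before the period (back to the last whitespace).
    start = i
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    word = text[start:i].lower()

    # Exception 1: known abbreviations such as "Dr." or "e.g."
    if word in ABBREVIATIONS:
        return False
    # Exception 2: single-letter initials such as "J." in "J. Smith"
    if len(word) == 1 and word.isalpha():
        return False
    # Exception 3: a decimal point between two digits, as in "3.50"
    if 0 < i < len(text) - 1 and text[i - 1].isdigit() and text[i + 1].isdigit():
        return False
    return True


def sentence_end_positions(text: str) -> list[int]:
    """Collect the offsets of all periods judged to end a sentence."""
    return [i for i, ch in enumerate(text) if ch == "." and is_sentence_end(text, i)]
```

Each new exception is one more early `return False`, which is also why regression tests matter here: every added rule can silently change the outcome for sentences that used to split correctly.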
On a related note: you could check whether using the AnnotatorFixer for ACE helps correct sentence boundaries. It lets you use the Entity view to fix sentence boundaries. I just realized @mssammon added this as part of the XMLTextAnnotation changes. It might help.
@mssammon We briefly discussed having tests / evaluations for the tokenizer. Any thoughts on how hard or easy that might be? If we have the data, I can have a look.
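As a rough idea of what such an evaluation could look like, the sketch below compares predicted tokens against gold tokens by character span and reports precision, recall, and F1. It is Python for brevity and assumes the gold data is available as a plain list of token strings per text; the function names `token_spans` and `span_prf` are made up for this example.

```python
def token_spans(text, tokens):
    """Map each token to its (start, end) character offsets, scanning left to right."""
    spans = []
    pos = 0
    for tok in tokens:
        start = text.find(tok, pos)
        if start < 0:
            raise ValueError(f"token {tok!r} not found after offset {pos}")
        end = start + len(tok)
        spans.append((start, end))
        pos = end
    return spans


def span_prf(gold_spans, predicted_spans):
    """Precision, recall, and F1 over exactly matching token spans."""
    gold_set, pred_set = set(gold_spans), set(predicted_spans)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: the prediction wrongly splits "Dr." into "Dr" and ".".
text = "Dr. Smith left."
gold = token_spans(text, ["Dr.", "Smith", "left", "."])
pred = token_spans(text, ["Dr", ".", "Smith", "left", "."])
precision, recall, f1 = span_prf(gold, pred)
```

Running a score like this before and after each tokenizer fix would show whether a new exception helps overall or only moves errors around.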