CogComp / cogcomp-nlp

CogComp's Natural Language Processing Libraries and Demos: Modules include lemmatizer, ner, pos, prep-srl, quantifier, question type, relation-extraction, similarity, temporal normalizer, tokenizer, transliteration, verb-sense, and more.

http://nlp.cogcomp.org/

Evaluating tokenizer? #480

Open danyaljj opened 7 years ago

danyaljj commented 7 years ago

@mssammon We briefly discussed having tests / evaluations for the tokenizer. Any thoughts on how hard or easy that might be? If we have the data, I can have a look.

danyaljj commented 7 years ago

The reason I'm asking is that @Slash0BZ is trying to apply some fixes, and I want to make sure we're not breaking anything.

mssammon commented 7 years ago

The last version of the tokenizer that had tests (for which we used the MASC corpus) is here: https://gitlab-beta.engr.illinois.edu/cogcomp/illinois-tokenizer/tree/master
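
For illustration only, here is a rough sketch of what a span-based check against gold tokenization could look like. It is written in Python rather than the library's Java, and `whitespace_tokenize`, `score`, and the gold spans are all made up for the example; nothing here is taken from the old test suite or the MASC data format.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def whitespace_tokenize(text: str) -> List[Span]:
    """Hypothetical stand-in tokenizer: splits on whitespace only."""
    spans, start = [], None
    for i, ch in enumerate(text):
        if ch.isspace():
            if start is not None:
                spans.append((start, i))
                start = None
        elif start is None:
            start = i
    if start is not None:
        spans.append((start, len(text)))
    return spans


def score(predicted: List[Span], gold: List[Span]) -> Tuple[float, float, float]:
    """Precision, recall and F1 over exactly matching token spans."""
    pred_set, gold_set = set(predicted), set(gold)
    correct = len(pred_set & gold_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


text = "Dr. Smith arrived."
# Made-up gold spans for illustration: "Dr.", "Smith", "arrived", "."
gold = [(0, 3), (4, 9), (10, 17), (17, 18)]
precision, recall, f1 = score(whitespace_tokenize(text), gold)
```

Comparing character spans rather than token strings keeps the score independent of how each tokenizer normalizes text. Running the same scoring before and after a change would show whether a fix regresses anything.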

danyaljj commented 7 years ago

@mssammon Is this data public, or proprietary?

@Slash0BZ could you monitor the progress on this data, while you're fixing the tokenizer issues you had?

Slash0BZ commented 7 years ago

My current approach is to add more exceptions in TokenizerStateMachine, at the point where it checks whether a "." character marks the end of a sentence. I will monitor the progress on the data as mentioned above.
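
As a rough illustration of the exception-list idea (a Python sketch, not the actual TokenizerStateMachine code), the check could look something like the following. The `ABBREVIATIONS` set, the `is_sentence_end` function, and the specific rules are assumptions made up for this example.

```python
# Hypothetical exception list; real entries would come from the failing cases.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "e.g", "i.e", "etc"}


def is_sentence_end(text: str, i: int) -> bool:
    """Return True if the '.' at text[i] plausibly ends a sentence."""
    if text[i] != ".":
        return False
    # Find the word immediately before the period (back to previous whitespace).
    start = i
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    if text[start:i].lower() in ABBREVIATIONS:
        return False  # known abbreviation, so the period is not a boundary
    # A period between two digits ("3.50") is not a boundary either.
    if 0 < i < len(text) - 1 and text[i - 1].isdigit() and text[i + 1].isdigit():
        return False
    rest = text[i + 1:]
    if not rest.strip():
        return True  # nothing but whitespace follows
    # Otherwise require whitespace followed by an uppercase letter.
    return rest[0].isspace() and rest.lstrip()[0].isupper()


text = "Dr. Smith paid 3.50 dollars. Then he left."
boundaries = [i for i, ch in enumerate(text) if is_sentence_end(text, i)]
```

On this sample, only the periods after "dollars" and "left" are reported as boundaries. Each new exception can change behavior elsewhere, which is why a regression check against gold data matters.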

bhargav commented 7 years ago

On a related note: you can check whether using the AnnotatorFixer for ACE helps correct sentence boundaries. You can use the Entity view to fix sentence boundaries. I just realized @mssammon added this as part of the XMLTextAnnotation changes. Might help.