CogComp / cogcomp-nlp

CogComp's Natural Language Processing Libraries and Demos: Modules include lemmatizer, ner, pos, prep-srl, quantifier, question type, relation-extraction, similarity, temporal normalizer, tokenizer, transliteration, verb-sense, and more.
http://nlp.cogcomp.org/

Pipeline (Tokenizer) has issues with non-UTF-8 characters #594

Open danyaljj opened 6 years ago

danyaljj commented 6 years ago

I have seen the tokenizer fail on non-UTF-8 characters (e.g. "�" below):

val text = "Rendering software which cannot process a Unicode character appropriately often displays it as an open rectangle, or the Unicode \"replacement character\" (U+FFFD, �), to indicate the position of the unrecognized character. Some systems have made attempts to provide more information about such characters. The Apple's Last Resort font will display a substitute glyph indicating the Unicode range of the character, and the SIL International's Unicode Fallback font will display a box showing the hexadecimal scalar value of the character."

// AnnotationUtils.pipelineServerPOSTagger.annotate(text)  <---- doesn't work
val text2 = new String(text.getBytes("Windows-1252"), "UTF-8")
println(text2)
AnnotationUtils.pipelineServerPOSTagger.annotate(text2) // <----- this does work.

mssammon commented 6 years ago

We have some cleanup code for this kind of problem:

https://github.com/CogComp/cogcomp-nlp/blob/master/core-utilities/src/main/java/edu/illinois/cs/cogcomp/core/utilities/TextCleanerStringTransformation.java
https://github.com/CogComp/cogcomp-nlp/blob/master/core-utilities/src/main/java/edu/illinois/cs/cogcomp/core/utilities/StringTransformationCleanup.java

If these don't cover such cases, this is where the fixes should be added. We could, by default, run some cleanup as part of the pipeline main(), but I'm open to suggestions.
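
For illustration, a default cleanup step of the kind suggested above might look roughly like the sketch below. It uses only the JDK, and the object and method names (PreTokenizeCleanup, clean) are hypothetical; they are not part of cogcomp-nlp and do not reflect what the two linked classes actually do. The sketch assumes it is acceptable to swap each problematic character for a single space, which keeps character offsets unchanged.

```scala
// Hypothetical helper, NOT part of cogcomp-nlp: a minimal pre-tokenization cleanup.
object PreTokenizeCleanup {
  // U+FFFD, the Unicode replacement character shown as "�" in the report above.
  private val ReplacementChar: Char = 0xFFFD.toChar

  // Replace U+FFFD and non-whitespace control characters with a space.
  // One character in, one character out, so offsets into the text stay valid.
  def clean(text: String): String =
    text.map { c =>
      val isBadControl = Character.isISOControl(c) && !Character.isWhitespace(c)
      if (c == ReplacementChar || isBadControl) ' ' else c
    }
}
```

With something like this in place, the call from the original report would become AnnotationUtils.pipelineServerPOSTagger.annotate(PreTokenizeCleanup.clean(text)). Unlike the Windows-1252 round trip, which turns every unmappable character into "?" and can mangle accented letters when the bytes are decoded as UTF-8, this only touches the characters it names.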