Pipeline (Tokenizer) has issues with non-UTF-8 characters

CogComp / cogcomp-nlp

CogComp's Natural Language Processing Libraries and Demos: Modules include lemmatizer, ner, pos, prep-srl, quantifier, question type, relation-extraction, similarity, temporal normalizer, tokenizer, transliteration, verb-sense, and more.

I have run into the tokenizer failing on non-UTF-8 characters (e.g. the "�" in the text below):

val text = "Rendering software which cannot process a Unicode character appropriately often displays it as an open rectangle, or the Unicode \"replacement character\" (U+FFFD, �), to indicate the position of the unrecognized character. Some systems have made attempts to provide more information about such characters. The Apple's Last Resort font will display a substitute glyph indicating the Unicode range of the character, and the SIL International's Unicode Fallback font will display a box showing the hexadecimal scalar value of the character."

// Annotating the original text fails:
// AnnotationUtils.pipelineServerPOSTagger.annotate(text)

// Workaround: encode as Windows-1252, then decode the bytes as UTF-8.
val text2 = new String(text.getBytes("Windows-1252"), "UTF-8")
println(text2)
AnnotationUtils.pipelineServerPOSTagger.annotate(text2) // this works
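
To make it clearer what the workaround above does to the string, here is a minimal standalone sketch that does not touch cogcomp-nlp at all. `EncodingRoundTripDemo`, `roundTrip`, and `replaceReplacementChar` are hypothetical names introduced only for illustration. As far as I can tell, the round trip works here only because U+FFFD has no Windows-1252 encoding, so `getBytes` substitutes `?` for it; the same round trip appears to damage other non-ASCII characters such as "é", so a direct replacement may be the safer stopgap.

```scala
import java.nio.charset.{Charset, StandardCharsets}

object EncodingRoundTripDemo {
  // The workaround from the report, written with Charset objects.
  // getBytes(Charset) substitutes the charset's replacement byte ('?' for
  // Windows-1252) for any character it cannot encode, such as U+FFFD.
  def roundTrip(s: String): String =
    new String(s.getBytes(Charset.forName("windows-1252")), StandardCharsets.UTF_8)

  // Hypothetical alternative: replace U+FFFD directly and leave everything else alone.
  def replaceReplacementChar(s: String): String =
    s.replace('\uFFFD', '?')

  def main(args: Array[String]): Unit = {
    val sample = "replacement character (U+FFFD, \uFFFD)"
    println(roundTrip(sample))              // U+FFFD becomes '?'
    println(replaceReplacementChar(sample)) // same result for this ASCII-plus-U+FFFD input

    // Caveat: a character that Windows-1252 CAN encode (e.g. U+00E9) becomes a
    // single byte that is not valid UTF-8 on its own, so the decode step mangles it.
    val accented = "caf\u00e9"
    println(roundTrip(accented) == accented)              // false
    println(replaceReplacementChar(accented) == accented) // true
  }
}
```

If this reading is right, the underlying problem is the tokenizer's handling of U+FFFD itself rather than the encoding of the input, since U+FFFD is an ordinary code point in a Scala/Java string.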

Pipeline (Tokenizer) has issues with non-UTF-8 characters #594