PolMine / bignlp

Tools to process large corpora line-by-line and in parallel mode

Untokenizable token not previously encountered #39

Closed: ablaette closed this issue 1 year ago

ablaette commented 1 year ago

This is a warning I have been seeing recently. The character should be handled by the preprocessor.

[main] WARN edu.stanford.nlp.process.PTBLexer - Untokenizable:  (U+9D, decimal: 157)

ablaette commented 1 year ago

See https://codepoints.net/U+009D?lang=de for details. U+009D is the control character "Operating System Command". It carries no textual content, so it can be deleted safely before tokenization.
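
A minimal sketch of the kind of preprocessing step meant here, written in Python for illustration only. It is not the bignlp preprocessor; the function name `strip_osc` is hypothetical. It removes every occurrence of U+009D from a string and leaves all other characters untouched.

```python
# Hypothetical helper, not part of bignlp: drop U+009D before tokenization.
OSC = "\u009d"  # "Operating System Command" control character (decimal 157)


def strip_osc(text: str) -> str:
    """Return `text` with every U+009D character removed."""
    return text.replace(OSC, "")


if __name__ == "__main__":
    sample = "foo" + OSC + "bar"
    cleaned = strip_osc(sample)
    # The cleaned string no longer contains the control character.
    print(OSC in cleaned)
```

Applied to each line before it is passed to the tokenizer, a step like this should keep the PTBLexer from encountering the character at all.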