Open danyaljj opened 6 years ago
We have some cleanup code for this kind of problem:
https://github.com/CogComp/cogcomp-nlp/blob/master/core-utilities/src/main/java/edu/illinois/cs/cogcomp/core/utilities/TextCleanerStringTransformation.java
https://github.com/CogComp/cogcomp-nlp/blob/master/core-utilities/src/main/java/edu/illinois/cs/cogcomp/core/utilities/StringTransformationCleanup.java
If these don't cover such cases, this is where the fixes should be added. We could, by default, run some cleanup as part of the pipeline main(), but I'm open to suggestions.
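To make the "run some cleanup by default" idea concrete, here is a rough sketch of what a pre-tokenization pass could look like. It uses only the Java standard library; the class name `InputCleanupSketch` and both method names are hypothetical and are not part of cogcomp-nlp or of the two cleanup classes linked above. It assumes the input is meant to be UTF-8 and that dropping undecodable characters is acceptable.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not an existing cogcomp-nlp class.
final class InputCleanupSketch {

    // Decode raw bytes as UTF-8, turning each malformed sequence into U+FFFD
    // instead of throwing.
    static String decodeLenient(byte[] raw) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try {
            return decoder.decode(ByteBuffer.wrap(raw)).toString();
        } catch (CharacterCodingException e) {
            // Not expected with REPLACE actions, but decode() declares it.
            throw new IllegalStateException(e);
        }
    }

    // Drop U+FFFD, unpaired surrogates, and control characters other than
    // tab, newline, and carriage return.
    static String stripUndecodable(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints()
                .filter(cp -> cp != 0xFFFD)
                .filter(cp -> cp < Character.MIN_SURROGATE || cp > Character.MAX_SURROGATE)
                .filter(cp -> !Character.isISOControl(cp)
                        || cp == '\n' || cp == '\t' || cp == '\r')
                .forEach(out::appendCodePoint);
        return out.toString();
    }
}
```

A caller would run `stripUndecodable(decodeLenient(bytes))` before handing the text to the tokenizer. Note that deleting characters shifts offsets and can glue neighbouring tokens together, so replacing with a space instead may be the safer default if character offsets matter downstream.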
I have seen the tokenizer fail on non-UTF-8 characters (e.g. "�" below):