All input / output should be in Unicode

buruzaemon / natto-py

natto-py combines the Python programming language with MeCab, the part-of-speech and morphological analyzer for the Japanese language.

BSD 2-Clause "Simplified" License


All input / output should be in Unicode #5

Closed: buruzaemon closed this issue 9 years ago

buruzaemon commented 9 years ago

From "Porting your code to NLTK 3.0": ...

"NLTK3 requires all text input to be unicode and always return text as unicode"

Enhance natto-py under Python 2.7 so that it follows the same convention: all text input is unicode, and all text output is returned as unicode. Python 3 behavior should be consistent with this approach.

Originally opened 2014-11-12. This issue was ported from Bitbucket and is archived for historical reasons.

buruzaemon commented 9 years ago

Python 2.7 byte strings should be decoded using the "charset" (character encoding) that MeCab uses internally.
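As a rough illustration of the decoding step described here, a minimal sketch follows. The helper names `to_unicode` and `to_bytes` and the `charset` parameter are hypothetical and are not taken from natto-py; the sketch only assumes that the charset name passed in is one Python's codec registry recognizes.

```python
# -*- coding: utf-8 -*-
# Hypothetical helpers, not natto-py's actual API.


def to_unicode(value, charset="utf-8"):
    """Return value as unicode text, decoding byte strings with charset."""
    # On Python 2.7, `bytes` is an alias for `str`, so plain byte strings
    # are decoded while `unicode` objects pass through unchanged.
    if isinstance(value, bytes):
        return value.decode(charset)
    return value


def to_bytes(value, charset="utf-8"):
    """Return value as a byte string, encoding unicode text with charset."""
    if isinstance(value, bytes):
        return value
    return value.encode(charset)


# Bytes as they might arrive from a dictionary built with EUC-JP.
raw = u"日本語".encode("euc-jp")
text = to_unicode(raw, "euc-jp")
```

Decoding at the boundary like this keeps everything past that point in unicode, but it only works if the `charset` argument matches the one the MeCab dictionary was built with.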

This means that the user needs to know which "charset" MeCab is using. We might need to add a Wiki page on confirming the system dictionary charset from the command line.
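One possible way to confirm the charset is to read the dictionary information that `mecab -D` prints. The sketch below assumes that this output contains a `charset:` line; the function names and the sample text are illustrative only and are not part of natto-py.

```python
import subprocess


def parse_dictionary_charset(dictionary_info):
    """Return the value of the first 'charset:' line, or None if absent."""
    for line in dictionary_info.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "charset":
            return value.strip()
    return None


def mecab_dictionary_charset():
    """Run `mecab -D` and extract the charset (requires mecab on PATH)."""
    output = subprocess.check_output(["mecab", "-D"])
    # The dictionary path may contain non-ASCII bytes; replace them, since
    # only the charset line matters here.
    return parse_dictionary_charset(output.decode("ascii", "replace"))


# Illustrative text in the assumed output format, not captured from a real run.
sample = "filename:\t/path/to/sys.dic\ncharset:\tutf8\ntype:\t0\n"
charset = parse_dictionary_charset(sample)
```

The reported name may not be spelled the way Python's codec registry expects, so it is worth checking it with `codecs.lookup` before using it to decode.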

buruzaemon commented 9 years ago

Done in 0.0.6 release.

Resolved 2014-11-21. This issue was ported from Bitbucket and is archived for historical reasons.