ajdapretnar opened 4 years ago
Hi Orange text team. May I recommend a third option? I maintain an open-source multilingual NLP package called HanLP, backed by state-of-the-art deep learning techniques as well as efficient traditional ML models. HanLP has been widely used in academia and in production environments (see our citations and the projects using HanLP). Recently a user told me you are planning Chinese support, so I'd like to suggest a more advanced option. If you're interested, you can try out our demo.
@hankcs Thanks for the suggestion. I've already heard about HanLP. I'd love to try your demo, but I simply cannot make sense of it 😆 (I don't speak any Chinese). Would you perhaps be interested in submitting a PR? Namely, adding HanLP to Preprocess Text (perhaps it can even be a separate preprocessor)? We would need a Chinese speaker to write tests at least.
Sure, glad to help. Let's decide on the version first, since a new package means new dependencies. What kind of dependencies would you like to introduce?
Sorry this got on hold for such a long time. :( Not sure how we managed to forget about this issue.
I vote for HanLPerceptron, as using TensorFlow would add a large dependency for a single task.
Great, HanLPerceptron is a good choice. Let's see what needs to be done.
Basically, this would be a new Preprocessor; let's call it HanTokenization (feel free to come up with a more sensible name). It would be added to orangecontrib.text.preprocess and inherit from Preprocessor. I would not add it to Tokenizer, but make it a separate, special tokenizer. What do you think @PrimozGodec?
One downside is that HanLPerceptron doesn't seem to ship wheels. We need to make sure it can be installed on all platforms (Windows, macOS, Linux). If not, the user is responsible for installing it herself; when the dependency is present, the preprocessor can be used.
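The "use it only when the dependency is present" check can be sketched with a standard-library probe. A minimal sketch, assuming the importable module is named `hanlperceptron` (the actual name should be checked against the package):

```python
import importlib.util


def han_tokenizer_available() -> bool:
    """Return True if the optional HanLPerceptron dependency is importable.

    The module name "hanlperceptron" is an assumption; adjust it to
    whatever the published wheel actually exposes.
    """
    return importlib.util.find_spec("hanlperceptron") is not None
```

`find_spec` probes the import system without importing the package, so the check is cheap and safe to run at widget start-up.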
The preprocessor should simply set the corresponding functions and/or properties if necessary.
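The shape of such a preprocessor could look roughly like this. This is a stand-in sketch, not the Orange API: the real class would subclass Preprocessor from orangecontrib.text.preprocess, and the `Segmenter` entry point attributed to HanLPerceptron here is an assumption, not a verified API:

```python
class HanTokenization:
    """Sketch of the proposed Chinese tokenizer preprocessor.

    In Orange this would inherit from Preprocessor; here it is a plain
    class so the intended shape is visible without the Orange dependency.
    """

    name = "Chinese tokenization (HanLP)"

    def __init__(self):
        try:
            # "Segmenter" is an assumed HanLPerceptron entry point.
            from hanlperceptron import Segmenter
            self._segment = Segmenter().segment
        except ImportError:
            self._segment = None  # optional dependency missing

    def tokenize(self, text: str) -> list:
        """Segment one document into a list of Chinese tokens."""
        if self._segment is None:
            raise RuntimeError(
                "HanLPerceptron is not installed; install it to enable "
                "this preprocessor."
            )
        return self._segment(text)
```

The constructor-time fallback mirrors the plan above: the class always imports cleanly, but actually tokenizing without the dependency raises a clear error.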
Most importantly, tests should be added to orangecontrib.text.tests.test_preprocess.py to make sure the widget returns sensible results. Also, perhaps check the tokenizer in combination with other preprocessors, such as filtering and lowercasing, to make sure the output remains sensible for Chinese.
I believe the task is quite trivial, but good tests need to be written to ensure the results make sense.
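The kind of test meant here could be sketched as follows. It skips itself when HanLPerceptron is absent, and the `Segmenter` API name is an assumption that would need to match the real package:

```python
import importlib.util
import unittest

HAS_HANLP = importlib.util.find_spec("hanlperceptron") is not None


class TestHanTokenization(unittest.TestCase):
    @unittest.skipUnless(HAS_HANLP, "HanLPerceptron not installed")
    def test_segments_chinese(self):
        # "Segmenter" is an assumed entry point; adjust to the real API.
        from hanlperceptron import Segmenter
        tokens = Segmenter().segment("商品和服务")
        # A sensible segmenter returns several non-empty word tokens,
        # neither one token per character nor the whole string unsplit.
        self.assertGreater(len(tokens), 1)
        self.assertTrue(all(isinstance(t, str) and t for t in tokens))
```

A Chinese speaker should replace the generic sanity checks with assertions on known-good segmentations.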
Sounds good. I'll work together with the author of HanLPerceptron to have the wheels built and tested first.
Related to issue #781.
Hope this feature can be implemented ASAP. It's vital for Chinese text processing!
We are happy to accept contributions from the community. If you are willing to add a PR, we will review it with priority.
Chinese text needs a special kind of tokenization: it cannot simply be split on whitespace or into individual characters. It would be nice to add a separate module for segmenting Chinese text.
Option 1: NLTK with Stanford segmenter.
Option 2: Jieba.
I would try NLTK first to avoid introducing new dependencies, then fall back to Jieba if NLTK proves insufficient.
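For comparison, the Jieba route is essentially a one-liner per document. A minimal sketch, with Jieba treated as an optional dependency so the module still imports without it:

```python
try:
    import jieba  # optional dependency: pip install jieba
except ImportError:
    jieba = None


def tokenize_zh(text: str) -> list:
    """Segment Chinese text with Jieba, raising if it is unavailable."""
    if jieba is None:
        raise RuntimeError("jieba is not installed; pip install jieba")
    # lcut returns a plain list of tokens (cut returns a generator).
    return jieba.lcut(text)
```

Jieba is pure Python with no heavyweight dependencies, which is what makes it attractive as a fallback here.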