ajdapretnar opened 4 years ago
Hi Orange text team. May I recommend a third option? I maintain an open-source multilingual NLP package called HanLP, backed by state-of-the-art deep learning techniques as well as efficient traditional ML models. HanLP has been widely used in academia and in production environments (see our citations and the projects using HanLP). Recently a user told me you are planning Chinese support, so I'd like to suggest a more advanced option. If you're interested, you can try out our demo.
@hankcs Thanks for the suggestion. I've already heard about HanLP. I'd love to try your demo, but I simply cannot make sense of it 😆 (I don't speak any Chinese). Would you perhaps be interested in submitting a PR? Namely, adding HanLP to Preprocess Text (perhaps it can even be a separate preprocessor)? We would need a Chinese speaker to write tests at least.
Sure, glad to help. Let's decide on the version first, since a new package means new dependencies. What kind of dependencies would you like to introduce?
Sorry this got on hold for such a long time. :( Not sure how we managed to forget about this issue.
I vote for HanLPerceptron, as using TensorFlow would add a large dependency for a single task.
Great, HanLPerceptron is a good choice. Let's see what needs to be done.
Basically, this would be a new Preprocessor; let's call it HanTokenization (feel free to come up with a more sensible name). It would be added to orangecontrib.text.preprocess and inherit from Preprocessor. I would not add it to Tokenizer, but make it a separate, special tokenizer. What do you think @PrimozGodec?
One downside is that HanLPerceptron doesn't seem to ship wheels. We need to make sure it can be installed on all platforms (Windows, macOS, Linux). If not, the user is responsible for installing it herself; when the dependency is present, the preprocessor can be used.
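The "use it only when the dependency is present" check can be sketched with a standard-library probe. A minimal sketch, assuming the importable module is named `hanlperceptron` (the actual name should be checked against the package):

```python
import importlib.util


def han_tokenizer_available() -> bool:
    """Return True if the optional HanLPerceptron dependency is importable.

    The module name "hanlperceptron" is an assumption; adjust it to
    whatever the published wheel actually exposes.
    """
    return importlib.util.find_spec("hanlperceptron") is not None
```

`find_spec` probes the import system without importing the package, so the check is cheap and safe to run at widget start-up.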
The preprocessor should simply set the corresponding functions and/or properties if necessary.
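The shape of such a preprocessor could look roughly like this. This is a stand-in sketch, not the Orange API: the real class would subclass Preprocessor from orangecontrib.text.preprocess, and the `Segmenter` entry point attributed to HanLPerceptron here is an assumption, not a verified API:

```python
class HanTokenization:
    """Sketch of the proposed Chinese tokenizer preprocessor.

    In Orange this would inherit from Preprocessor; here it is a plain
    class so the intended shape is visible without the Orange dependency.
    """

    name = "Chinese tokenization (HanLP)"

    def __init__(self):
        try:
            # "Segmenter" is an assumed HanLPerceptron entry point.
            from hanlperceptron import Segmenter
            self._segment = Segmenter().segment
        except ImportError:
            self._segment = None  # optional dependency missing

    def tokenize(self, text: str) -> list:
        """Segment one document into a list of Chinese tokens."""
        if self._segment is None:
            raise RuntimeError(
                "HanLPerceptron is not installed; install it to enable "
                "this preprocessor."
            )
        return self._segment(text)
```

The constructor-time fallback mirrors the plan above: the class always imports cleanly, but actually tokenizing without the dependency raises a clear error.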
Most importantly, tests should be added to orangecontrib.text.tests.test_preprocess.py to make sure the widget returns sensible results. Also, perhaps check the tokenizer in combination with other preprocessors, such as filtering and lowercasing, to make sure the output remains sensible for Chinese.
I believe the task is quite trivial, but good tests need to be written to ensure the results make sense.
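The kind of test meant here could be sketched as follows. It skips itself when HanLPerceptron is absent, and the `Segmenter` API name is an assumption that would need to match the real package:

```python
import importlib.util
import unittest

HAS_HANLP = importlib.util.find_spec("hanlperceptron") is not None


class TestHanTokenization(unittest.TestCase):
    @unittest.skipUnless(HAS_HANLP, "HanLPerceptron not installed")
    def test_segments_chinese(self):
        # "Segmenter" is an assumed entry point; adjust to the real API.
        from hanlperceptron import Segmenter
        tokens = Segmenter().segment("商品和服务")
        # A sensible segmenter returns several non-empty word tokens,
        # neither one token per character nor the whole string unsplit.
        self.assertGreater(len(tokens), 1)
        self.assertTrue(all(isinstance(t, str) and t for t in tokens))
```

A Chinese speaker should replace the generic sanity checks with assertions on known-good segmentations.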
Sounds good. I'll work together with the author of HanLPerceptron to have the wheels built and tested first.
Related to issue #781.
Hope this feature can be implemented ASAP. It's vital for Chinese text processing!
We are happy to accept contributions from the community. If you are willing to add a PR, we will review it with priority.
Chinese text needs a special kind of tokenization: it cannot simply be split on whitespace or into individual characters. It would be nice to add a separate module for segmenting Chinese text.
Option 1: NLTK with Stanford segmenter.
Option 2: Jieba.
I would try NLTK first to avoid introducing new dependencies, then fall back to Jieba if NLTK proves insufficient.
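For comparison, the Jieba route is essentially a one-liner per document. A minimal sketch, with Jieba treated as an optional dependency so the module still imports without it:

```python
try:
    import jieba  # optional dependency: pip install jieba
except ImportError:
    jieba = None


def tokenize_zh(text: str) -> list:
    """Segment Chinese text with Jieba, raising if it is unavailable."""
    if jieba is None:
        raise RuntimeError("jieba is not installed; pip install jieba")
    # lcut returns a plain list of tokens (cut returns a generator).
    return jieba.lcut(text)
```

Jieba is pure Python with no heavyweight dependencies, which is what makes it attractive as a fallback here.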