[Feature] Add Chinese support via Jieba tokenizer

beancount / smart_importer

Augment Beancount importers with machine learning functionality.

MIT License


[Feature] Add Chinese support via Jieba tokenizer #115

Closed EINDEX closed 2 years ago

EINDEX commented 2 years ago

Hi Contributors,

I am Chinese and love Beancount, and I would like to use smart_importer to improve my experience when importing bank statements. I tried the tool, but written Chinese has no spaces or other breaks between words, so the SVM currently cannot analyze Chinese text. Here is an example Chinese sentence:

e.g. "我和小明一起吃晚饭。" ("I have dinner with Xiao Ming.")

Chinese text processing relies on a tokenizer to split sentences into words, so I used jieba, the most popular Chinese tokenizer, to add this functionality.

I am sure my code can make smart_importer smarter. I wrote some code and tests, but in a rough way.

If you have any suggestions or feedback, you are very welcome to write them here.
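As a rough illustration of the word-splitting problem described above, the following sketch segments the example sentence. It uses `jieba.cut` when jieba is installed; the per-character fallback and the helper name `tokenize` are assumptions made only so the snippet runs anywhere, and are not part of this PR.

```python
# Sketch only: shows why whitespace splitting fails on Chinese text.
try:
    import jieba  # third-party package: pip install jieba
except ImportError:
    jieba = None


def tokenize(text):
    """Return a list of tokens for `text` (hypothetical helper)."""
    if jieba is not None:
        # jieba.cut yields word segments in its default (accurate) mode.
        return list(jieba.cut(text))
    # Fallback for illustration only: one token per character.
    return list(text)


sentence = "我和小明一起吃晚饭。"
tokens = tokenize(sentence)

print(sentence.split())  # whitespace splitting leaves the sentence as one token
print(tokens)            # the tokenizer breaks it into several pieces
```

With whitespace splitting the whole sentence becomes a single feature, which gives a classifier very little to learn from.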

Thanks.

EINDEX commented 2 years ago

Hi, thanks for your suggestions. I will try this.

EINDEX commented 2 years ago

Hi @yagebu, your suggestion is great. I have implemented your idea.

johannesjh commented 2 years ago

Hi, this looks good. Thank you @yagebu for suggesting the solution where jieba is passed in as an API parameter. Thank you @EINDEX for implementing it this way.
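The general idea of passing a tokenizer in as a parameter can be sketched as follows. The class `BagOfWords` and its `string_tokenizer` argument are hypothetical names chosen for illustration; this is not smart_importer's actual code or API, only a minimal example of the pattern.

```python
from collections import Counter
from typing import Callable, List, Optional


class BagOfWords:
    """Hypothetical feature extractor that accepts a pluggable tokenizer."""

    def __init__(
        self, string_tokenizer: Optional[Callable[[str], List[str]]] = None
    ):
        # Default to whitespace splitting when no tokenizer is supplied.
        self.string_tokenizer = string_tokenizer or str.split

    def transform(self, text: str) -> Counter:
        """Count how often each token occurs in `text`."""
        return Counter(self.string_tokenizer(text))


# Default behaviour: whitespace-separated words.
default = BagOfWords()
print(default.transform("coffee shop coffee"))

# Custom behaviour: any callable from str to a list of tokens can be used.
# Here `list` splits per character; a jieba-based callable could be passed
# in the same way.
custom = BagOfWords(string_tokenizer=list)
print(custom.transform("晚饭"))
```

Keeping the tokenizer a plain callable means the library itself does not need to depend on jieba; only users who want Chinese support have to install it.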

I would like to suggest the following improvements for this PR:

Please add a test case.
Please add documentation, e.g. in the "documentation" section of the README file, explaining how to use this feature.

Thank you! Best regards, Johannes

EINDEX commented 2 years ago

Hi @johannesjh, I am adding documentation for this, but I don't know how to test this function, since the data processing happens inside the sklearn module. Could you suggest an approach?

johannesjh commented 2 years ago

Thank you for adding documentation.

As for the tests, @tarioch created regression tests, see tests/data_test.py. You could create an additional, similar test with Chinese test data. The additional test will require jieba to be installed; you can add it to tox.ini as a test dependency.

EINDEX commented 2 years ago

Hi @johannesjh, thanks very much for your suggestion.

I just found some free time to add test data. I think the string tokenizer is now ready for smart_importer.

johannesjh commented 2 years ago

Looks good to me, thank you @EINDEX.