Closed: @EINDEX closed this 2 years ago
Hi, thanks for your suggestions. I will try this.
Hi @yagebu, your suggestion worked well; I have implemented it.
Hi, this looks good. Thank you @yagebu for suggesting that jieba be passed in as an API parameter, and thank you @EINDEX for implementing it this way.
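For readers following along, the idea of passing the tokenizer in as a parameter might look roughly like this. This is a hypothetical sketch, not smart_importer's actual API; the `Predictor` class and parameter name are illustrative.

```python
# Hypothetical sketch: the tokenizer is a callable passed in by the caller,
# so the core library keeps no hard dependency on jieba.
from typing import Callable, List, Optional


def default_tokenizer(text: str) -> List[str]:
    """Whitespace tokenization, sufficient for most Western languages."""
    return text.split()


class Predictor:
    """Illustrative stand-in for smart_importer's predictor classes."""

    def __init__(self, string_tokenizer: Optional[Callable[[str], List[str]]] = None):
        # Fall back to plain whitespace splitting when no tokenizer is given.
        self.tokenizer = string_tokenizer or default_tokenizer

    def tokenize(self, text: str) -> List[str]:
        return self.tokenizer(text)


# A jieba-backed tokenizer would be supplied by the caller, e.g.:
#   import jieba
#   predictor = Predictor(string_tokenizer=lambda s: list(jieba.cut(s)))
predictor = Predictor()
print(predictor.tokenize("dinner with friends"))  # ['dinner', 'with', 'friends']
```

The design benefit is that English-only users install nothing extra, while Chinese users opt in by supplying a jieba-based callable.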
I would like to suggest the following improvements for this PR:
thank you! best regards, Johannes
Hi @johannesjh, I am adding documentation for this, but I don't know how to test this function. The data processing happens inside the sklearn pipeline. Could you suggest an approach?
Thank you for adding documentation.
As for the tests, @tarioch created regression tests, see tests/data_test.py. You could create an additional, similar test with Chinese test data. The additional test will require jieba to be installed; you can add it to tox.ini as a test dependency.
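Such a test might look roughly like the sketch below. This is hypothetical and only modeled on the idea of tests/data_test.py; the function name and the skip-if-missing guard are illustrative, and jieba would be declared as a test dependency in tox.ini.

```python
# Hypothetical regression-test sketch with Chinese test data.
# jieba is an optional test dependency; it would be added to tox.ini.
try:
    import jieba
except ImportError:
    jieba = None


def test_chinese_tokenization():
    """jieba should split a Chinese transaction description into words."""
    if jieba is None:
        return  # skip when the optional dependency is not installed
    tokens = list(jieba.cut("我和小明一起吃晚饭。"))
    # Segmentation must produce more than one token for a full sentence.
    assert len(tokens) > 1


test_chinese_tokenization()
```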
Hi @johannesjh, thanks very much for your suggestion.
I just got some free time to add test data. I think the string tokenizer is ready for smart_importer now.
Looks good to me, thank you @EINDEX
Hi Contributors,
I am Chinese and I love Beancount, and I want to use smart_importer to improve the experience of importing bank statements. I tried this tool, but as you know, written Chinese has no breaks or spaces between words, so the SVM cannot analyze Chinese at the moment. Here is an example Chinese sentence:
e.g. "我和小明一起吃晚饭。" (roughly: "Xiaoming and I had dinner together.")
We rely on a tokenizer to split the words, so I used jieba, the most popular Chinese tokenizer, to add this capability.
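To illustrate what jieba does with the sentence above (a minimal sketch; jieba must be installed, and the guard below is only so the snippet degrades gracefully without it):

```python
# jieba segments a Chinese sentence into words, giving the SVM the
# whitespace-like token boundaries it expects.
try:
    import jieba
except ImportError:
    jieba = None  # install with: pip install jieba

sentence = "我和小明一起吃晚饭。"

if jieba is not None:
    tokens = list(jieba.cut(sentence))
    # Join with spaces so downstream text processing sees word boundaries.
    print(" ".join(tokens))
else:
    tokens = []
```

A useful property of `jieba.cut` is that the tokens are an exact partition of the input string, so no characters are lost or invented during segmentation.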
I am sure my code can make smart_importer smarter. I wrote some code and tests, though in a rough way.
If you have any suggestions or feedback, you are very welcome to leave them here.
Thanks.