-
Currently charabia produces wrong segmentation for Chinese and Japanese #591, and 1.1.1-alpha.1 does not solve the problem.
My native language is Chinese, and I am developing a web application. Therefore, I tried u…
-
Nikolay:
Chinese characters should be added. In general we can use Unicode ranges to do so, but they are somewhat complicated: https://stackoverflow.com/questions/43418812/check-whether-a-string-cont…
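A minimal sketch of that Unicode-range check in Python, covering only the main CJK blocks (further blocks such as Extension B exist but are omitted here):

```python
import re

# Main CJK blocks only; Extensions B-G and other rarer blocks are omitted.
CJK_RE = re.compile(
    r"[\u4e00-\u9fff"   # CJK Unified Ideographs
    r"\u3400-\u4dbf"    # CJK Unified Ideographs Extension A
    r"\uf900-\ufaff]"   # CJK Compatibility Ideographs
)

def contains_cjk(text: str) -> bool:
    """Return True if the string contains at least one CJK ideograph."""
    return CJK_RE.search(text) is not None

print(contains_cjk("hello"))       # False
print(contains_cjk("hello 世界"))   # True
```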
-
The pre-training [README](https://github.com/fastnlp/CPT/blob/master/pretrain/README.md) mentions that `dataset`, `vocab`, and `roberta_zh` have to be prepared before training.
Is ther…
-
I'm using jieba to tokenize my Chinese documents, as suggested here in the issues and in the documentation. The documentation also says that if I use a vectorizer, I cannot use a candid…
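This sounds like KeyBERT's `vectorizer`/`candidates` distinction, but the tool is not named above, so the following is only a hedged sketch: pass a jieba-backed `CountVectorizer` so the candidate terms come out as Chinese words.

```python
import jieba
from sklearn.feature_extraction.text import CountVectorizer
from keybert import KeyBERT  # assumed tool; the issue text does not name it

doc = "自然语言处理是人工智能的一个重要方向"

# Candidate terms are produced by jieba instead of whitespace splitting.
vectorizer = CountVectorizer(tokenizer=jieba.lcut)

kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc, vectorizer=vectorizer)
print(keywords)
```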
-
### Self Checks
- [X] I have searched for [existing issues](https://github.com/langgenius/dify/issues), including closed ones.
- [X] I confirm that I am using English to…
-
Hi,
I'm trying to use `Jieba.Cut(text, result)` here, but the result shows that it counts `offset`s in bytes, not Unicode characters.
My text content has Chinese and English characters mixed, so I …
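If the binding only reports byte offsets, one workaround is to convert them to character offsets on your side. A minimal sketch, assuming every reported offset falls on a UTF-8 character boundary (segment boundaries always do):

```python
def byte_to_char_offset(text: str, byte_offset: int) -> int:
    """Convert a UTF-8 byte offset into a Unicode character offset."""
    # Decode the byte prefix up to the offset and count its characters.
    return len(text.encode("utf-8")[:byte_offset].decode("utf-8"))

text = "abc中文def"
# "def" starts at byte 9 (3 ASCII bytes + 2 * 3-byte characters) ...
print(byte_to_char_offset(text, 9))  # ... but at character offset 5
```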
-
Chinese texts need a special kind of tokenization: they cannot simply be split on whitespace or into individual characters. It would be nice to add a separate module for segmenting Chinese texts (one option is sketched below).
Option 1: …
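A minimal sketch of what one such option could look like, built on jieba (the choice of library is an assumption; pkuseg, THULAC, or HanLP would slot into the same shape):

```python
import jieba

def segment_zh(text: str) -> list[str]:
    """Segment Chinese text into a list of words."""
    return jieba.lcut(text)

print(segment_zh("我来到北京清华大学"))
# ['我', '来到', '北京', '清华大学']
```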
-
ICU is not a good choice in China. In addition, the ability to customize the dictionary is very important for Chinese word segmentation, because the way words are used in different industries is completely d…
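For reference, jieba exposes exactly that kind of dictionary customization; a minimal sketch (the file name and entries below are made up for illustration):

```python
import jieba

# Load a domain-specific user dictionary: one "word [freq] [POS]" entry
# per line; the path is hypothetical.
jieba.load_userdict("medical_terms.txt")

# Or add/adjust single entries at runtime.
jieba.add_word("冠状动脉造影")
jieba.suggest_freq(("台", "中"), True)  # force "台中" to be split

print(jieba.lcut("患者接受了冠状动脉造影检查"))
```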
-
### Proposal:
It could be better to add support for custom rules in the ICU integration (a rough sketch follows the list below):
- [Rule Based Number Format](https://unicode-org.github.io/icu/userguide/format_parse/numbers/rbnf.html#rules…
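For a rough idea of what rule-based formatting looks like from Python, here is a sketch using PyICU's `RuleBasedNumberFormat` with the built-in spell-out rules; it assumes PyICU is installed and exposes the `URBNFRuleSetTag` enum, and a custom rule-set string could be passed to the same class instead.

```python
from icu import Locale, RuleBasedNumberFormat, URBNFRuleSetTag

# Built-in spell-out rules for Chinese; ICU also accepts a custom
# rule-set string through the same constructor.
rbnf = RuleBasedNumberFormat(URBNFRuleSetTag.SPELLOUT, Locale("zh"))
print(rbnf.format(2024))  # e.g. 二千零二十四
```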
-
I'm just wondering whether this tool supports a Chinese corpus.
For example, am I supposed to use Jieba or another Chinese tokenizer? And is there an interface reserved for a Chinese tokenizer...
Thanks a l…