-
Nikolay:
Length filtering. Since Chinese sentences normally come as one continuous string of characters, traditional length filtering doesn't work. Furthermore, since one word can be made of 1-4 Chinese ch…
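A minimal sketch of one possible approach (an assumption, not from the original message): segment first with jieba, which appears elsewhere in these threads, then filter on the resulting word count; the thresholds below are placeholders.
```
# Sketch: length-filter Chinese sentences by segmented word count
# rather than raw character count. Thresholds are placeholders.
import jieba

MIN_WORDS, MAX_WORDS = 3, 50

def keep_sentence(sentence: str) -> bool:
    words = jieba.lcut(sentence)  # split the continuous string into words
    return MIN_WORDS <= len(words) <= MAX_WORDS

print(keep_sentence("我爱自然语言处理"))  # True: 4 segmented words
```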
-
Currently charabia has wrong segmentation for Chinese and Japanese (#591); 1.1.1-alpha.1 does not solve the problem.
My native language is Chinese, and I am developing a web application. Therefore, I tried u…
-
Nikolay:
Chinese characters should be added. In general we can use Unicode ranges to do so, but they are somewhat complicated: https://stackoverflow.com/questions/43418812/check-whether-a-string-cont…
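For illustration, a minimal sketch of such a check using only the basic CJK Unified Ideographs block (the full set of ranges is wider, as the linked answer discusses):
```
# Sketch: detect Chinese characters via the main CJK Unified Ideographs
# block (U+4E00..U+9FFF); extension blocks are omitted for brevity.
def contains_chinese(text: str) -> bool:
    return any('\u4e00' <= ch <= '\u9fff' for ch in text)

print(contains_chinese("hello 世界"))  # True
print(contains_chinese("hello"))       # False
```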
-
Name: jieba
Version: 0.42.1
Summary: Chinese Words Segmentation Utilities
Home-page: https://github.com/fxsjy/jieba
Author: Sun, Junyi
Author-email: ccnusjy@gmail.com
License: MIT
```
# enco…
-
**Describe the bug**
When opening a YouTube video that has two subtitle tracks, English and Chinese (Simplified), the app shows the English subtitles.
**To Reproduce**
Steps to reproduce the behavior:
1. Open App
2. Choos…
-
Hi, I would like to use this package to help with Chinese learning. I would be willing to help with development, but might need some pointers. I would probably use `jieba` for tokenization. Please let…
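For reference, a minimal sketch of what jieba tokenization looks like (how it would be wired into this package is left open):
```
import jieba

# Default (accurate) mode: segment a sentence into a list of words.
tokens = jieba.lcut("我来到北京清华大学")
print(tokens)  # e.g. ['我', '来到', '北京', '清华大学']
```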
-
Currently the tokenizer is hard-coded to the default; it would be better to include a configurable tokenizer for Chinese (tantivy-jieba and cang-jie), Japanese (lindera and tantivy-tokenizer-tiny-segm…
-
The pre-training [README](https://github.com/fastnlp/CPT/blob/master/pretrain/README.md) mentions that the `dataset`, `vocab` and `roberta_zh` have to be prepared before training.
Is ther…
-
I'm using jieba to tokenize my Chinese documents, as suggested in the issues here and in the documentation. The documentation also says that if I use a vectorizer, I cannot use a candid…
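For illustration, a minimal sketch of plugging jieba into a vectorizer; using scikit-learn's CountVectorizer here is an assumption about which vectorizer is meant:
```
import jieba
from sklearn.feature_extraction.text import CountVectorizer

# Use jieba as the tokenizer so the vectorizer splits on Chinese words
# instead of whitespace; token_pattern=None silences the unused-pattern warning.
vectorizer = CountVectorizer(tokenizer=jieba.lcut, token_pattern=None)
docs = ["我喜欢机器学习", "机器学习很有趣"]
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
```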
-
ICU is not a good choice for Chinese. In addition, it is very important for Chinese word segmentation to support a customized dictionary, because the vocabulary used in different industries is completely d…
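As one concrete illustration of dictionary customization (a sketch assuming jieba rather than ICU; the terms added are hypothetical examples):
```
import jieba

# Register domain-specific terms so they are kept as single tokens;
# a larger dictionary can be loaded with jieba.load_userdict("user_dict.txt").
jieba.add_word("云原生")    # "cloud native", a hypothetical industry term
jieba.add_word("量化宽松")  # "quantitative easing"

print(jieba.lcut("云原生架构和量化宽松政策"))
```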