-
Currently charabia produces wrong segmentation for Chinese and Japanese #591, and 1.1.1-alpha.1 does not solve the problem.
My native language is Chinese, and I am developing a web application. Therefore, I tried u…
-
Nikolay:
Chinese characters should be added. In general we can use Unicode ranges to do so, but they are somewhat complicated: https://stackoverflow.com/questions/43418812/check-whether-a-string-cont…
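A minimal sketch of that Unicode-range check in Python, covering only the main CJK blocks (further blocks such as Extension B exist but are omitted here):

```python
import re

# Main CJK blocks only; Extensions B-G and other rarer blocks are omitted.
CJK_RE = re.compile(
    r"[\u4e00-\u9fff"   # CJK Unified Ideographs
    r"\u3400-\u4dbf"    # CJK Unified Ideographs Extension A
    r"\uf900-\ufaff]"   # CJK Compatibility Ideographs
)

def contains_cjk(text: str) -> bool:
    """Return True if the string contains at least one CJK ideograph."""
    return CJK_RE.search(text) is not None

print(contains_cjk("hello"))       # False
print(contains_cjk("hello 世界"))   # True
```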
-
The pre-training [README](https://github.com/fastnlp/CPT/blob/master/pretrain/README.md) mentions that `dataset`, `vocab`, and `roberta_zh` have to be prepared before training.
Is ther…
-
I'm using jieba to tokenize my Chinese documents, as suggested here in the issues and in the documentation. The documentation also says that if I use a vectorizer, I cannot use a candid…
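This sounds like KeyBERT's `vectorizer`/`candidates` distinction, but the tool is not named above, so the following is only a hedged sketch: pass a jieba-backed `CountVectorizer` so the candidate terms come out as Chinese words.

```python
import jieba
from sklearn.feature_extraction.text import CountVectorizer
from keybert import KeyBERT  # assumed tool; the issue text does not name it

doc = "自然语言处理是人工智能的一个重要方向"

# Candidate terms are produced by jieba instead of whitespace splitting.
vectorizer = CountVectorizer(tokenizer=jieba.lcut)

kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc, vectorizer=vectorizer)
print(keywords)
```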
-
### Self Checks
- [X] I have searched for [existing issues](https://github.com/langgenius/dify/issues), including closed ones.
- [X] I confirm that I am using English to…
-
Hi,
I'm trying to use `Jieba.Cut(text, result)` here, but the result shows that it counts `offset`s in bytes, not Unicode characters.
My text content has Chinese and English characters mixed, so I …
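If the binding only reports byte offsets, one workaround is to convert them to character offsets on your side. A minimal sketch, assuming every reported offset falls on a UTF-8 character boundary (segment boundaries always do):

```python
def byte_to_char_offset(text: str, byte_offset: int) -> int:
    """Convert a UTF-8 byte offset into a Unicode character offset."""
    # Decode the byte prefix up to the offset and count its characters.
    return len(text.encode("utf-8")[:byte_offset].decode("utf-8"))

text = "abc中文def"
# "def" starts at byte 9 (3 ASCII bytes + 2 * 3-byte characters) ...
print(byte_to_char_offset(text, 9))  # ... but at character offset 5
```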
-
Chinese texts need a special kind of tokenization: they cannot simply be split on whitespace or into individual characters. It would be nice to add a separate module for segmenting Chinese texts (one option is sketched below).
Option 1: …
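A minimal sketch of what one such option could look like, built on jieba (the choice of library is an assumption; pkuseg, THULAC, or HanLP would slot into the same shape):

```python
import jieba

def segment_zh(text: str) -> list[str]:
    """Segment Chinese text into a list of words."""
    return jieba.lcut(text)

print(segment_zh("我来到北京清华大学"))
# ['我', '来到', '北京', '清华大学']
```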
-
ICU is not a good choice in China. In addition, the ability to customize the dictionary is very important for Chinese word segmentation, because the way words are used in different industries is completely d…
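For reference, jieba exposes exactly that kind of dictionary customization; a minimal sketch (the file name and entries below are made up for illustration):

```python
import jieba

# Load a domain-specific user dictionary: one "word [freq] [POS]" entry
# per line; the path is hypothetical.
jieba.load_userdict("medical_terms.txt")

# Or add/adjust single entries at runtime.
jieba.add_word("冠状动脉造影")
jieba.suggest_freq(("台", "中"), True)  # force "台中" to be split

print(jieba.lcut("患者接受了冠状动脉造影检查"))
```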
-
### Proposal:
It could be better to add support for custom rules in the ICU integration (a rough sketch follows the list below):
- [Rule Based Number Format](https://unicode-org.github.io/icu/userguide/format_parse/numbers/rbnf.html#rules…
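For a rough idea of what rule-based formatting looks like from Python, here is a sketch using PyICU's `RuleBasedNumberFormat` with the built-in spell-out rules; it assumes PyICU is installed and exposes the `URBNFRuleSetTag` enum, and a custom rule-set string could be passed to the same class instead.

```python
from icu import Locale, RuleBasedNumberFormat, URBNFRuleSetTag

# Built-in spell-out rules for Chinese; ICU also accepts a custom
# rule-set string through the same constructor.
rbnf = RuleBasedNumberFormat(URBNFRuleSetTag.SPELLOUT, Locale("zh"))
print(rbnf.format(2024))  # e.g. 二千零二十四
```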
-
I'm just wondering whether this tool supports a Chinese corpus.
For example, am I supposed to use Jieba or another Chinese tokenizer? And is there an interface reserved for a Chinese tokenizer...
Thanks a l…