QwenLM / Qwen

The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.
Apache License 2.0
13.59k stars, 1.11k forks

Qwen's ability to recognize and generate Traditional Chinese #1086

Closed ACBBZ closed 6 months ago

ACBBZ commented 7 months ago

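How well does Qwen recognize and generate Traditional Chinese? Also, if I fine-tune with XTuner, how should I go about enlarging the tokenizer vocabulary to support Traditional Chinese?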

jklj077 commented 7 months ago

Considering that this repository pertains to Qwen(1.0), please note that the following information may not directly apply to Qwen1.5.

Qwen(1.0) can understand and generate text in Traditional Chinese, but its performance on such tasks has not been exhaustively tested. It is therefore advisable to run your own evaluation to confirm that it suits your specific use case.

Regarding the tokenizer and vocabulary, Qwen(1.0) uses byte-level Byte Pair Encoding (BPE), which lets it process text in any language. Its existing vocabulary contains over 150,000 tokens and covers Traditional Chinese characters. To expand or customize the vocabulary, please refer to the detailed instructions in the tokenization_note.
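To illustrate why a byte-level BPE tokenizer can handle any script, here is a minimal toy sketch (not part of the original reply, and not Qwen's actual tokenizer). It assumes a base vocabulary of the 256 single-byte values plus a couple of made-up merge rules; the helper names `to_byte_tokens` and `apply_merges` are hypothetical. A character with no dedicated token simply stays as its UTF-8 bytes, while a character covered by merges collapses into one token.

```python
# Toy byte-level BPE illustration. Assumption: this is a simplified sketch,
# not the tokenizer shipped with Qwen; the merge rules below are invented.

def to_byte_tokens(text: str) -> list[bytes]:
    """Split text into single-byte tokens taken from its UTF-8 encoding."""
    return [bytes([b]) for b in text.encode("utf-8")]

def apply_merges(tokens: list[bytes], merges: list[tuple[bytes, bytes]]) -> list[bytes]:
    """Apply each merge rule in order, joining adjacent pairs that match."""
    for left, right in merges:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)  # fuse the matching pair
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# Hypothetical merge rules that assemble the Traditional character "體"
# from its three UTF-8 bytes: first bytes 1+2, then that pair with byte 3.
b = to_byte_tokens("體")
merges = [(b[0], b[1]), (b[0] + b[1], b[2])]

# "繁" has no merge rule here, so it stays as three raw byte tokens;
# "體" collapses into a single token. Either way nothing is out of vocabulary.
tokens = apply_merges(to_byte_tokens("繁體"), merges)
decoded = b"".join(tokens).decode("utf-8")  # concatenating tokens restores the text
```

In this sketch, characters without merges cost more tokens but still round-trip losslessly, which is why vocabulary expansion affects efficiency rather than basic coverage.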

For any subsequent changes or adjustments required for downstream applications, kindly seek support from the relevant party involved.

WangJianQ-cmd commented 7 months ago

After fine-tuning, my Qwen sometimes outputs Traditional Chinese characters, and I don't know why.