QwenLM / Qwen

The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.
Apache License 2.0

How to convert qwen.tiktoken to tokenizer.model #1204

Closed: cloudyuyuyu closed this issue 5 months ago

cloudyuyuyu commented 5 months ago

Start Date

No response

Implementation PR

No response

Reference Issues

No response

Summary

The on-device inference engine is only compatible with tokenizer.model and cannot support the tiktoken format.

Basic Example

The on-device inference engine is only compatible with tokenizer.model.

Drawbacks

The on-device inference engine is only compatible with tokenizer.model.

Unresolved questions

No response

jklj077 commented 5 months ago

Please refer to tokenization_note.md in this repo. In short, the two formats rely on different algorithms: qwen.tiktoken is a byte-level BPE vocabulary, and it cannot be converted into a sentencepiece vocabulary (tokenizer.model) because of inherent differences between the two.
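To make the byte-level point a bit more concrete, here is a minimal sketch, not an official tool. It assumes the vocabulary file uses the plain-text tiktoken layout, with one `<base64 token bytes> <rank>` pair per line. The helper names `load_tiktoken_ranks` and `non_utf8_tokens` and the four sample tokens are hypothetical, made up for illustration. It shows that some byte-level BPE tokens are not valid UTF-8 strings on their own, which is one reason a text-piece vocabulary cannot simply mirror them.

```python
import base64


def load_tiktoken_ranks(lines):
    """Parse '<base64 token bytes> <rank>' lines into a {bytes: int} mapping.

    Assumption: this mirrors the plain-text tiktoken vocabulary layout.
    """
    ranks = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        token_b64, rank = line.split()
        ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks


def non_utf8_tokens(ranks):
    """Return the tokens whose bytes do not decode as UTF-8 by themselves."""
    bad = []
    for token in ranks:
        try:
            token.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(token)
    return bad


# Hypothetical sample vocabulary (NOT taken from qwen.tiktoken):
# two ordinary text tokens, one truncated multi-byte character, one lone lead byte.
sample_tokens = [b"hello", b" world", "通".encode("utf-8")[:2], b"\xe4"]
sample_lines = [
    f"{base64.b64encode(tok).decode('ascii')} {rank}"
    for rank, tok in enumerate(sample_tokens)
]

ranks = load_tiktoken_ranks(sample_lines)
bad = non_utf8_tokens(ranks)
print(f"{len(bad)} of {len(ranks)} sample tokens are not valid UTF-8 on their own")
```

To try this on a real vocabulary, pass the lines of the file to `load_tiktoken_ranks` (for example via `open(path, encoding="ascii")`), assuming the file matches the layout described above.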