How to convert qwen.tiktoken to tokenizer.model

QwenLM / Qwen

The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.

Apache License 2.0

13.59k stars, 1.11k forks

How to convert qwen.tiktoken to tokenizer.model #1204

Closed cloudyuyuyu closed 5 months ago

cloudyuyuyu commented 5 months ago

Start Date

No response

Implementation PR

No response

Summary

The on-device inference engine is only compatible with tokenizer.model and cannot support the tiktoken format.

Basic Example

The on-device inference engine is only compatible with tokenizer.model.

Drawbacks

The on-device inference engine is only compatible with tokenizer.model.

Unresolved questions

No response

jklj077 commented 5 months ago

Please refer to tokenization_note.md in this repo. Essentially, the two are different algorithms: a byte-level BPE vocabulary cannot be converted to a SentencePiece vocabulary because of inherent differences between them.
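
One way to see the kind of difference being described is to look at what a byte-level BPE vocabulary contains. The sketch below is only an illustration, not a conversion tool. It assumes qwen.tiktoken follows the usual tiktoken layout of one base64-encoded token and its integer rank per line, and it runs on a hypothetical three-entry sample rather than the real file. The helper names `load_tiktoken_ranks` and `non_utf8_tokens` are made up for this example. It shows that a byte-level token can be a fragment of a multi-byte character, which has no representation as a standalone Unicode string.

```python
import base64


def load_tiktoken_ranks(text: str) -> dict[bytes, int]:
    """Parse lines of the form '<base64 token> <rank>' into a bytes -> rank map.

    Assumption: this mirrors the layout of a .tiktoken vocabulary file.
    """
    ranks = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        token_b64, rank = line.split()
        ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks


def non_utf8_tokens(ranks: dict[bytes, int]) -> list[bytes]:
    """Return the tokens whose raw bytes are not valid UTF-8 on their own."""
    bad = []
    for token in ranks:
        try:
            token.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(token)
    return bad


# Hypothetical sample, NOT taken from qwen.tiktoken:
# an ASCII token, the first two bytes of a three-byte character,
# and the complete three-byte character.
sample_tokens = [b"hello", b"\xe4\xbd", b"\xe4\xbd\xa0"]
sample = "\n".join(
    f"{base64.b64encode(tok).decode('ascii')} {rank}"
    for rank, tok in enumerate(sample_tokens)
)

ranks = load_tiktoken_ranks(sample)
bad = non_utf8_tokens(ranks)
print(f"{len(bad)} of {len(ranks)} sample tokens are not valid UTF-8: {bad}")
```

To try this on the real vocabulary, the contents of qwen.tiktoken could be read and passed to `load_tiktoken_ranks` in place of `sample`, provided the file matches the assumed layout. Any tokens reported by `non_utf8_tokens` are entries that cannot be written as plain text pieces, which is consistent with the point above that the two vocabularies are not interchangeable.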
