THUDM / ChatGLM2-6B

ChatGLM2-6B: An Open Bilingual Chat LLM

SentencePiece error on generated token IDs #416

Open 1719930244 opened 1 year ago

1719930244 commented 1 year ago

Is there an existing issue for this?

Current Behavior

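The vocab_size of the officially provided SentencePiece vocabulary (tokenizer.model) is smaller than the size of nn.Embedding(), so when the model generates an ID outside the vocabulary, it cannot be decoded and an error is raised.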

64789 (tokenizer vocab_size) ---> 65024 (nn.Embedding size)

Expected Behavior

Please explain how to avoid generating invalid token IDs.
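
One possible direction, offered only as a sketch and not as a confirmed fix from the maintainers: mask the logits of every ID the tokenizer cannot decode before picking the next token, so those IDs can never be selected. The example below uses plain Python lists and toy logits. The two sizes come from this issue, while `mask_out_of_vocab` and `greedy_pick` are hypothetical helper names. The cutoff should be whatever size the tokenizer can actually decode, and 64789 is assumed here only because it is the number reported above.

```python
import math

# Sizes taken from this issue (assumptions, not verified against the checkpoint).
TOKENIZER_VOCAB_SIZE = 64789  # IDs the SentencePiece vocabulary is reported to cover
EMBEDDING_SIZE = 65024        # number of rows in nn.Embedding


def mask_out_of_vocab(logits, valid_size=TOKENIZER_VOCAB_SIZE):
    """Return a copy of `logits` with every ID >= valid_size set to -inf."""
    return [x if i < valid_size else -math.inf for i, x in enumerate(logits)]


def greedy_pick(logits):
    """Return the index of the largest logit (greedy decoding)."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy logits: the highest raw score sits on an ID the tokenizer cannot decode.
logits = [0.0] * EMBEDDING_SIZE
logits[123] = 5.0
logits[65000] = 9.0

unmasked_id = greedy_pick(logits)                   # picks the undecodable ID
masked_id = greedy_pick(mask_out_of_vocab(logits))  # falls back to a decodable ID
print(unmasked_id, masked_id)
```

With Hugging Face Transformers, the same idea would presumably be packaged as a custom `LogitsProcessor` passed to `generate` through its `logits_processor` argument. That wiring is not shown here and has not been tested against ChatGLM2-6B.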

Steps To Reproduce

Decode any token_id from 64789 onward.
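
As a stopgap on the decoding side, out-of-range IDs could be separated from the sequence before it reaches the tokenizer. This is a minimal sketch, assuming the cutoff of 64789 reported in this issue. `split_decodable` is a hypothetical helper, not part of the ChatGLM2-6B code, and dropping IDs hides the symptom without explaining why the model produced them.

```python
def split_decodable(token_ids, valid_size=64789):
    """Split token IDs into those inside [0, valid_size) and those outside it.

    valid_size defaults to the tokenizer vocab_size reported in the issue
    (an assumption; use the size your tokenizer can actually decode).
    """
    decodable = [t for t in token_ids if 0 <= t < valid_size]
    rejected = [t for t in token_ids if not 0 <= t < valid_size]
    return decodable, rejected


# Example: two of these IDs fall in the gap between 64789 and 65024.
generated = [5, 64788, 64789, 65023, 42]
decodable, rejected = split_decodable(generated)
print(decodable)  # IDs that are safe to pass to the tokenizer's decode
print(rejected)   # IDs worth logging for debugging
```

Logging the rejected IDs instead of silently discarding them makes it easier to see how often the model lands in the undecodable range.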

Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :

Anything else?

No response

NTDXYG commented 1 year ago

same issue