Open 1719930244 opened 1 year ago
same issue
Is there an existing issue for this?
Current Behavior
The vocab_size of the official sentencepiece vocabulary (tokenizer.model) is smaller than the size of nn.Embedding(), so when the model generates an id outside the tokenizer vocabulary, decoding fails with an error.
64789 (tokenizer vocab_size) ---> 65024 (nn.Embedding size)
Expected Behavior
Please explain how to prevent the model from generating invalid token_ids (or how to decode them safely).
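One common workaround (a minimal sketch, not the official fix) is to mask the logits of ids the tokenizer cannot decode before argmax/sampling, so an undecodable id can never be generated. The sizes below are taken from this report; `mask_out_of_vocab` is a hypothetical helper, shown here in plain Python for clarity:

```python
import math

# Sizes from the report: the tokenizer can decode ids 0..64788,
# but the logits vector (embedding rows) has 65024 entries.
TOKENIZER_VOCAB_SIZE = 64789
EMBEDDING_SIZE = 65024

def mask_out_of_vocab(logits, vocab_size=TOKENIZER_VOCAB_SIZE):
    """Set logits of undecodable ids to -inf before argmax/sampling."""
    return [
        x if i < vocab_size else -math.inf
        for i, x in enumerate(logits)
    ]

# Toy logits where the raw argmax would be an invalid id (64790).
logits = [0.0] * EMBEDDING_SIZE
logits[64790] = 5.0
logits[100] = 3.0

masked = mask_out_of_vocab(logits)
best = max(range(len(masked)), key=masked.__getitem__)
print(best)  # → 100, the best *decodable* id
```

With a real model the same idea can be applied to the logits tensor each step (e.g. via a custom `LogitsProcessor` in transformers) rather than to a Python list.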
Steps To Reproduce
Decode any token_id of 64789 or above.
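The failure mode can be illustrated without the model weights using a toy decoder (a sketch; the real tokenizer is sentencepiece, and the exact exception message may differ):

```python
# Toy stand-in for the reported mismatch: the embedding has 65024 rows,
# but only ids 0..64788 exist in the tokenizer vocabulary, so decoding
# any id >= 64789 fails.
TOKENIZER_VOCAB_SIZE = 64789

def decode(token_id):
    # Hypothetical decoder; the real call is the sentencepiece decode.
    if not (0 <= token_id < TOKENIZER_VOCAB_SIZE):
        raise IndexError("piece id is out of range")
    return f"<piece_{token_id}>"

print(decode(64788))   # last decodable id
try:
    decode(64789)      # first id that exists only in the embedding
except IndexError as err:
    print("decode failed:", err)
```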
Environment
Anything else?
No response