IDEA-CCNL / Fengshenbang-LM

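Fengshenbang-LM (封神榜大模型) is an open-source large-model ecosystem led by the Center for Cognitive Computing and Natural Language at the IDEA Research Institute, intended to serve as infrastructure for Chinese AIGC and cognitive intelligence.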
Apache License 2.0
3.99k stars 374 forks

T5-qa: mast_id for <extra_id_0> is returned as 2? #338

Open junphine opened 1 year ago

junphine commented 1 year ago

from transformers import AutoTokenizer

path = "/data/nlp/models/IDEA-CCNL/Randeng-T5-784M-QA-Chinese"
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
mast_id = tokenizer.convert_tokens_to_ids("<extra_id_0>")

mast_id: 2

It should be among the last 100 entries of the vocabulary.
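The expectation above can be checked directly. The following is a minimal sketch, not part of the original report: `sentinel_report` is a hypothetical helper, and the default of 100 sentinel tokens is an assumption taken from the question. It reports whether the id falls in the last 100 positions of the vocabulary and whether it collapsed to the tokenizer's unknown-token id. In the stock T5 SentencePiece layout, ids 0, 1 and 2 are typically the pad, end-of-sequence and unknown tokens, so a result of 2 may simply mean the token was not recognized.

```python
def sentinel_report(tokenizer, token="<extra_id_0>", n_sentinels=100):
    """Describe where `token` lands in the tokenizer's vocabulary.

    `tokenizer` is any Hugging Face style tokenizer exposing
    `convert_tokens_to_ids`, `unk_token_id`, and `__len__`.
    `n_sentinels=100` is an assumption taken from the question.
    """
    token_id = tokenizer.convert_tokens_to_ids(token)
    vocab_size = len(tokenizer)  # full size, including added special tokens
    return {
        "token_id": token_id,
        # True when the token was not recognized and fell back to <unk>
        "is_unk": token_id == tokenizer.unk_token_id,
        # True when the id sits in the last `n_sentinels` vocabulary slots
        "in_last_n": vocab_size - n_sentinels <= token_id < vocab_size,
    }
```

Calling `sentinel_report(tokenizer)` with the tokenizer from the snippet above would show whether the returned 2 is the unknown-token id.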

ganzhiruyi commented 1 year ago

Replace AutoTokenizer with T5Tokenizer; the example also uses T5Tokenizer. If you use AutoTokenizer, the following warning explains why it cannot be used directly:
UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text.
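A minimal sketch of the suggested fix, assuming the `transformers` and `sentencepiece` packages are installed. `load_sentinel_id` and its `tokenizer_cls` parameter are hypothetical names introduced here for illustration; only `T5Tokenizer.from_pretrained` and `convert_tokens_to_ids` come from the library.

```python
def load_sentinel_id(path, token="<extra_id_0>", tokenizer_cls=None):
    """Load the slow SentencePiece tokenizer and look up `token`.

    `tokenizer_cls` is a hypothetical hook so another class can be
    swapped in; by default the slow T5Tokenizer is used, as suggested
    in the reply above.
    """
    if tokenizer_cls is None:
        # Imported lazily so the helper can be defined without transformers.
        from transformers import T5Tokenizer
        tokenizer_cls = T5Tokenizer
    tokenizer = tokenizer_cls.from_pretrained(path)
    return tokenizer.convert_tokens_to_ids(token)
```

With the path from the question, `load_sentinel_id("/data/nlp/models/IDEA-CCNL/Randeng-T5-784M-QA-Chinese")` performs the lookup through the slow tokenizer. Passing `use_fast=False` to `AutoTokenizer.from_pretrained` is another way to request the slow tokenizer, though whether it resolves to the same class for this checkpoint is not confirmed in the thread.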