FuseAI Project
https://huggingface.co/FuseAI

Starling-LM-7B-alpha tokenizer issues #19

Closed by duguodong7 2 months ago

duguodong7 commented 2 months ago

Hi, when using the Starling model for get_representation, do we need to set `tknz_trust_remote_code = True`? We ran into this error:

RuntimeError: CUDA error: device-side assert triggered. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Should this be fixed by modifying the tokenizer config files or by changing the code? Other models such as Qwen and Mixtral run fine; only Starling hits this error about halfway through the run.

Looking forward to your reply!
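For what it's worth, a device-side assert raised from the embedding layer usually means an input id falls outside the model's embedding table. One way to surface the underlying error is to run the same forward pass on CPU, which reports the offending index directly. A minimal sketch, assuming the public Starling checkpoint path and an arbitrary prompt (neither taken from this thread):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "berkeley-nest/Starling-LM-7B-alpha"  # assumed checkpoint path

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)

# On CPU, an out-of-range token id raises a plain
# "index out of range in self" IndexError instead of the
# opaque CUDA device-side assert.
inputs = tokenizer("text that triggered the assert", return_tensors="pt")
with torch.no_grad():
    model(**inputs)
```

Setting `CUDA_LAUNCH_BLOCKING=1` before the GPU run is another way to get a more precise stack trace.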

duguodong7 commented 2 months ago

After printing the input_ids, I found that a few of them have a maximum id of 32002. I'm not sure what is causing this.
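An id of 32002 is suspicious if the model's embedding table is smaller than the tokenizer's vocabulary (including added special tokens). A quick check, as a sketch assuming the standard transformers API and the same assumed checkpoint path as above:

```python
from transformers import AutoConfig, AutoTokenizer

path = "berkeley-nest/Starling-LM-7B-alpha"  # assumed checkpoint path
tokenizer = AutoTokenizer.from_pretrained(path)
config = AutoConfig.from_pretrained(path)

print("tokenizer size (incl. added tokens):", len(tokenizer))
print("model embedding size:", config.vocab_size)

ids = tokenizer("example text", add_special_tokens=True)["input_ids"]
print("out-of-range ids:", [i for i in ids if i >= config.vocab_size])
```

If the tokenizer's length exceeds `config.vocab_size`, any extra special token defined in the tokenizer config (for example a stray `sep_token`) can produce ids the embedding table cannot index, which would explain the assert above.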

duguodong7 commented 2 months ago

A small correction to my message above: it is the internlm2 model that needs `tknz_trust_remote_code = True`, not Starling. Is that because internlm2 requires token access?
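As a side note, internlm2's tokenizer is defined by custom Python code shipped in its model repo, which is why `trust_remote_code=True` is needed there; to my knowledge it is not about gated/token access. A minimal sketch (the repo id is just an example):

```python
from transformers import AutoTokenizer

# internlm2 registers a custom tokenizer class in its repo,
# so transformers must be allowed to execute that code.
tok = AutoTokenizer.from_pretrained("internlm/internlm2-7b", trust_remote_code=True)
print(type(tok).__name__)
```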

18907305772 commented 2 months ago

Please replace Starling's tokenizer-related files (added_tokens.json, generation_config.json, special_tokens_map.json, tokenizer.json, tokenizer.model, tokenizer_config.json) with the corresponding files from OpenChat, since some of Starling's files contain errors.
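A sketch of how that replacement might be scripted with `huggingface_hub`; the OpenChat repo id and local Starling directory below are assumptions, not taken from this thread:

```python
import shutil
from huggingface_hub import hf_hub_download
from huggingface_hub.utils import EntryNotFoundError

OPENCHAT_REPO = "openchat/openchat_3.5"   # assumed source repo
STARLING_DIR = "./Starling-LM-7B-alpha"   # assumed local checkpoint dir

FILES = [
    "added_tokens.json",
    "generation_config.json",
    "special_tokens_map.json",
    "tokenizer.json",
    "tokenizer.model",
    "tokenizer_config.json",
]

for name in FILES:
    try:
        src = hf_hub_download(repo_id=OPENCHAT_REPO, filename=name)
    except EntryNotFoundError:
        print(f"{name} not present in {OPENCHAT_REPO}, skipping")
        continue
    shutil.copy(src, f"{STARLING_DIR}/{name}")
    print(f"replaced {name}")
```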

duguodong7 commented 2 months ago

Thanks very much. I worked it out by removing the sep_token from special_tokens_map.json and tokenizer_config.json in Starling.
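For anyone hitting the same issue, that edit can be scripted roughly as follows (a sketch; the local directory path is an assumption):

```python
import json
from pathlib import Path

starling_dir = Path("./Starling-LM-7B-alpha")  # assumed local checkpoint dir

for name in ("special_tokens_map.json", "tokenizer_config.json"):
    path = starling_dir / name
    data = json.loads(path.read_text())
    # Drop the sep_token entry if present, then write the file back.
    if data.pop("sep_token", None) is not None:
        path.write_text(json.dumps(data, indent=2, ensure_ascii=False))
        print(f"removed sep_token from {name}")
```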