FuseAI Project
https://huggingface.co/FuseAI

Starling-LM-7B-alpha tokenizer issues #19

Closed by duguodong7 2 months ago

duguodong7 commented 2 months ago

Hi, when using the Starling model for get_representation, do we need to set `tknz_trust_remote_code = True`? We ran into this error:

RuntimeError: CUDA error: device-side assert triggered. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Should this be fixed by modifying the tokenizer config files or by changing the code? Other models such as Qwen and Mixtral run fine; only Starling hits this error about halfway through the run.

Looking forward to your reply!
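For what it's worth, a device-side assert raised from the embedding layer usually means an input id falls outside the model's embedding table. One way to surface the underlying error is to run the same forward pass on CPU, which reports the offending index directly. A minimal sketch, assuming the public Starling checkpoint path and an arbitrary prompt (neither taken from this thread):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "berkeley-nest/Starling-LM-7B-alpha"  # assumed checkpoint path

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)

# On CPU, an out-of-range token id raises a plain
# "index out of range in self" IndexError instead of the
# opaque CUDA device-side assert.
inputs = tokenizer("text that triggered the assert", return_tensors="pt")
with torch.no_grad():
    model(**inputs)
```

Setting `CUDA_LAUNCH_BLOCKING=1` before the GPU run is another way to get a more precise stack trace.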

duguodong7 commented 2 months ago

After printing the input_ids, I found that a few of them have a maximum id of 32002. I'm not sure what is causing this.
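An id of 32002 is suspicious if the model's embedding table is smaller than the tokenizer's vocabulary (including added special tokens). A quick check, as a sketch assuming the standard transformers API and the same assumed checkpoint path as above:

```python
from transformers import AutoConfig, AutoTokenizer

path = "berkeley-nest/Starling-LM-7B-alpha"  # assumed checkpoint path
tokenizer = AutoTokenizer.from_pretrained(path)
config = AutoConfig.from_pretrained(path)

print("tokenizer size (incl. added tokens):", len(tokenizer))
print("model embedding size:", config.vocab_size)

ids = tokenizer("example text", add_special_tokens=True)["input_ids"]
print("out-of-range ids:", [i for i in ids if i >= config.vocab_size])
```

If the tokenizer's length exceeds `config.vocab_size`, any extra special token defined in the tokenizer config (for example a stray `sep_token`) can produce ids the embedding table cannot index, which would explain the assert above.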

duguodong7 commented 2 months ago

A small correction to my message above: it is the internlm2 model that needs `tknz_trust_remote_code = True`, not Starling. Is that because internlm2 requires token access?
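As a side note, internlm2's tokenizer is defined by custom Python code shipped in its model repo, which is why `trust_remote_code=True` is needed there; to my knowledge it is not about gated/token access. A minimal sketch (the repo id is just an example):

```python
from transformers import AutoTokenizer

# internlm2 registers a custom tokenizer class in its repo,
# so transformers must be allowed to execute that code.
tok = AutoTokenizer.from_pretrained("internlm/internlm2-7b", trust_remote_code=True)
print(type(tok).__name__)
```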

18907305772 commented 2 months ago

Please replace Starling's tokenizer-related files (added_tokens.json, generation_config.json, special_tokens_map.json, tokenizer.json, tokenizer.model, tokenizer_config.json) with the corresponding files from OpenChat, since some of Starling's files contain errors.
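A sketch of how that replacement might be scripted with `huggingface_hub`; the OpenChat repo id and local Starling directory below are assumptions, not taken from this thread:

```python
import shutil
from huggingface_hub import hf_hub_download
from huggingface_hub.utils import EntryNotFoundError

OPENCHAT_REPO = "openchat/openchat_3.5"   # assumed source repo
STARLING_DIR = "./Starling-LM-7B-alpha"   # assumed local checkpoint dir

FILES = [
    "added_tokens.json",
    "generation_config.json",
    "special_tokens_map.json",
    "tokenizer.json",
    "tokenizer.model",
    "tokenizer_config.json",
]

for name in FILES:
    try:
        src = hf_hub_download(repo_id=OPENCHAT_REPO, filename=name)
    except EntryNotFoundError:
        print(f"{name} not present in {OPENCHAT_REPO}, skipping")
        continue
    shutil.copy(src, f"{STARLING_DIR}/{name}")
    print(f"replaced {name}")
```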

duguodong7 commented 2 months ago

Thanks very much. I worked it out by removing the sep_token from special_tokens_map.json and tokenizer_config.json in Starling.
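For anyone hitting the same issue, that edit can be scripted roughly as follows (a sketch; the local directory path is an assumption):

```python
import json
from pathlib import Path

starling_dir = Path("./Starling-LM-7B-alpha")  # assumed local checkpoint dir

for name in ("special_tokens_map.json", "tokenizer_config.json"):
    path = starling_dir / name
    data = json.loads(path.read_text())
    # Drop the sep_token entry if present, then write the file back.
    if data.pop("sep_token", None) is not None:
        path.write_text(json.dumps(data, indent=2, ensure_ascii=False))
        print(f"removed sep_token from {name}")
```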