DLLXW / baby-llama2-chinese

A repo for pretraining a small-parameter Chinese LLaMA2 from scratch and then applying SFT; a single 24GB GPU is enough to obtain a chat-llama2 with basic Chinese question-answering ability.
MIT License

Problem with tokenizer? #63

Open · shokhjakhonone opened 3 months ago

shokhjakhonone commented 3 months ago

I am writing to ask for your help with a problem with the tokenizer. I have been trying to solve it for a while now, but without success. When I run eval.py I get the following traceback:

Traceback (most recent call last):
  File "/content/baby-llama2-chinese/eval.py", line 81, in <module>
    tokenizer=ChatGLMTokenizer(vocab_file='./chatglm_tokenizer/tokenizer.model')
  File "/content/baby-llama2-chinese/chatglm_tokenizer/tokenization_chatglm.py", line 68, in __init__
    super().__init__(padding_side=padding_side, clean_up_tokenization_spaces=clean_up_tokenization_spaces, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils.py", line 367, in __init__
    self._add_tokens(
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils.py", line 467, in _add_tokens
    current_vocab = self.get_vocab().copy()
  File "/content/baby-llama2-chinese/chatglm_tokenizer/tokenization_chatglm.py", line 112, in get_vocab
    vocab = {self._convert_id_to_token(i): i for i in range(self.vocab_size)}
  File "/content/baby-llama2-chinese/chatglm_tokenizer/tokenization_chatglm.py", line 107, in vocab_size
    return self.tokenizer.n_words
AttributeError: 'ChatGLMTokenizer' object has no attribute 'tokenizer'. Did you mean: 'tokenize'?

I would be very grateful if you could help me solve this problem. I am available to answer any questions you may have.

Thank you for your time and consideration.
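
For context: the traceback points to an initialization-order problem. Recent transformers releases (roughly 4.34 and later) call get_vocab() from inside PreTrainedTokenizer.__init__, but ChatGLMTokenizer only assigns self.tokenizer (its SentencePiece wrapper) after calling super().__init__(). The toy sketch below reproduces just that pattern; the class names Base and Child are illustrative, not taken from the repository.

```python
# Toy illustration (not the repository's code) of the initialization order implied by
# the traceback: the base class queries the vocabulary during __init__, before the
# subclass has created the attribute that the vocabulary lookup needs.

class Base:
    def __init__(self, **kwargs):
        # stands in for transformers' PreTrainedTokenizer.__init__, which in
        # recent releases calls self.get_vocab() while registering added tokens
        _ = self.get_vocab()


class Child(Base):
    def __init__(self):
        super().__init__()           # get_vocab() runs here ...
        self.tokenizer = object()    # ... but self.tokenizer is only set here

    def get_vocab(self):
        # mirrors ChatGLMTokenizer.get_vocab() reading self.tokenizer
        return {"n_words": self.tokenizer}


try:
    Child()
except AttributeError as e:
    print(e)  # 'Child' object has no attribute 'tokenizer'
```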

ringwraith commented 3 months ago

Reverting to transformers==4.33.0 resolved the problem.
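
For reference, that corresponds to pinning the package, e.g. `pip install transformers==4.33.0`; releases from roughly 4.34 onward rework how PreTrainedTokenizer.__init__ registers added tokens, which is what triggers the get_vocab() call seen in the traceback above.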

buhe commented 1 month ago

The Python version is too high; switch to 3.10.

buhe commented 1 month ago

https://github.com/DLLXW/baby-llama2-chinese/pull/65
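
For readers who cannot open the PR: below is a minimal sketch of the kind of change that usually fixes this error on newer transformers, namely constructing the SentencePiece-backed tokenizer before calling super().__init__() so that vocab_size and get_vocab() already work during base-class initialization. The names used (SPTokenizer, the keyword arguments) follow the traceback and common ChatGLM tokenizer code and are assumptions; the actual diff in #65 may differ.

```python
# Hypothetical reordering inside chatglm_tokenizer/tokenization_chatglm.py
# (an assumption about the fix, not a copy of PR #65).
from transformers import PreTrainedTokenizer


class ChatGLMTokenizer(PreTrainedTokenizer):
    def __init__(self, vocab_file, padding_side="left",
                 clean_up_tokenization_spaces=False, **kwargs):
        self.name = "GLMTokenizer"
        self.vocab_file = vocab_file
        # Create the SentencePiece wrapper *before* super().__init__(),
        # because newer transformers call get_vocab() during base-class init.
        # SPTokenizer is the repo's own SentencePiece helper class.
        self.tokenizer = SPTokenizer(vocab_file)
        super().__init__(padding_side=padding_side,
                         clean_up_tokenization_spaces=clean_up_tokenization_spaces,
                         **kwargs)

    @property
    def vocab_size(self):
        return self.tokenizer.n_words

    def get_vocab(self):
        return {self._convert_id_to_token(i): i for i in range(self.vocab_size)}
```

Either route, pinning transformers==4.33.0 or reordering the constructor as above, avoids the AttributeError.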