DLLXW / baby-llama2-chinese

A repository for pretraining from scratch + SFT of a small-parameter-count Chinese LLaMa2; a single 24 GB GPU is enough to train a chat-llama2 with basic Chinese Q&A ability.
MIT License

ChatGLMTokenizer class #82

Open licx102359 opened 2 weeks ago

licx102359 commented 2 weeks ago

File "/qiuwkai27/cx/baby-llama2-chinese/sft.py", line 274, in tokenizer=ChatGLMTokenizer(vocab_file='./chatglm_tokenizer/tokenizer.model') File "/qiuwkai27/cx/baby-llama2-chinese/chatglm_tokenizer/tokenization_chatglm.py", line 68, in init super().init(padding_side=padding_side, clean_up_tokenization_spaces=clean_up_tokenization_spaces, **kwargs) File "/root/miniconda3/envs/cxx/lib/python3.10/site-packages/transformers/tokenization_utils.py", line 436, in init self._add_tokens( File "/root/miniconda3/envs/cxx/lib/python3.10/site-packages/transformers/tokenization_utils.py", line 544, in _add_tokens current_vocab = self.get_vocab().copy() File "/qiuwkai27/cx/baby-llama2-chinese/chatglm_tokenizer/tokenization_chatglm.py", line 110, in get_vocab vocab = {self._convert_id_to_token(i): i for i in range(self.vocab_size)} File "/qiuwkai27/cx/baby-llama2-chinese/chatglm_tokenizer/tokenization_chatglm.py", line 106, in vocab_size return self.tokenizer.n_words AttributeError: 'ChatGLMTokenizer' object has no attribute 'tokenizer'. Did you mean: 'tokenize'? 我看文件定义了啊,为什么还是报这种错误

dt-3t commented 2 weeks ago

In `\chatglm_tokenizer\tokenization_chatglm.py`, move the line `self.tokenizer = SPTokenizer(vocab_file)` in the `__init__` method of the `ChatGLMTokenizer` class above the `super().__init__(...)` call.
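A minimal sketch of the reordered `__init__`, assuming the file otherwise matches the repo's `tokenization_chatglm.py` (the surrounding attributes here are illustrative, not a verbatim copy):

```python
from transformers import PreTrainedTokenizer
# SPTokenizer is the SentencePiece wrapper defined earlier in the same file.

class ChatGLMTokenizer(PreTrainedTokenizer):
    def __init__(self, vocab_file, padding_side="left",
                 clean_up_tokenization_spaces=False, **kwargs):
        self.name = "GLMTokenizer"  # illustrative; keep whatever the file sets
        self.vocab_file = vocab_file
        # Build the SentencePiece wrapper BEFORE calling super().__init__():
        # newer transformers releases call self.get_vocab() inside the base
        # constructor (via _add_tokens), and get_vocab() reads
        # self.tokenizer.n_words, hence the AttributeError above.
        self.tokenizer = SPTokenizer(vocab_file)
        super().__init__(
            padding_side=padding_side,
            clean_up_tokenization_spaces=clean_up_tokenization_spaces,
            **kwargs,
        )
```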

licx102359 commented 5 days ago

Moving the `self.tokenizer = SPTokenizer(vocab_file)` line in the `__init__` method of the `ChatGLMTokenizer` class in `\chatglm_tokenizer\tokenization_chatglm.py` above the `super().__init__(...)` call works. It was a transformers version issue.
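For anyone who prefers not to patch the file: pinning transformers to a release from before the base tokenizer started calling `get_vocab()` during `__init__` should also avoid the crash. The cutoff below is an assumption (the refactor landed around 4.34), so verify against your own environment:

```
pip show transformers            # check the currently installed version
pip install "transformers<4.34"  # assumed cutoff for the get_vocab()-in-__init__ change
```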