clue-ai / ChatYuan

ChatYuan: Large Language Model for Dialogue in Chinese and English
https://www.clueai.cn

How can one do unsupervised pretraining based on the YUAN model? #19

Open zhangzai666 opened 1 year ago

zhangzai666 commented 1 year ago

I looked at Hugging Face's unsupervised-training example code and ran a quick test:

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch

tokenizer = T5Tokenizer.from_pretrained("premodel/ChatYuan-large-v1")
model = T5ForConditionalGeneration.from_pretrained("premodel/ChatYuan-large-v1")

input_ids = tokenizer("一只<extra_id_0>走在<extra_id_1>大街上", return_tensors="pt").input_ids
labels = tokenizer("<extra_id_0>可爱的<extra_id_1>宽敞的<extra_id_2>", return_tensors="pt").input_ids

outputs = model(input_ids=input_ids, labels=labels)
loss = outputs.loss
logits = outputs.logits
```

It fails with:

```
IndexError: index out of range in self
```

This looks like an out-of-range index into the embedding layer. I checked the model's vocabulary and it does not contain the `<extra_id_*>` sentinel tokens, yet tokenization itself raised no error. How can one do unsupervised pretraining based on the YUAN model, and what is the data format for unsupervised pretraining? Many thanks.
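For context, an `IndexError: index out of range in self` raised inside the embedding layer usually means some token id in `input_ids` or `labels` is outside the embedding table (i.e. `>= model.config.vocab_size`), which happens when the tokenizer emits ids for tokens the model's embedding matrix does not cover. A generic way to check, sketched here without any ChatYuan-specific assumptions:

```python
def out_of_range_ids(ids, vocab_size):
    """Return the token ids that would overflow an embedding table of
    the given size -- the usual cause of this IndexError in PyTorch."""
    return [i for i in ids if i < 0 or i >= vocab_size]

# Toy illustration with a vocabulary of 32 entries: id 32 is out of range.
print(out_of_range_ids([5, 31, 32, 7], 32))  # [32]
```

In practice you would flatten the tensor returned by the tokenizer (e.g. `input_ids.view(-1).tolist()`) and compare against `model.config.vocab_size` before calling the model.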

joytianya commented 1 year ago

Please refer to the pretraining code in the README.

zhangzai666 commented 1 year ago

Thanks a lot. Could you show a simple example of a ChatYuan unsupervised-training dataset? Which token is used to mark the masked spans?

joytianya commented 1 year ago

For details, you can refer to T5's construction rules: https://github.com/google-research/text-to-text-transfer-transformer
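For reference, T5's span-corruption objective replaces contiguous token spans in the input with sentinel tokens (`<extra_id_0>`, `<extra_id_1>`, ...) and trains the model to emit the dropped spans, each prefixed by its sentinel. A minimal sketch of the pair construction (the span positions are fixed here for illustration; real pretraining samples span locations and lengths randomly, and works on tokenizer ids rather than whitespace-split words):

```python
# T5-style sentinel tokens; T5 reserves 100 of them.
SENTINELS = [f"<extra_id_{i}>" for i in range(100)]

def span_corrupt(tokens, spans):
    """Build a (source, target) pair by replacing each given span with a
    sentinel token, T5-style. `spans` is a list of (start, length) pairs,
    assumed sorted and non-overlapping."""
    src, tgt = [], []
    cursor = 0
    for i, (start, length) in enumerate(spans):
        src.extend(tokens[cursor:start])   # keep text before the span
        src.append(SENTINELS[i])           # mask the span in the source
        tgt.append(SENTINELS[i])           # target: sentinel, then the span
        tgt.extend(tokens[start:start + length])
        cursor = start + length
    src.extend(tokens[cursor:])            # keep the tail of the sequence
    tgt.append(SENTINELS[len(spans)])      # closing sentinel ends the target
    return " ".join(src), " ".join(tgt)

tokens = "The cute dog walks in the park".split()
src, tgt = span_corrupt(tokens, [(1, 2), (5, 1)])
print(src)  # The <extra_id_0> walks in <extra_id_1> park
print(tgt)  # <extra_id_0> cute dog <extra_id_1> the <extra_id_2>
```

The source string is what goes into `input_ids` and the target string into `labels`, which matches the denoising example in the Hugging Face T5 documentation.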