dbiir / UER-py

Open Source Pre-training Model Framework in PyTorch & Pre-trained Model Zoo
https://github.com/dbiir/UER-py/wiki
Apache License 2.0

Error when loading a Hugging Face T5 model for pre-training #328

Open fade-color opened 2 years ago

fade-color commented 2 years ago

I get an error when using the mengzi-t5-base model from Hugging Face, after converting it with the conversion script:

RuntimeError: Error(s) in loading state_dict for Model:
        size mismatch for embedding.word_embedding.weight: copying a param with shape torch.Size([32128, 768]) from checkpoint, the shape in current model is torch.Size([32028, 768]).
        size mismatch for tgt_embedding.word_embedding.weight: copying a param with shape torch.Size([32128, 768]) from checkpoint, the shape in current model is torch.Size([32028, 768]).
        size mismatch for target.output_layer.weight: copying a param with shape torch.Size([32128, 768]) from checkpoint, the shape in current model is torch.Size([32028, 768]).
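
The two sizes differ by exactly 100 tokens (32128 - 32028), which matches the 100 <extra_id_*> sentinel tokens that Hugging Face's T5 tokenizer appends on top of the raw SentencePiece vocabulary, while UER appears to build its vocabulary from spiece.model / spiece.vocab alone. A minimal sketch to confirm this, assuming the sentencepiece and torch packages are installed and the paths match the command below:

# Diagnostic sketch: compare the SentencePiece vocab size with the
# converted checkpoint's embedding rows. Paths are taken from the
# command in this issue and may need adjusting.
import sentencepiece as spm
import torch

sp = spm.SentencePieceProcessor()
sp.Load("models/Langboat/mengzi-t5-base/spiece.model")
print("sentencepiece vocab size:", sp.GetPieceSize())  # expected: 32028

state_dict = torch.load("pretrained_model.bin", map_location="cpu")
print("checkpoint embedding shape:",
      state_dict["embedding.word_embedding.weight"].shape)  # expected: [32128, 768]

# 32128 - 32028 = 100, i.e. exactly the <extra_id_0> ... <extra_id_99>
# sentinel tokens added by the Hugging Face T5 tokenizer.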

Arguments:

python3 pretrain.py --dataset_path data/dataset.pt \
                    --pretrained_model_path pretrained_model.bin \
                    --spm_model_path models/Langboat/mengzi-t5-base/spiece.model \
                    --vocab_path models/Langboat/mengzi-t5-base/spiece.vocab \
                    --config_path models/t5-v1_1/base_config.json \
                    --output_model_path output_model.bin \
                    --world_size 2 --gpu_ranks 0 1 \
                    --learning_rate 3e-5 --batch_size 8 \
                    --span_masking --span_geo_prob 0.3 --span_max_length 5 \
                    --data_processor t5 \
                    --embedding word --relative_position_embedding --remove_embedding_layernorm --tgt_embedding word --share_embedding \
                    --encoder transformer --mask fully_visible --decoder transformer \
                    --layernorm_positioning pre --layernorm t5 --feed_forward gated --remove_attention_scale --remove_transformer_bias \
                    --target lm

What is the problem here? Thanks.
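
One possible workaround, shown here as a sketch rather than an official fix, is to trim the converted checkpoint's vocabulary-sized tensors down to the 32028 pieces UER derives from spiece.model. Note that this drops the rows for the 100 <extra_id_*> sentinel tokens, so it is only reasonable if your pre-training setup does not depend on their pre-trained embeddings; the output file name pretrained_model_trimmed.bin is illustrative.

# Workaround sketch: slice the three mismatched tensors from the error
# above so they match the 32028-entry vocabulary UER builds.
import torch

state_dict = torch.load("pretrained_model.bin", map_location="cpu")
for key in ("embedding.word_embedding.weight",
            "tgt_embedding.word_embedding.weight",
            "target.output_layer.weight"):
    state_dict[key] = state_dict[key][:32028]  # keep the first 32028 rows

torch.save(state_dict, "pretrained_model_trimmed.bin")

After trimming, pass --pretrained_model_path pretrained_model_trimmed.bin in the command above.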

fade-color commented 2 years ago

Is there no way to load T5's special_tokens_map.json? I only see one for loading xlmroberta's.

hhou435 commented 2 years ago

At the moment UER's sentencepiece tokenizer differs somewhat from Hugging Face's T5 tokenizer implementation. If you want to use it, you can refer to the XLMRobertaTokenizer implementation in tokenizers.py and implement a T5Tokenizer along the same lines.
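
A minimal standalone sketch of such a T5Tokenizer, assuming the vocabulary layout used by Hugging Face's T5 tokenizer (the raw SentencePiece pieces first, then the sentinels, with <extra_id_0> taking the highest id); the class and method names here are illustrative and would still need adapting to whatever interface UER's tokenizers.py expects:

import re
import sentencepiece as spm

class T5Tokenizer:
    def __init__(self, spm_model_path, extra_ids=100):
        self.sp_model = spm.SentencePieceProcessor()
        self.sp_model.Load(spm_model_path)
        # Total size matches the checkpoint: 32028 pieces + 100 sentinels = 32128.
        self.vocab_size = self.sp_model.GetPieceSize() + extra_ids

    def tokenize(self, text):
        return self.sp_model.EncodeAsPieces(text)

    def convert_token_to_id(self, token):
        match = re.fullmatch(r"<extra_id_(\d+)>", token)
        if match:  # sentinels occupy the top of the id range
            return self.vocab_size - 1 - int(match.group(1))
        return self.sp_model.PieceToId(token)

    def convert_id_to_token(self, index):
        if index >= self.sp_model.GetPieceSize():
            return "<extra_id_{}>".format(self.vocab_size - 1 - index)
        return self.sp_model.IdToPiece(index)

With the two vocabulary sizes equal (32128 on both sides), the size-mismatch errors above should disappear.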