dbiir / UER-py

Open Source Pre-training Model Framework in PyTorch & Pre-trained Model Zoo
https://github.com/dbiir/UER-py/wiki
Apache License 2.0
2.97k stars · 528 forks

pretrain RoBERTa #235

Open · skye95git opened this issue 2 years ago

skye95git commented 2 years ago
  1. I want to do incremental pre-training on an existing RoBERTa. Which RoBERTa model should I use? Can I download it directly from Hugging Face? Do I need to convert it into UER format after downloading, and is there a conversion script for that?

  2. RoBERTa's input corpus format is one document per line. I am working with a code corpus and want to turn each sample into natural language + programming language. What should I do?

https://github.com/dbiir/UER-py/blob/862526f2c256ec9d644c5fb99d9b0cbee77254f5/uer/utils/data.py#L336

Can I just change this to [CLS] + natural language + [SEP] + programming language + [EOS]?

hhou435 commented 2 years ago

1. If you want to use the model in UER format, you can download it directly from the Model Zoo: https://github.com/dbiir/UER-py/wiki/Modelzoo
2. Yes, you can customize the input by modifying the Dataset.

skye95git commented 2 years ago

> 1. If you want to use the model in UER format, you can download it directly from the Model Zoo: https://github.com/dbiir/UER-py/wiki/Modelzoo
> 2. Yes, you can customize the input by modifying the Dataset.

Thanks for your reply. I have seen the models in the Model Zoo before, but the RoBERTa models there are all trained on Chinese corpora. What I need is a RoBERTa model pre-trained on an English corpus. How do I get it?

hhou435 commented 2 years ago

You can download the model from Hugging Face and convert it to UER format with the script https://github.com/dbiir/UER-py/blob/master/scripts/convert_bert_from_huggingface_to_uer.py

skye95git commented 2 years ago

> You can download the model from Hugging Face and convert it to UER format with the script https://github.com/dbiir/UER-py/blob/master/scripts/convert_bert_from_huggingface_to_uer.py

So RoBERTa and BERT share the same script, right? Both when converting a model from UER format to Hugging Face (PyTorch) format and when converting from Hugging Face (PyTorch) format to UER format?

hhou435 commented 2 years ago

You can use --target to select BERT or RoBERTa
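
For example, a conversion command might look like the following sketch (the input/output paths are placeholders; --target mlm is confirmed below for RoBERTa, while --target bert for plain BERT checkpoints is an assumption about the script's options):

python scripts/convert_bert_from_huggingface_to_uer.py --input_model_path pytorch_model.bin \
                                                       --output_model_path uer_pytorch_model.bin \
                                                       --layers_num 12 --target mlm

For a BERT checkpoint trained with MLM + NSP, --target bert would presumably be used instead.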

skye95git commented 2 years ago

> You can use --target to select BERT or RoBERTa

If I want to use RoBERTa, can I just set --target to MLM?

hhou435 commented 2 years ago

Yes

skye95git commented 2 years ago

> Yes

How should I set --layers_num when converting the Hugging Face RoBERTa to UER? There are different choices in the examples. Should the number of Transformer layers be set according to my own task?

[screenshot]
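
For reference, --layers_num generally has to match the number of Transformer layers in the checkpoint being converted (12 for roberta-base, 24 for roberta-large), not the downstream task. Assuming the Hugging Face config.json sits next to the downloaded weights, a quick way to check is:

python -c "import json; print(json.load(open('config.json'))['num_hidden_layers'])"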
skye95git commented 2 years ago

> You can download the model from Hugging Face and convert it to UER format with the script https://github.com/dbiir/UER-py/blob/master/scripts/convert_bert_from_huggingface_to_uer.py

Hi, I have downloaded RoBERTa from Hugging Face: https://huggingface.co/roberta-base/tree/main.

[screenshot]

Then I run the script https://github.com/dbiir/UER-py/blob/master/scripts/convert_bert_from_huggingface_to_uer.py:

python scripts/convert_bert_from_huggingface_to_uer.py --input_model_path /platform_tech/linjiayi/UER/huggingface_model/RoBERTa/pytorch_model.bin \
                                                                            --output_model_path /platform_tech/linjiayi/UER/model/RoBERTa/uer_pytorch_model.bin \
                                                                            --layers_num 12 --target mlm

There is an error:

[screenshot of the error]

What should I do?

skye95git commented 2 years ago

Hi, if I want to use an English corpus to pre-train RoBERTa from scratch, which vocabulary should I use? I used the vocab.json downloaded from Hugging Face, but the instances show None:

python preprocess.py --corpus_path /platform_tech/linjiayi/dataset/codebase.jsonl --vocab_path /platform_tech/linjiayi/UER/huggingface_model/RoBERTa/vocab.json \
                      --dataset_path /platform_tech/linjiayi/UER/dataset/dataset.pt --processes_num 8 \
                      --dynamic_masking --target mlm

[screenshot]

Do I need to rebuild the vocab myself?

hhou435 commented 2 years ago

> Hi, I have downloaded RoBERTa from Hugging Face: https://huggingface.co/roberta-base/tree/main. [screenshot]
>
> Then I run the script https://github.com/dbiir/UER-py/blob/master/scripts/convert_bert_from_huggingface_to_uer.py:
>
> python scripts/convert_bert_from_huggingface_to_uer.py --input_model_path /platform_tech/linjiayi/UER/huggingface_model/RoBERTa/pytorch_model.bin \
>                                                        --output_model_path /platform_tech/linjiayi/UER/model/RoBERTa/uer_pytorch_model.bin \
>                                                        --layers_num 12 --target mlm
>
> There is an error: [screenshot of the error]
>
> What should I do?

Hello, this is because the model parameter names do not match: the checkpoint was saved with a version of Transformers that is too old. We have fixed this in the latest code.

hhou435 commented 2 years ago

> Hi, if I want to use an English corpus to pre-train RoBERTa from scratch, which vocabulary should I use? I used the vocab.json downloaded from Hugging Face, but the instances show None:
>
> python preprocess.py --corpus_path /platform_tech/linjiayi/dataset/codebase.jsonl --vocab_path /platform_tech/linjiayi/UER/huggingface_model/RoBERTa/vocab.json \
>                      --dataset_path /platform_tech/linjiayi/UER/dataset/dataset.pt --processes_num 8 \
>                      --dynamic_masking --target mlm
>
> [screenshot]
>
> Do I need to rebuild the vocab myself?

You need to use models/google_uncased_en_vocab.txt from the project files.
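
With that vocabulary, the preprocessing command quoted above would roughly become the following sketch (only --vocab_path changes; all other flags are kept from the original command):

python preprocess.py --corpus_path /platform_tech/linjiayi/dataset/codebase.jsonl --vocab_path models/google_uncased_en_vocab.txt \
                     --dataset_path /platform_tech/linjiayi/UER/dataset/dataset.pt --processes_num 8 \
                     --dynamic_masking --target mlm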

skye95git commented 2 years ago

> Hello, this is because the model parameter names do not match: the checkpoint was saved with a version of Transformers that is too old. We have fixed this in the latest code.

@hhou435 Hi, I have tried the latest code and there is still an error: [screenshot]

https://github.com/dbiir/UER-py/blob/af5232aa125ce0dc8d5d9285c6410e62ad95a298/scripts/convert_bert_from_huggingface_to_uer.py#L60

hhou435 commented 2 years ago

Hi, you can use this script: https://github.com/dbiir/UER-py/blob/master/scripts/convert_xlmroberta_from_huggingface_to_uer.py

skye95git commented 2 years ago

> Hi, you can use this script: https://github.com/dbiir/UER-py/blob/master/scripts/convert_xlmroberta_from_huggingface_to_uer.py

The script converts the model successfully, but the converted model seems a little different from RoBERTa. When I do incremental pre-training, there is an error: [screenshot]

skye95git commented 2 years ago

@hhou435 Hi, every time I train MLM with my own data, the log freezes after printing the following message:

Start slurm job at Thu 23 Dec 2021 02:41:12 PM CST
Using distributed mode for training.
[W reducer.cpp:1158] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1158] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1158] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1158] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1158] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1158] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1158] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1158] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())

It would be hours before the pre-training logs were printed: [screenshot]

The GPU was definitely being utilized during this time: [screenshot]

The pre-training data was processed in advance (dataset.pt). I wonder what the program was doing while the log stopped printing for a few hours. Loading data?

skye95git commented 2 years ago

The pre-training loss is shown below: [screenshot] Can this loss be regarded as having converged?

hhou435 commented 2 years ago

> Hi, you can use this script: https://github.com/dbiir/UER-py/blob/master/scripts/convert_xlmroberta_from_huggingface_to_uer.py
>
> The script converts the model successfully, but the converted model seems a little different from RoBERTa. When I do incremental pre-training, there is an error: [screenshot]

Hello, we found that the English RoBERTa uploaded to Hugging Face requires the BPE tokenizer, which we now support. You need to specify --tokenizer bpe --vocab_path huggingface_gpt2_vocab.txt --merges_path huggingface_gpt2_merges.txt during pre-processing and pre-training.
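
Applied to the earlier preprocessing command, that would look roughly like this sketch (the vocab and merges file names are taken verbatim from the reply above; depending on the checkout they may need a models/ prefix):

python preprocess.py --corpus_path /platform_tech/linjiayi/dataset/codebase.jsonl \
                     --tokenizer bpe --vocab_path huggingface_gpt2_vocab.txt --merges_path huggingface_gpt2_merges.txt \
                     --dataset_path /platform_tech/linjiayi/UER/dataset/dataset.pt --processes_num 8 \
                     --dynamic_masking --target mlm

The same --tokenizer, --vocab_path and --merges_path flags would then also be passed to pretrain.py.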

hhou435 commented 2 years ago

> @hhou435 Hi, every time I train MLM with my own data, the log freezes after printing the following message:
>
> Start slurm job at Thu 23 Dec 2021 02:41:12 PM CST
> Using distributed mode for training.
> [W reducer.cpp:1158] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. [...] (function operator())
> [the same warning repeated for each of the remaining ranks]
>
> It would be hours before the pre-training logs were printed: [screenshot]
>
> The GPU was definitely being utilized during this time: [screenshot]
>
> The pre-training data was processed in advance (dataset.pt). I wonder what the program was doing while the log stopped printing for a few hours. Loading data?

Loading the data should not take a few hours; maybe your log output is being held in the print buffer?
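
If the output really is sitting in the print buffer, forcing unbuffered output usually makes the logs appear immediately; this is standard Python behaviour rather than anything UER-specific:

export PYTHONUNBUFFERED=1    # flush Python stdout/stderr as soon as it is written
# or launch the training script with the -u flag, e.g. python -u pretrain.py <usual arguments>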

hhou435 commented 2 years ago

> The pre-training loss is shown below: [screenshot] Can this loss be regarded as having converged?

Have you shuffled the training data? The loss curve fluctuates a lot.
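
If the corpus was written out one language after another, a line-level shuffle before running preprocess.py is a simple way to mix the samples (a sketch using the GNU coreutils shuf tool; codebase.jsonl is the one-sample-per-line corpus from the earlier commands):

shuf /platform_tech/linjiayi/dataset/codebase.jsonl > /platform_tech/linjiayi/dataset/codebase_shuffled.jsonl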

skye95git commented 2 years ago

> Hi, you can use this script: https://github.com/dbiir/UER-py/blob/master/scripts/convert_xlmroberta_from_huggingface_to_uer.py
>
> The script converts the model successfully, but the converted model seems a little different from RoBERTa. When I do incremental pre-training, there is an error: [screenshot]
>
> Hello, we found that the English RoBERTa uploaded to Hugging Face requires the BPE tokenizer, which we now support. You need to specify --tokenizer bpe --vocab_path huggingface_gpt2_vocab.txt --merges_path huggingface_gpt2_merges.txt during pre-processing and pre-training.

Do I still use this script https://github.com/dbiir/UER-py/blob/master/scripts/convert_xlmroberta_from_huggingface_to_uer.py?

skye95git commented 2 years ago

> The pre-training loss is shown below: [screenshot] Can this loss be regarded as having converged?
>
> Have you shuffled the training data? The loss curve fluctuates a lot.

I used the CodeSearchNet dataset for six programming languages and did not shuffle or otherwise modify the data.

Here is the latest pre-training curve; can this loss be regarded as having converged? [screenshot]

skye95git commented 2 years ago

> Hello, we found that the English RoBERTa uploaded to Hugging Face requires the BPE tokenizer, which we now support. You need to specify --tokenizer bpe --vocab_path huggingface_gpt2_vocab.txt --merges_path huggingface_gpt2_merges.txt during pre-processing and pre-training.

Hi, I used convert_xlmroberta_from_huggingface_to_uer.py to convert RoBERTa. When I do incremental pre-training with the BPE tokenizer, there is an error:

RuntimeError: Error(s) in loading state_dict for Model:
    size mismatch for embedding.position_embedding.weight: copying a param with shape torch.Size([514, 768]) from checkpoint, the shape in current model is torch.Size([512, 768]).

hhou435 commented 2 years ago

> Hello, we found that the English RoBERTa uploaded to Hugging Face requires the BPE tokenizer, which we now support. You need to specify --tokenizer bpe --vocab_path huggingface_gpt2_vocab.txt --merges_path huggingface_gpt2_merges.txt during pre-processing and pre-training.
>
> Hi, I used convert_xlmroberta_from_huggingface_to_uer.py to convert RoBERTa. When I do incremental pre-training with the BPE tokenizer, there is an error:
>
> RuntimeError: Error(s) in loading state_dict for Model:
>     size mismatch for embedding.position_embedding.weight: copying a param with shape torch.Size([514, 768]) from checkpoint, the shape in current model is torch.Size([512, 768]).

Because the structure of the RoBERTa model uploaded to Hugging Face is different from that of the BERT model, the weights do not match. You can use the configuration of the XLM-RoBERTa model; its structure is exactly the same as the RoBERTa model uploaded to Hugging Face. When using it, you need to replace the file here with models/xlmroberta_special_tokens_map.json.
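
A minimal sketch of that last step, assuming the default map UER-py reads is models/special_tokens_map.json (an assumption; check which file the linked location actually loads):

cp models/special_tokens_map.json models/special_tokens_map.json.bak    # keep a backup of the default map
cp models/xlmroberta_special_tokens_map.json models/special_tokens_map.json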