skye95git opened this issue 2 years ago
1. If you want to use a model in UER format, you can download it directly from https://github.com/dbiir/UER-py/wiki/Modelzoo
2. Yes, you can customize the input by modifying the Dataset class.
Thanks for your reply. I've seen the models in the Model Zoo before, but the RoBERTa models there are all trained on Chinese corpora. What I need is a RoBERTa model pre-trained on an English corpus. How do I get one?
You can download the model from Hugging Face and convert it to UER format with the script https://github.com/dbiir/UER-py/blob/master/scripts/convert_bert_from_huggingface_to_uer.py
So RoBERTa and BERT share the same script, right? Whether converting a model from UER format to Hugging Face (PyTorch) format or the other way around.
You can use --target to select BERT or RoBERTa.
If I want to use RoBERTa, can I just set --target to mlm?
Yes
How should I set --layers_num when converting the Hugging Face RoBERTa to UER? The examples use different values. Should the number of Transformer layers be set according to my own task?
Hi, I have downloaded RoBERTa from Hugging Face (https://huggingface.co/roberta-base/tree/main) and ran the script https://github.com/dbiir/UER-py/blob/master/scripts/convert_bert_from_huggingface_to_uer.py:
python scripts/convert_bert_from_huggingface_to_uer.py --input_model_path /platform_tech/linjiayi/UER/huggingface_model/RoBERTa/pytorch_model.bin \
--output_model_path /platform_tech/linjiayi/UER/model/RoBERTa/uer_pytorch_model.bin \
--layers_num 12 --target mlm
There is an error:
What should I do?
Hi, if I want to pre-train RoBERTa from scratch on an English corpus, which vocabulary should I use? I used the vocab.json downloaded from Hugging Face, but the instances show None:
python preprocess.py --corpus_path /platform_tech/linjiayi/dataset/codebase.jsonl --vocab_path /platform_tech/linjiayi/UER/huggingface_model/RoBERTa/vocab.json \
--dataset_path /platform_tech/linjiayi/UER/dataset/dataset.pt --processes_num 8 \
--dynamic_masking --target mlm
Do I need to rebuild the vocab myself?
Hello, this happens because the parameter names do not match: the model was exported with too old a version of Transformers. We have fixed this in the latest code.
You need to use models/google_uncased_en_vocab.txt from the project files.
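Putting this answer together with the earlier command, the preprocessing call would look something like this (a sketch only: the corpus and output paths are illustrative, and the vocab path assumes a standard checkout of the UER-py repository):

```shell
# Sketch of preprocessing an English corpus with UER-py's bundled vocab.
# Paths other than models/google_uncased_en_vocab.txt are illustrative.
python preprocess.py \
    --corpus_path corpora/my_english_corpus.txt \
    --vocab_path models/google_uncased_en_vocab.txt \
    --dataset_path dataset.pt \
    --processes_num 8 \
    --dynamic_masking --target mlm
```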
@hhou435 Hi, I have tried the latest code, but there is still an error:
Hi, you can use this script: https://github.com/dbiir/UER-py/blob/master/scripts/convert_xlmroberta_from_huggingface_to_uer.py
The script converts the model successfully, but the converted model seems a little different from RoBERTa. When I do incremental pre-training, there is an error:
@hhou435 Hi, every time I train MLM with my own data, the log freezes after printing the following message:
Start slurm job at Thu 23 Dec 2021 02:41:12 PM CST
Using distributed mode for training.
[W reducer.cpp:1158] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
(the same warning is printed once per worker process)
It would be hours before the pre-training logs were printed:
The GPU was definitely being utilized during this time:
The pre-training data was processed in advance (dataset.pt). I wonder what the program was doing while the log stopped printing for a few hours. Loading data?
The pre-training loss is shown below. Can this loss curve be regarded as having converged?
Hello, we found that the English RoBERTa uploaded to Hugging Face requires the BPE tokenizer, which we now support. You need to specify --tokenizer bpe --vocab_path huggingface_gpt2_vocab.txt --merges_path huggingface_gpt2_merges.txt during pre-processing and pre-training.
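Applied to the earlier preprocess call, the flags from this reply would combine like so (a sketch: the corpus and output paths are illustrative, and the vocab/merges files are assumed to sit in the working directory):

```shell
# Sketch of preprocessing with the BPE tokenizer, per the maintainer's flags.
# Corpus and dataset paths are illustrative.
python preprocess.py \
    --corpus_path corpora/my_corpus.txt \
    --tokenizer bpe \
    --vocab_path huggingface_gpt2_vocab.txt \
    --merges_path huggingface_gpt2_merges.txt \
    --dataset_path dataset.pt \
    --processes_num 8 \
    --dynamic_masking --target mlm
```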
Loading the data should not take a few hours; maybe your log output is stuck in the print buffer.
Have you shuffled the training data? The loss curve fluctuates a lot.
Do I still use this script https://github.com/dbiir/UER-py/blob/master/scripts/convert_xlmroberta_from_huggingface_to_uer.py?
I used the CodeSearchNet dataset for six programming languages and didn't shuffle the data.
Here's the latest pre-training curve; can this loss be regarded as converged?
Hi, I used convert_xlmroberta_from_huggingface_to_uer.py to convert RoBERTa. When I do incremental pre-training with the BPE tokenizer, there is an error:
RuntimeError: Error(s) in loading state_dict for Model:
size mismatch for embedding.position_embedding.weight: copying a param with shape torch.Size([514, 768]) from checkpoint, the shape in current model is torch.Size([512, 768]).
Because the structure of the RoBERTa model uploaded to Hugging Face differs from that of the BERT model, the weights do not match. You can use the xlmroberta model configuration; its structure is exactly the same as the Hugging Face RoBERTa model. When using it, you also need to replace the special-tokens map file with models/xlmroberta_special_tokens_map.json.
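The 514-vs-512 mismatch is explained by how Hugging Face's RoBERTa counts positions: it reserves embedding slots for the padding index, so max_position_embeddings is the usable sequence length plus padding_idx plus one. A quick arithmetic check using roberta-base's values:

```python
# Hugging Face roberta-base reserves position slots for the padding index,
# so its position-embedding table has more rows than the 512-token limit.
max_seq_len = 512   # usable sequence length
padding_idx = 1     # RoBERTa's pad token id
hf_positions = max_seq_len + padding_idx + 1
print(hf_positions)  # 514 -- matches the checkpoint shape [514, 768]
```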
I want to do incremental pre-training on an existing RoBERTa. Which RoBERTa model should I use? Can I download it directly from Hugging Face? Do I need to convert it to UER format after downloading, and is there a conversion script if so?
RoBERTa's input corpus format is one document per line. I am working with a code corpus and want each sample to be natural language + program language. What should I do? https://github.com/dbiir/UER-py/blob/862526f2c256ec9d644c5fb99d9b0cbee77254f5/uer/utils/data.py#L336
Can I just change this to [CLS] + natural language + [SEP] + program language + [EOS]?
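A minimal sketch of the idea (the helper name and token handling here are hypothetical, not UER-py's actual API; in practice you would edit the instance-building loop at the linked line in uer/utils/data.py, and UER typically reuses [SEP] rather than a separate [EOS] token):

```python
# Hypothetical sketch: build a two-segment instance of the form
# [CLS] + natural language + [SEP] + program language + [SEP].
CLS, SEP = "[CLS]", "[SEP]"

def build_instance(nl_tokens, pl_tokens):
    src = [CLS] + nl_tokens + [SEP] + pl_tokens + [SEP]
    # Segment ids: 1 covers [CLS]..first [SEP], 2 covers the code and final [SEP].
    seg = [1] * (len(nl_tokens) + 2) + [2] * (len(pl_tokens) + 1)
    return src, seg

src, seg = build_instance(["sort", "a", "list"], ["def", "sort", "(", "x", ")"])
# src: ['[CLS]', 'sort', 'a', 'list', '[SEP]', 'def', 'sort', '(', 'x', ')', '[SEP]']
```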