PaddlePaddle / ERNIE

Official implementations for various pre-training models of ERNIE-family, covering topics of Language Understanding & Generation, Multimodal Understanding & Generation, and beyond.

ernie-doc pre-training script fails to run, please assist. #762

Closed gaoyuan211 closed 2 years ago

gaoyuan211 commented 3 years ago

I have set up the environment and run the script up to the point shown in the two screenshots, but cannot get any further. Please assist, thank you very much! Script executed: sudo bash scripts/run_dureader.sh

The output log is:

----------- Configuration Arguments -----------
current_node_ip: 127.0.1.1
node_id: 0
node_ips: 127.0.1.1
nproc_per_node: 4
print_config: True
selected_gpus: 0,1,2,3
split_log_path: log
training_script: run_mrc.py
training_script_args: ['--use_cuda', 'true', '--is_distributed', 'true', '--batch_size', '16', '--in_tokens', 'false', '--use_fast_executor', 'true', '--checkpoints', './output', '--vocab_path', './configs/base/zh/vocab.txt', '--do_train', 'true', '--do_val', 'true', '--do_test', 'false', '--save_steps', '10000', '--validation_steps', '100', '--warmup_proportion', '0.1', '--weight_decay', '0.01', '--epoch', '5', '--max_seq_len', '512', '--ernie_config_path', './configs/base/zh/ernie_config.json', '--do_lower_case', 'true', '--doc_stride', '128', '--train_set', './data/finetune/task_data/dureader//train.json', '--dev_set', './data/finetune/task_data/dureader//dev.json', '--test_set', './data/finetune/task_data/dureader//test.json', '--learning_rate', '2.75e-4', '--num_iteration_per_drop_scope', '1', '--lr_scheduler', 'linear_warmup_decay', '--layer_decay_ratio', '0.8', '--is_zh', 'True', '--repeat_input', 'False', '--train_all', 'Fasle', '--eval_all', 'False', '--use_vars', 'False', '--init_checkpoint', '', '--init_pretraining_params', '', '--init_loss_scaling', '32768', '--use_recompute', 'False', '--skip_steps', '10']

all_trainer_endpoints: 127.0.1.1:6170,127.0.1.1:6171,127.0.1.1:6172,127.0.1.1:6173 , node_id: 0 , current_ip: 127.0.1.1 , num_nodes: 1 , node_ips: ['127.0.1.1'] , gpus_per_proc: 1 , selected_gpus_per_proc: [['0'], ['1'], ['2'], ['3']] , nranks: 4
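For reference, the GPU/endpoint split printed in the launcher log above can be reproduced with a small sketch. The arithmetic here is inferred from the log itself, not taken from the launcher's actual code, and the starting port 6170 is an assumption read off the endpoint list:

```python
# Sketch of how the launcher log values follow from the launch arguments.
# All names below are local to this sketch; only the input values come
# from the log above.
selected_gpus = "0,1,2,3".split(",")
nproc_per_node = 4
node_ips = ["127.0.1.1"]
started_port = 6170  # assumed starting port, read off the endpoint list

# One GPU per process: 4 GPUs / 4 processes.
gpus_per_proc = len(selected_gpus) // nproc_per_node
selected_gpus_per_proc = [
    selected_gpus[i * gpus_per_proc:(i + 1) * gpus_per_proc]
    for i in range(nproc_per_node)
]

# One endpoint per process per node, consecutive ports.
endpoints = ",".join(
    "%s:%d" % (ip, started_port + i)
    for ip in node_ips
    for i in range(nproc_per_node)
)
nranks = len(node_ips) * nproc_per_node

print(selected_gpus_per_proc)  # [['0'], ['1'], ['2'], ['3']]
print(endpoints)  # 127.0.1.1:6170,127.0.1.1:6171,127.0.1.1:6172,127.0.1.1:6173
print(nranks)  # 4
```

This matches the `selected_gpus_per_proc`, `all_trainer_endpoints`, and `nranks` values in the log line above.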

gaoyuan211 commented 3 years ago

The full shell output of running the script is as follows: sudo bash scripts/run_dureader.sh

gaoyuan211 commented 3 years ago

job.0 log output:

/usr/lib/python3/dist-packages/urllib3/util/selectors.py:14: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.9 it will stop working
  from collections import namedtuple, Mapping
/usr/lib/python3/dist-packages/urllib3/_collections.py:2: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.9 it will stop working
  from collections import Mapping, MutableMapping
/usr/lib/python3/dist-packages/setuptools/depends.py:2: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
----------- Configuration Arguments -----------
batch_size: 16
checkpoints: ./output
decr_every_n_nan_or_inf: 2
decr_ratio: 0.8
dev_set: ./data/finetune/task_data/dureader//dev.json
do_lower_case: True
do_test: False
do_train: True
do_val: True
doc_stride: 128
epoch: 5
ernie_config_path: ./configs/base/zh/ernie_config.json
eval_all: False
for_cn: True
in_tokens: False
incr_every_n_steps: 100
incr_ratio: 5.0
init_checkpoint:
init_loss_scaling: 32768.0
init_pretraining_params:
is_distributed: True
is_zh: True
label_map_config: None
layer_decay_ratio: 0.8
learning_rate: 0.000275
lr_scheduler: linear_warmup_decay
max_answer_length: 100
max_query_length: 64
max_seq_len: 512
metrics: True
n_best_size: 20
num_iteration_per_drop_scope: 1
num_labels: 2
random_seed: 0
rel_pos_params_sharing: False
repeat_input: False
save_steps: 10000
skip_steps: 10
stream_job: None
test_set: ./data/finetune/task_data/dureader//test.json
tokenizer: FullTokenizer
train_all: False
train_set: ./data/finetune/task_data/dureader//train.json
use_amp: False
use_cuda: True
use_dynamic_loss_scaling: False
use_fast_executor: True
use_recompute: False
use_vars: False
validation_steps: 100
verbose: False
vocab_path: ./configs/base/zh/vocab.txt
warmup_proportion: 0.1
weight_decay: 0.01
weight_sharing: True

finetuning start
attention_probs_dropout_prob: 0.1
epsilon: 1e-12
hidden_act: gelu
hidden_dropout_prob: 0.1
hidden_size: 768
initializer_range: 0.02
max_position_embeddings: 512
memory_len: 128
num_attention_heads: 12
num_hidden_layers: 12
sent_type_vocab_size: 4
task_type_vocab_size: 3
vocab_size: 28000

args.is_distributed: True
worker_endpoints: ['127.0.1.1:6170'] trainers_num: 1 current_endpoint: 127.0.1.1:6170 trainer_id: 0
Device count 1, trainer_id: 0
args.vocab_path ./configs/base/zh/vocab.txt
Traceback (most recent call last):
  File "run_mrc.py", line 322, in <module>
    main(args)
  File "run_mrc.py", line 131, in main
    phase="train")
  File "/home/ubuntu/ERNIE/ernie-doc/reader/task_reader.py", line 683, in data_generator
    examples, features = self._pre_process_data(phase, input_file)
  File "/home/ubuntu/ERNIE/ernie-doc/reader/task_reader.py", line 663, in _pre_process_data
    assert os.path.exists(data_path), "%s is not exist !" % self.config.data_path
AttributeError: 'MRCReader' object has no attribute 'config'
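Note what the traceback is really saying: the `os.path.exists(data_path)` check failed (the file at `./data/finetune/task_data/dureader//train.json` is missing), and while formatting the assertion message the code reads `self.config.data_path`, an attribute `MRCReader` does not have. The `AttributeError` therefore masks the intended `AssertionError`. A minimal sketch of the pattern and an obvious local fix (hypothetical stripped-down class, mirroring only line 663 of `task_reader.py`):

```python
import os


class MRCReader:
    """Stripped-down sketch of the reader; only the failing check is shown."""

    def _pre_process_data(self, phase, data_path):
        # Buggy original (commented out): the message references
        # self.config.data_path, which does not exist on this class, so a
        # missing file raises AttributeError instead of AssertionError:
        #   assert os.path.exists(data_path), "%s is not exist !" % self.config.data_path
        # Fix: report the path that was actually checked.
        assert os.path.exists(data_path), "%s does not exist!" % data_path
```

With the fix, a missing data file fails with a clear message naming the path, which points directly at the real problem: the expected `train.json` was never created.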

gaoyuan211 commented 3 years ago

test.json: the training data specified in the log has a .json suffix, but the README's preprocessing step generates .txt files. These seem to conflict, and I don't know how to resolve it.

gaoyuan211 commented 3 years ago

Download the official data:

http://ai.stanford.edu/~amaas/data/sentiment/index.html

Run the preprocessing script:

python multi_files_to_one.py # this will generate train/test txt

This generates train.txt and test.txt in that folder.

gaoyuan211 commented 3 years ago

The file paths from the log are:

stream_job: None
test_set: ./data/imdb//test.json
tokenizer: FullTokenizer
train_all: False
train_set: ./data/imdb//train.json
use_amp: False
use_cuda: True
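So the script expects `./data/imdb/train.json` and `./data/imdb/test.json`, while the README's preprocessing step produces `train.txt` and `test.txt`. Before guessing at a conversion, a quick diagnostic can confirm which files actually exist; this is my own helper, not part of the repo:

```python
import os


def check_data_files(data_dir):
    """Report which candidate input files exist under data_dir.

    The .json names come from the training log above; the .txt names are
    what the README's multi_files_to_one.py step produces.
    """
    status = {}
    for stem in ("train", "test"):
        for ext in (".json", ".txt"):
            status[stem + ext] = os.path.exists(os.path.join(data_dir, stem + ext))
    return status


if __name__ == "__main__":
    for name, exists in sorted(check_data_files("./data/imdb").items()):
        print(name, "exists" if exists else "missing")
```

If only the .txt files exist, the options are either to point the script's `--train_set`/`--test_set` arguments at the .txt files (if the task reader accepts that format, which would need checking) or to convert the .txt output into whatever JSON schema the reader expects.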

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Feel free to reopen it. Thank you for your contributions.