PaddlePaddle / PaddleNLP


[Question]: UIE model fine-tuning with finetune.py fails #4074

Closed · ping40 closed this issue 1 year ago

ping40 commented 1 year ago

Please describe your question

Brief description of the problem:

Following the training customization guide at https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/uie#%E8%AE%AD%E7%BB%83%E5%AE%9A%E5%88%B6, running the model fine-tuning step raises the following exception:

Traceback (most recent call last):
  File "finetune.py", line 245, in <module>
    main()
  File "finetune.py", line 184, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/ping400/anaconda3/envs/my_paddlenlp/lib/python3.7/site-packages/paddlenlp/trainer/trainer.py", line 614, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/home/ping400/anaconda3/envs/my_paddlenlp/lib/python3.7/site-packages/paddlenlp/trainer/trainer.py", line 1253, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/ping400/anaconda3/envs/my_paddlenlp/lib/python3.7/site-packages/paddlenlp/trainer/trainer.py", line 1215, in compute_loss
    outputs = model(**inputs)
  File "/home/ping400/anaconda3/envs/my_paddlenlp/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 948, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/ping400/anaconda3/envs/my_paddlenlp/lib/python3.7/site-packages/paddle/fluid/dygraph/parallel.py", line 774, in forward
    outputs = self._layers(*inputs, **kwargs)
  File "/home/ping400/anaconda3/envs/my_paddlenlp/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 948, in __call__
    return self.forward(*inputs, **kwargs)
TypeError: forward() got an unexpected keyword argument 'pos_ids'
I1210 16:53:00.477530 40141 tcp_store.cc:257] receive shutdown event and so quit from MasterDaemon run loop
LAUNCH INFO 2022-12-10 16:53:01,906 Exit code 1
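
The traceback shows that the batch passed to the model contains a key named pos_ids that the installed model's forward() does not accept, which points to a mismatch between the develop-branch scripts and the pip-installed paddlenlp wheel. Below is a minimal diagnostic sketch (hypothetical, not part of finetune.py; it assumes the installed paddlenlp exposes the UIE class that the script loads, and that the wheel expects position_ids instead):

import inspect

from paddlenlp.transformers import UIE

# Load the same checkpoint that finetune.py uses and list the keyword
# arguments its forward() actually accepts.
model = UIE.from_pretrained("uie-base")
accepted = set(inspect.signature(model.forward).parameters)
print("forward() accepts:", sorted(accepted))

def align_batch_keys(batch):
    # Rename 'pos_ids' to 'position_ids' when the installed model rejects
    # 'pos_ids' (the situation reported in the traceback above); drop the
    # key entirely if neither name is accepted.
    if "pos_ids" in batch and "pos_ids" not in accepted:
        if "position_ids" in accepted:
            batch["position_ids"] = batch.pop("pos_ids")
        else:
            batch.pop("pos_ids")
    return batch

print(align_batch_keys({"input_ids": [1, 2, 3], "pos_ids": [0, 1, 2]}))

Renaming the key would only be a stopgap; the real fix, as the discussion below confirms, is to keep the scripts and the installed wheel on the same release line.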

------------------------------------------ Detailed information below ------------------------------------------

Error output:

(my_paddlenlp) [ping400@localhost uie]$ python -u -m paddle.distributed.launch --gpus "0,1" finetune.py     --device gpu     --logging_steps 10     --save_steps 100     --eval_steps 100     --seed 42     --model_name_or_path uie-base     --output_dir $finetuned_model     --train_path data/train.txt     --dev_path data/dev.txt      --max_seq_length 512      --per_device_eval_batch_size 16     --per_device_train_batch_size  16     --num_train_epochs 100     --learning_rate 1e-5     --do_train     --do_eval     --do_export     --export_model_dir $finetuned_model     --label_names 'start_positions' 'end_positions'     --overwrite_output_dir     --disable_tqdm True     --metric_for_best_model eval_f1     --load_best_model_at_end  True     --save_total_limit 1

LAUNCH INFO 2022-12-10 16:52:47,836 -----------  Configuration  ----------------------
LAUNCH INFO 2022-12-10 16:52:47,837 devices: 0,1
LAUNCH INFO 2022-12-10 16:52:47,837 elastic_level: -1
LAUNCH INFO 2022-12-10 16:52:47,837 elastic_timeout: 30
LAUNCH INFO 2022-12-10 16:52:47,837 gloo_port: 6767
LAUNCH INFO 2022-12-10 16:52:47,837 host: None
LAUNCH INFO 2022-12-10 16:52:47,837 ips: None
LAUNCH INFO 2022-12-10 16:52:47,837 job_id: default
LAUNCH INFO 2022-12-10 16:52:47,837 legacy: False
LAUNCH INFO 2022-12-10 16:52:47,837 log_dir: log
LAUNCH INFO 2022-12-10 16:52:47,837 log_level: INFO
LAUNCH INFO 2022-12-10 16:52:47,837 master: None
LAUNCH INFO 2022-12-10 16:52:47,837 max_restart: 3
LAUNCH INFO 2022-12-10 16:52:47,837 nnodes: 1
LAUNCH INFO 2022-12-10 16:52:47,838 nproc_per_node: None
LAUNCH INFO 2022-12-10 16:52:47,838 rank: -1
LAUNCH INFO 2022-12-10 16:52:47,838 run_mode: collective
LAUNCH INFO 2022-12-10 16:52:47,838 server_num: None
LAUNCH INFO 2022-12-10 16:52:47,838 servers:
LAUNCH INFO 2022-12-10 16:52:47,838 start_port: 6070
LAUNCH INFO 2022-12-10 16:52:47,838 trainer_num: None
LAUNCH INFO 2022-12-10 16:52:47,838 trainers:
LAUNCH INFO 2022-12-10 16:52:47,838 training_script: finetune.py
LAUNCH INFO 2022-12-10 16:52:47,838 training_script_args: ['--device', 'gpu', '--logging_steps', '10', '--save_steps', '100', '--eval_steps', '100', '--seed', '42', '--model_name_or_path', 'uie-base', '--output_dir', './checkpoint/model_best', '--train_path', 'data/train.txt', '--dev_path', 'data/dev.txt', '--max_seq_length', '512', '--per_device_eval_batch_size', '16', '--per_device_train_batch_size', '16', '--num_train_epochs', '100', '--learning_rate', '1e-5', '--do_train', '--do_eval', '--do_export', '--export_model_dir', './checkpoint/model_best', '--label_names', 'start_positions', 'end_positions', '--overwrite_output_dir', '--disable_tqdm', 'True', '--metric_for_best_model', 'eval_f1', '--load_best_model_at_end', 'True', '--save_total_limit', '1']
LAUNCH INFO 2022-12-10 16:52:47,838 with_gloo: 1
LAUNCH INFO 2022-12-10 16:52:47,838 --------------------------------------------------
LAUNCH INFO 2022-12-10 16:52:47,839 Job: default, mode collective, replicas 1[1:1], elastic False
LAUNCH INFO 2022-12-10 16:52:47,849 Run Pod: fgpevr, replicas 2, status ready
LAUNCH INFO 2022-12-10 16:52:47,884 Watching Pod: fgpevr, replicas 2, status running
/home/ping400/anaconda3/envs/my_paddlenlp/lib/python3.7/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")
[2022-12-10 16:52:50,785] [ WARNING] - evaluation_strategy reset to IntervalStrategy.STEPS for do_eval is True. you can also set evaluation_strategy='epoch'.
[2022-12-10 16:52:50,785] [    INFO] - The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
[2022-12-10 16:52:50,785] [    INFO] - ============================================================
[2022-12-10 16:52:50,785] [    INFO] -      Model Configuration Arguments
[2022-12-10 16:52:50,786] [    INFO] - paddle commit id              :4743cc8b9a8d77ea47e08a42a16246b538bda56f
[2022-12-10 16:52:50,786] [    INFO] - export_model_dir              :./checkpoint/model_best
[2022-12-10 16:52:50,786] [    INFO] - model_name_or_path            :uie-base
[2022-12-10 16:52:50,786] [    INFO] - multilingual                  :False
[2022-12-10 16:52:50,786] [    INFO] -
[2022-12-10 16:52:50,786] [    INFO] - ============================================================
[2022-12-10 16:52:50,786] [    INFO] -       Data Configuration Arguments
[2022-12-10 16:52:50,786] [    INFO] - paddle commit id              :4743cc8b9a8d77ea47e08a42a16246b538bda56f
[2022-12-10 16:52:50,786] [    INFO] - dev_path                      :data/dev.txt
[2022-12-10 16:52:50,786] [    INFO] - max_seq_length                :512
[2022-12-10 16:52:50,786] [    INFO] - train_path                    :data/train.txt
[2022-12-10 16:52:50,786] [    INFO] -
I1210 16:52:50.787261 40063 tcp_utils.cc:181] The server starts to listen on IP_ANY:36395
I1210 16:52:50.787516 40063 tcp_utils.cc:130] Successfully connected to 127.0.0.1:36395
W1210 16:52:56.094343 40063 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2
W1210 16:52:56.099543 40063 gpu_resources.cc:91] device: 0, cuDNN Version: 7.6.
[2022-12-10 16:52:56,736] [ WARNING] - Process rank: 0, device: gpu, world_size: 2, distributed training: True, 16-bits training: False
[2022-12-10 16:52:56,737] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load 'uie-base'.
[2022-12-10 16:52:56,737] [    INFO] - Already cached /home/ping400/.paddlenlp/models/uie-base/ernie_3.0_base_zh_vocab.txt
[2022-12-10 16:52:56,769] [    INFO] - tokenizer config file saved in /home/ping400/.paddlenlp/models/uie-base/tokenizer_config.json
[2022-12-10 16:52:56,770] [    INFO] - Special tokens file saved in /home/ping400/.paddlenlp/models/uie-base/special_tokens_map.json
[2022-12-10 16:52:56,770] [    INFO] - Already cached /home/ping400/.paddlenlp/models/uie-base/uie_base.pdparams
[2022-12-10 16:52:58,546] [    INFO] - ============================================================
[2022-12-10 16:52:58,547] [    INFO] -     Training Configuration Arguments
[2022-12-10 16:52:58,547] [    INFO] - paddle commit id              :4743cc8b9a8d77ea47e08a42a16246b538bda56f
[2022-12-10 16:52:58,547] [    INFO] - _no_sync_in_gradient_accumulation:True
[2022-12-10 16:52:58,547] [    INFO] - activation_quantize_type      :None
[2022-12-10 16:52:58,547] [    INFO] - adam_beta1                    :0.9
[2022-12-10 16:52:58,547] [    INFO] - adam_beta2                    :0.999
[2022-12-10 16:52:58,547] [    INFO] - adam_epsilon                  :1e-08
[2022-12-10 16:52:58,547] [    INFO] - algo_list                     :None
[2022-12-10 16:52:58,547] [    INFO] - batch_num_list                :None
[2022-12-10 16:52:58,548] [    INFO] - batch_size_list               :None
[2022-12-10 16:52:58,548] [    INFO] - bf16                          :False
[2022-12-10 16:52:58,548] [    INFO] - bf16_full_eval                :False
[2022-12-10 16:52:58,548] [    INFO] - bias_correction               :False
[2022-12-10 16:52:58,548] [    INFO] - current_device                :gpu:0
[2022-12-10 16:52:58,548] [    INFO] - dataloader_drop_last          :False
[2022-12-10 16:52:58,548] [    INFO] - dataloader_num_workers        :0
[2022-12-10 16:52:58,548] [    INFO] - device                        :gpu
[2022-12-10 16:52:58,548] [    INFO] - disable_tqdm                  :True
[2022-12-10 16:52:58,548] [    INFO] - do_compress                   :False
[2022-12-10 16:52:58,548] [    INFO] - do_eval                       :True
[2022-12-10 16:52:58,549] [    INFO] - do_export                     :True
[2022-12-10 16:52:58,549] [    INFO] - do_predict                    :False
[2022-12-10 16:52:58,549] [    INFO] - do_train                      :True
[2022-12-10 16:52:58,549] [    INFO] - eval_batch_size               :16
[2022-12-10 16:52:58,549] [    INFO] - eval_steps                    :100
[2022-12-10 16:52:58,549] [    INFO] - evaluation_strategy           :IntervalStrategy.STEPS
[2022-12-10 16:52:58,549] [    INFO] - fp16                          :False
[2022-12-10 16:52:58,549] [    INFO] - fp16_full_eval                :False
[2022-12-10 16:52:58,549] [    INFO] - fp16_opt_level                :O1
[2022-12-10 16:52:58,549] [    INFO] - gradient_accumulation_steps   :1
[2022-12-10 16:52:58,549] [    INFO] - greater_is_better             :True
[2022-12-10 16:52:58,550] [    INFO] - ignore_data_skip              :False
[2022-12-10 16:52:58,550] [    INFO] - input_infer_model_path        :None
[2022-12-10 16:52:58,550] [    INFO] - label_names                   :['start_positions', 'end_positions']
[2022-12-10 16:52:58,550] [    INFO] - learning_rate                 :1e-05
[2022-12-10 16:52:58,550] [    INFO] - load_best_model_at_end        :True
[2022-12-10 16:52:58,550] [    INFO] - local_process_index           :0
[2022-12-10 16:52:58,550] [    INFO] - local_rank                    :0
[2022-12-10 16:52:58,550] [    INFO] - log_level                     :-1
[2022-12-10 16:52:58,550] [    INFO] - log_level_replica             :-1
[2022-12-10 16:52:58,550] [    INFO] - log_on_each_node              :True
[2022-12-10 16:52:58,550] [    INFO] - logging_dir                   :./checkpoint/model_best/runs/Dec10_16-52-50_localhost.localdomain
[2022-12-10 16:52:58,550] [    INFO] - logging_first_step            :False
[2022-12-10 16:52:58,551] [    INFO] - logging_steps                 :10
[2022-12-10 16:52:58,551] [    INFO] - logging_strategy              :IntervalStrategy.STEPS
[2022-12-10 16:52:58,551] [    INFO] - lr_scheduler_type             :SchedulerType.LINEAR
[2022-12-10 16:52:58,551] [    INFO] - max_grad_norm                 :1.0
[2022-12-10 16:52:58,551] [    INFO] - max_steps                     :-1
[2022-12-10 16:52:58,551] [    INFO] - metric_for_best_model         :eval_f1
[2022-12-10 16:52:58,551] [    INFO] - minimum_eval_times            :None
[2022-12-10 16:52:58,551] [    INFO] - moving_rate                   :0.9
[2022-12-10 16:52:58,551] [    INFO] - no_cuda                       :False
[2022-12-10 16:52:58,551] [    INFO] - num_train_epochs              :100.0
[2022-12-10 16:52:58,551] [    INFO] - onnx_format                   :True
[2022-12-10 16:52:58,551] [    INFO] - optim                         :OptimizerNames.ADAMW
[2022-12-10 16:52:58,552] [    INFO] - output_dir                    :./checkpoint/model_best
[2022-12-10 16:52:58,552] [    INFO] - overwrite_output_dir          :True
[2022-12-10 16:52:58,552] [    INFO] - past_index                    :-1
[2022-12-10 16:52:58,552] [    INFO] - per_device_eval_batch_size    :16
[2022-12-10 16:52:58,552] [    INFO] - per_device_train_batch_size   :16
[2022-12-10 16:52:58,552] [    INFO] - prediction_loss_only          :False
[2022-12-10 16:52:58,552] [    INFO] - process_index                 :0
[2022-12-10 16:52:58,552] [    INFO] - recompute                     :False
[2022-12-10 16:52:58,552] [    INFO] - remove_unused_columns         :True
[2022-12-10 16:52:58,552] [    INFO] - report_to                     :['visualdl']
[2022-12-10 16:52:58,552] [    INFO] - resume_from_checkpoint        :None
[2022-12-10 16:52:58,553] [    INFO] - round_type                    :round
[2022-12-10 16:52:58,553] [    INFO] - run_name                      :./checkpoint/model_best
[2022-12-10 16:52:58,553] [    INFO] - save_on_each_node             :False
[2022-12-10 16:52:58,553] [    INFO] - save_steps                    :100
[2022-12-10 16:52:58,553] [    INFO] - save_strategy                 :IntervalStrategy.STEPS
[2022-12-10 16:52:58,553] [    INFO] - save_total_limit              :1
[2022-12-10 16:52:58,553] [    INFO] - scale_loss                    :32768
[2022-12-10 16:52:58,553] [    INFO] - seed                          :42
[2022-12-10 16:52:58,553] [    INFO] - sharding                      :[]
[2022-12-10 16:52:58,553] [    INFO] - sharding_degree               :-1
[2022-12-10 16:52:58,553] [    INFO] - should_log                    :True
[2022-12-10 16:52:58,553] [    INFO] - should_save                   :True
[2022-12-10 16:52:58,554] [    INFO] - strategy                      :dynabert+ptq
[2022-12-10 16:52:58,554] [    INFO] - train_batch_size              :16
[2022-12-10 16:52:58,554] [    INFO] - use_pact                      :True
[2022-12-10 16:52:58,554] [    INFO] - warmup_ratio                  :0.1
[2022-12-10 16:52:58,554] [    INFO] - warmup_steps                  :0
[2022-12-10 16:52:58,554] [    INFO] - weight_decay                  :0.0
[2022-12-10 16:52:58,554] [    INFO] - weight_quantize_type          :channel_wise_abs_max
[2022-12-10 16:52:58,554] [    INFO] - width_mult_list               :None
[2022-12-10 16:52:58,554] [    INFO] - world_size                    :2
[2022-12-10 16:52:58,554] [    INFO] -
[2022-12-10 16:52:58,595] [    INFO] - ***** Running training *****
[2022-12-10 16:52:58,595] [    INFO] -   Num examples = 240
[2022-12-10 16:52:58,595] [    INFO] -   Num Epochs = 100
[2022-12-10 16:52:58,596] [    INFO] -   Instantaneous batch size per device = 16
[2022-12-10 16:52:58,596] [    INFO] -   Total train batch size (w. parallel, distributed & accumulation) = 32
[2022-12-10 16:52:58,596] [    INFO] -   Gradient Accumulation steps = 1
[2022-12-10 16:52:58,597] [    INFO] -   Total optimization steps = 800.0
[2022-12-10 16:52:58,597] [    INFO] -   Total num train samples = 24000.0
[2022-12-10 16:52:58,746] [    INFO] -   Number of trainable parameters = 117946370
Traceback (most recent call last):
  File "finetune.py", line 245, in <module>
    main()
  File "finetune.py", line 184, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/ping400/anaconda3/envs/my_paddlenlp/lib/python3.7/site-packages/paddlenlp/trainer/trainer.py", line 614, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/home/ping400/anaconda3/envs/my_paddlenlp/lib/python3.7/site-packages/paddlenlp/trainer/trainer.py", line 1253, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/ping400/anaconda3/envs/my_paddlenlp/lib/python3.7/site-packages/paddlenlp/trainer/trainer.py", line 1215, in compute_loss
    outputs = model(**inputs)
  File "/home/ping400/anaconda3/envs/my_paddlenlp/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 948, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/ping400/anaconda3/envs/my_paddlenlp/lib/python3.7/site-packages/paddle/fluid/dygraph/parallel.py", line 774, in forward
    outputs = self._layers(*inputs, **kwargs)
  File "/home/ping400/anaconda3/envs/my_paddlenlp/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 948, in __call__
    return self.forward(*inputs, **kwargs)
TypeError: forward() got an unexpected keyword argument 'pos_ids'
I1210 16:53:00.477530 40141 tcp_store.cc:257] receive shutdown event and so quit from MasterDaemon run loop
LAUNCH INFO 2022-12-10 16:53:01,904 Pod failed
LAUNCH ERROR 2022-12-10 16:53:01,905 Container failed !!!
Container rank 0 status failed cmd ['/home/ping400/anaconda3/envs/my_paddlenlp/bin/python', '-u', 'finetune.py', '--device', 'gpu', '--logging_steps', '10', '--save_steps', '100', '--eval_steps', '100', '--seed', '42', '--model_name_or_path', 'uie-base', '--output_dir', './checkpoint/model_best', '--train_path', 'data/train.txt', '--dev_path', 'data/dev.txt', '--max_seq_length', '512', '--per_device_eval_batch_size', '16', '--per_device_train_batch_size', '16', '--num_train_epochs', '100', '--learning_rate', '1e-5', '--do_train', '--do_eval', '--do_export', '--export_model_dir', './checkpoint/model_best', '--label_names', 'start_positions', 'end_positions', '--overwrite_output_dir', '--disable_tqdm', 'True', '--metric_for_best_model', 'eval_f1', '--load_best_model_at_end', 'True', '--save_total_limit', '1'] code 1 log log/workerlog.0
env {'HOSTNAME': 'localhost.localdomain', 'TERM': 'xterm', 'SHELL': '/bin/bash', 'HISTSIZE': '8192', 'finetuned_model': './checkpoint/model_best', 'SSH_CLIENT': '131.3.100.85 64094 55355', 'CONDA_SHLVL': '2', 'CONDA_PROMPT_MODIFIER': '(my_paddlenlp) ', 'OLDPWD': '/home/ping400', 'SSH_TTY': '/dev/pts/1', 'USER': 'ping400', 'LD_LIBRARY_PATH': '/home/ping400/code/PaddleNLP/model_zoo/uie/data/nccl_2.15.5-1+cuda10.2_x86_64/lib/:/home/ping400/anaconda3/envs/my_paddlenlp/lib/:', 'LS_COLORS': 'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01;36:', 'CONDA_EXE': '/home/ping400/anaconda3/bin/conda', '_CE_CONDA': '', 'CONDA_PREFIX_1': '/home/ping400/anaconda3', 'MAIL': '/var/spool/mail/ping400', 'PATH': '/home/ping400/anaconda3/envs/my_paddlenlp/bin:/home/ping400/anaconda3/condabin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/sbin:/home/ping400/.local/bin:/home/ping400/bin:/home/ping400/tools/git-run/bin:/home/ping400/tools/jdk/jdk-18.0.1.1/bin:/home/ping400/tools/rar', 'CONDA_PREFIX': '/home/ping400/anaconda3/envs/my_paddlenlp', 'PWD': '/home/ping400/code/PaddleNLP/model_zoo/uie', 'LANG': 'en_US.UTF-8', '_CE_M': '', 'HISTCONTROL': 'ignoredups', 'SHLVL': '1', 'HOME': '/home/ping400', 'CONDA_PYTHON_EXE': '/home/ping400/anaconda3/bin/python', 'LOGNAME': 'ping400', 'SSH_CONNECTION': '131.3.100.85 64094 192.168.49.4 55355', 'CONDA_DEFAULT_ENV': 'my_paddlenlp', 'LESSOPEN': '||/usr/bin/lesspipe.sh %s', 'HISTFILE': '/tmp/record/ping400/20221210/131.3.100.85@ping400.16:38:04', '_': '/home/ping400/anaconda3/envs/my_paddlenlp/bin/python', 'CUSTOM_DEVICE_ROOT': '', 'OMP_NUM_THREADS': '1', 'POD_NAME': 'fgpevr', 'PADDLE_MASTER': '127.0.0.1:36395', 'PADDLE_GLOBAL_SIZE': '2', 'PADDLE_LOCAL_SIZE': '2', 'PADDLE_GLOBAL_RANK': '0', 'PADDLE_LOCAL_RANK': '0', 'PADDLE_NNODES': '1', 'PADDLE_TRAINER_ENDPOINTS': '127.0.0.1:36396,127.0.0.1:36397', 'PADDLE_CURRENT_ENDPOINT': '127.0.0.1:36396', 'PADDLE_TRAINER_ID': '0', 'PADDLE_TRAINERS_NUM': '2', 'PADDLE_RANK_IN_NODE': '0', 'FLAGS_selected_gpus': '0'}
LAUNCH INFO 2022-12-10 16:53:01,905 ------------------------- ERROR LOG DETAIL -------------------------
2-12-10 16:52:58,554] [    INFO] - train_batch_size              :16
[2022-12-10 16:52:58,554] [    INFO] - use_pact                      :True
[2022-12-10 16:52:58,554] [    INFO] - warmup_ratio                  :0.1
[2022-12-10 16:52:58,554] [    INFO] - warmup_steps                  :0
[2022-12-10 16:52:58,554] [    INFO] - weight_decay                  :0.0
[2022-12-10 16:52:58,554] [    INFO] - weight_quantize_type          :channel_wise_abs_max
[2022-12-10 16:52:58,554] [    INFO] - width_mult_list               :None
[2022-12-10 16:52:58,554] [    INFO] - world_size                    :2
[2022-12-10 16:52:58,554] [    INFO] -
[2022-12-10 16:52:58,595] [    INFO] - ***** Running training *****
[2022-12-10 16:52:58,595] [    INFO] -   Num examples = 240
[2022-12-10 16:52:58,595] [    INFO] -   Num Epochs = 100
[2022-12-10 16:52:58,596] [    INFO] -   Instantaneous batch size per device = 16
[2022-12-10 16:52:58,596] [    INFO] -   Total train batch size (w. parallel, distributed & accumulation) = 32
[2022-12-10 16:52:58,596] [    INFO] -   Gradient Accumulation steps = 1
[2022-12-10 16:52:58,597] [    INFO] -   Total optimization steps = 800.0
[2022-12-10 16:52:58,597] [    INFO] -   Total num train samples = 24000.0
[2022-12-10 16:52:58,746] [    INFO] -   Number of trainable parameters = 117946370
Traceback (most recent call last):
  File "finetune.py", line 245, in <module>
    main()
  File "finetune.py", line 184, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/ping400/anaconda3/envs/my_paddlenlp/lib/python3.7/site-packages/paddlenlp/trainer/trainer.py", line 614, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/home/ping400/anaconda3/envs/my_paddlenlp/lib/python3.7/site-packages/paddlenlp/trainer/trainer.py", line 1253, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/ping400/anaconda3/envs/my_paddlenlp/lib/python3.7/site-packages/paddlenlp/trainer/trainer.py", line 1215, in compute_loss
    outputs = model(**inputs)
  File "/home/ping400/anaconda3/envs/my_paddlenlp/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 948, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/ping400/anaconda3/envs/my_paddlenlp/lib/python3.7/site-packages/paddle/fluid/dygraph/parallel.py", line 774, in forward
    outputs = self._layers(*inputs, **kwargs)
  File "/home/ping400/anaconda3/envs/my_paddlenlp/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 948, in __call__
    return self.forward(*inputs, **kwargs)
TypeError: forward() got an unexpected keyword argument 'pos_ids'
I1210 16:53:00.477530 40141 tcp_store.cc:257] receive shutdown event and so quit from MasterDaemon run loop
LAUNCH INFO 2022-12-10 16:53:01,906 Exit code 1
(my_paddlenlp) [ping400@localhost uie]$

Environment:


conda list
# packages in environment at /home/ping400/anaconda3/envs/my_paddlenlp:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main
_openmp_mutex             5.1                       1_gnu
aiohttp                   3.8.3                    pypi_0    pypi
aiosignal                 1.3.1                    pypi_0    pypi
anyio                     3.6.2                    pypi_0    pypi
astor                     0.8.1                    pypi_0    pypi
async-timeout             4.0.2                    pypi_0    pypi
asynctest                 0.13.0                   pypi_0    pypi
attrs                     22.1.0                   pypi_0    pypi
babel                     2.11.0                   pypi_0    pypi
bce-python-sdk            0.8.74                   pypi_0    pypi
blas                      1.0                         mkl
brotlipy                  0.7.0           py37h27cfd23_1003
ca-certificates           2022.10.11           h06a4308_0
certifi                   2022.9.24        py37h06a4308_0
cffi                      1.15.1           py37h5eee18b_3
charset-normalizer        2.1.1                    pypi_0    pypi
click                     8.1.3                    pypi_0    pypi
colorama                  0.4.6                    pypi_0    pypi
colorlog                  6.7.0                    pypi_0    pypi
commonmark                0.9.1                    pypi_0    pypi
cryptography              38.0.1           py37h9ce1e76_0
cudatoolkit               10.2.89              hfd86e86_1
cudnn                     7.6.5                cuda10.2_0
cycler                    0.11.0                   pypi_0    pypi
datasets                  2.7.1                    pypi_0    pypi
decorator                 5.1.1              pyhd3eb1b0_0
dill                      0.3.4                    pypi_0    pypi
fastapi                   0.88.0                   pypi_0    pypi
filelock                  3.8.2                    pypi_0    pypi
flask                     2.2.2                    pypi_0    pypi
flask-babel               2.0.0                    pypi_0    pypi
fonttools                 4.38.0                   pypi_0    pypi
freetype                  2.12.1               h4a9f257_0
frozenlist                1.3.3                    pypi_0    pypi
fsspec                    2022.11.0                pypi_0    pypi
future                    0.18.2                   pypi_0    pypi
giflib                    5.2.1                h7b6447c_0
h11                       0.14.0                   pypi_0    pypi
huggingface-hub           0.11.1                   pypi_0    pypi
idna                      3.4              py37h06a4308_0
importlib-metadata        5.1.0                    pypi_0    pypi
intel-openmp              2021.4.0          h06a4308_3561
itsdangerous              2.1.2                    pypi_0    pypi
jieba                     0.42.1                   pypi_0    pypi
jinja2                    3.1.2                    pypi_0    pypi
joblib                    1.2.0                    pypi_0    pypi
jpeg                      9e                   h7f8727e_0
kiwisolver                1.4.4                    pypi_0    pypi
lcms2                     2.12                 h3be6417_0
ld_impl_linux-64          2.38                 h1181459_1
lerc                      3.0                  h295c915_0
libdeflate                1.8                  h7f8727e_5
libffi                    3.4.2                h6a678d5_6
libgcc-ng                 11.2.0               h1234567_1
libgomp                   11.2.0               h1234567_1
libpng                    1.6.37               hbc83047_0
libprotobuf               3.19.1               h4ff587b_0
libstdcxx-ng              11.2.0               h1234567_1
libtiff                   4.4.0                hecacb30_2
libwebp                   1.2.4                h11a3e52_0
libwebp-base              1.2.4                h5eee18b_0
lz4-c                     1.9.3                h295c915_1
markupsafe                2.1.1                    pypi_0    pypi
matplotlib                3.5.3                    pypi_0    pypi
mkl                       2021.4.0           h06a4308_640
mkl-service               2.4.0            py37h7f8727e_0
mkl_fft                   1.3.1            py37hd3c417c_0
mkl_random                1.2.2            py37h51133e4_0
multidict                 6.0.3                    pypi_0    pypi
multiprocess              0.70.12.2                pypi_0    pypi
ncurses                   6.3                  h5eee18b_3
numpy                     1.21.6                   pypi_0    pypi
numpy-base                1.21.5           py37ha15fc14_3
openssl                   1.1.1s               h7f8727e_0
opt_einsum                3.3.0              pyhd3eb1b0_1
packaging                 22.0                     pypi_0    pypi
paddle2onnx               1.0.5                    pypi_0    pypi
paddlefsl                 1.1.0                    pypi_0    pypi
paddlenlp                 2.4.5                    pypi_0    pypi
paddlepaddle-gpu          2.4.1                    pypi_0    pypi
pandas                    1.3.5                    pypi_0    pypi
pillow                    9.3.0                    pypi_0    pypi
pip                       22.3.1           py37h06a4308_0
protobuf                  3.20.0                   pypi_0    pypi
pyarrow                   10.0.1                   pypi_0    pypi
pycparser                 2.21               pyhd3eb1b0_0
pycryptodome              3.16.0                   pypi_0    pypi
pydantic                  1.10.2                   pypi_0    pypi
pygments                  2.13.0                   pypi_0    pypi
pyopenssl                 22.0.0             pyhd3eb1b0_0
pyparsing                 3.0.9                    pypi_0    pypi
pysocks                   1.7.1                    py37_1
python                    3.7.15               h7a1cb2a_1
python-dateutil           2.8.2                    pypi_0    pypi
pytz                      2022.6                   pypi_0    pypi
pyyaml                    6.0                      pypi_0    pypi
readline                  8.2                  h5eee18b_0
requests                  2.28.1           py37h06a4308_0
responses                 0.18.0                   pypi_0    pypi
rich                      12.6.0                   pypi_0    pypi
scikit-learn              1.0.2                    pypi_0    pypi
scipy                     1.7.3                    pypi_0    pypi
sentencepiece             0.1.97                   pypi_0    pypi
seqeval                   1.2.2                    pypi_0    pypi
setuptools                65.5.0           py37h06a4308_0
six                       1.16.0             pyhd3eb1b0_1
sniffio                   1.3.0                    pypi_0    pypi
sqlite                    3.40.0               h5082296_0
starlette                 0.22.0                   pypi_0    pypi
threadpoolctl             3.1.0                    pypi_0    pypi
tk                        8.6.12               h1ccaba5_0
tqdm                      4.64.1                   pypi_0    pypi
typer                     0.7.0                    pypi_0    pypi
typing-extensions         4.4.0                    pypi_0    pypi
urllib3                   1.26.13          py37h06a4308_0
uvicorn                   0.20.0                   pypi_0    pypi
visualdl                  2.4.1                    pypi_0    pypi
werkzeug                  2.2.2                    pypi_0    pypi
wheel                     0.37.1             pyhd3eb1b0_0
xxhash                    3.1.0                    pypi_0    pypi
xz                        5.2.8                h5eee18b_0
yarl                      1.8.2                    pypi_0    pypi
zipp                      3.11.0                   pypi_0    pypi
zlib                      1.2.13               h5eee18b_0
zstd                      1.5.2                ha4553b6_0
(my_paddlenlp) [ping400@localhost uie]$

GPU card information:

]$ nvidia-smi
Sat Dec 10 17:03:21 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.95.01    Driver Version: 440.95.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:5A:00.0 Off |                    0 |
| N/A   29C    P0    22W / 250W |     12MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:5E:00.0 Off |                    0 |
| N/A   29C    P0    35W / 250W |   2524MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  Off  | 00000000:62:00.0 Off |                    0 |
| N/A   26C    P0    22W / 250W |     12MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  Off  | 00000000:66:00.0 Off |                    0 |
| N/A   26C    P0    21W / 250W |     12MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    1      8340      C   python3                                     1247MiB |
|    1      8577      C   ...dfdfd/anaconda3/envs/april/bin/python3  1265MiB |
+-----------------------------------------------------------------------------+

The training data was annotated with doccano; see the attached file: doccano_ext.json.txt

Thanks.

linjieccc commented 1 year ago

@ping40 Hi,

Please try pulling the latest code from the develop branch.

AmSure commented 1 year ago

I still get the error after updating to the latest code (screenshot 1670754881649 attached).

ping40 commented 1 year ago

I was originally on the latest code. After switching to the release/2.4 branch, the problem went away, so it looks like a version mismatch between the scripts and the installed package.

The paddlenlp package in my conda environment is: paddlenlp 2.4.5 pypi_0 pypi
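
The takeaway is that the fine-tuning scripts under model_zoo/uie and the installed paddlenlp wheel must come from the same release line. A quick sanity check of which versions are active (a minimal sketch, not from the original thread):

import paddle
import paddlenlp

print(paddle.__version__)     # 2.4.1 in the conda list above
print(paddlenlp.__version__)  # 2.4.5, which matches the release/2.4 scripts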