Closed tairen99 closed 3 years ago
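Hello. For transfer training (fine-tuning) you also need to set the pretrained_model field, not the checkpoints field. If the problem persists, please post the complete log output.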
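To make the distinction concrete, here is a minimal sketch that builds the launch command for either mode. The config path, model prefix, and GPU list are copied from the log in this thread; the helper name `build_train_command` is hypothetical and not part of PaddleOCR. It assumes, as the reply above suggests, that Global.checkpoints resumes a run (weights plus optimizer state) while Global.pretrained_model only initializes the weights.

```python
from typing import List

# Values taken from the log in this thread.
CONFIG = "configs/det/det_r50_vd_sast_icdar15.yml"
MODEL_PREFIX = "./output/sast_origin_icdar/best_accuracy"


def build_train_command(resume: bool, model_prefix: str = MODEL_PREFIX) -> List[str]:
    """Build the launch command for resuming or for fine-tuning (hypothetical helper)."""
    # Assumption: checkpoints restores weights and optimizer state,
    # pretrained_model loads weights only.
    field = "Global.checkpoints" if resume else "Global.pretrained_model"
    return [
        "python3", "-m", "paddle.distributed.launch", "--gpus", "0,1,2,3",
        "tools/train.py", "-c", CONFIG,
        "-o", "{}={}".format(field, model_prefix),
    ]


if __name__ == "__main__":
    # Fine-tuning on new data: pass the released weights as pretrained_model.
    print(" ".join(build_train_command(resume=False)))
```

The log also shows load_static_weights : True. Whether that value needs to change when the pretrained weights come from a dygraph training run is not confirmed in this thread, so check it against your PaddleOCR version.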
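@littletomatodonkey Hello, and thank you for your reply!
I may have confused the checkpoints field with the pretrained_model field. Could you explain the difference between the two and when to use each? According to the official documentation, checkpoints is for resuming a previous training run and pretrained_model is for transfer training. Is that right?
So if I want to continue training the SAST ICDAR2015 model on my own data (I want to check whether the training loss converges faster when starting from the SAST ICDAR2015 model), I should use checkpoints, right?
I continued training your EAST ICDAR2015 model on my own data, and the training loss did converge faster, with no errors.
However, when I do the same with the SAST ICDAR2015 model, I get the error above. Please see the complete log below: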
nohup: ignoring input
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
WARNING 2021-07-28 00:12:00,362 launch.py:316] Not found distinct arguments and compiled with cuda. Default use collective mode
INFO 2021-07-28 00:12:00,364 launch_utils.py:471] Local start 4 processes. First process distributed environment info (Only For Debug):
+=======================================================================================+
| Distributed Envs Value |
+---------------------------------------------------------------------------------------+
| PADDLE_TRAINER_ID 0 |
| PADDLE_CURRENT_ENDPOINT 127.0.0.1:38362 |
| PADDLE_TRAINERS_NUM 4 |
| PADDLE_TRAINER_ENDPOINTS ... 0.1:48036,127.0.0.1:41564,127.0.0.1:47477|
| FLAGS_selected_gpus 0 |
+=======================================================================================+
INFO 2021-07-28 00:12:00,364 launch_utils.py:475] details abouts PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
----------- Configuration Arguments -----------
gpus: 0,1,2,3
heter_worker_num: None
heter_workers:
http_port: None
ips: 127.0.0.1
log_dir: log
nproc_per_node: None
server_num: None
servers:
training_script: tools/train.py
training_script_args: ['-c', 'configs/det/det_r50_vd_sast_icdar15.yml', '-o', 'Global.checkpoints=./output/sast_origin_icdar/best_accuracy']
worker_num: None
workers:
------------------------------------------------
launch train in GPU mode
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
[2021/07/28 00:12:01] root INFO: Architecture :
[2021/07/28 00:12:01] root INFO: Backbone :
[2021/07/28 00:12:01] root INFO: layers : 50
[2021/07/28 00:12:01] root INFO: name : ResNet_SAST
[2021/07/28 00:12:01] root INFO: Head :
[2021/07/28 00:12:01] root INFO: name : SASTHead
[2021/07/28 00:12:01] root INFO: Neck :
[2021/07/28 00:12:01] root INFO: name : SASTFPN
[2021/07/28 00:12:01] root INFO: with_cab : True
[2021/07/28 00:12:01] root INFO: Transform : None
[2021/07/28 00:12:01] root INFO: algorithm : SAST
[2021/07/28 00:12:01] root INFO: model_type : det
[2021/07/28 00:12:01] root INFO: Eval :
[2021/07/28 00:12:01] root INFO: dataset :
[2021/07/28 00:12:01] root INFO: data_dir : ./train_data/paddle_train/text_localization/
[2021/07/28 00:12:01] root INFO: label_file_list : ['./train_data/paddle_train/text_localization/test_tag.txt']
[2021/07/28 00:12:01] root INFO: name : SimpleDataSet
[2021/07/28 00:12:01] root INFO: transforms :
[2021/07/28 00:12:01] root INFO: DecodeImage :
[2021/07/28 00:12:01] root INFO: channel_first : False
[2021/07/28 00:12:01] root INFO: img_mode : BGR
[2021/07/28 00:12:01] root INFO: DetLabelEncode : None
[2021/07/28 00:12:01] root INFO: DetResizeForTest :
[2021/07/28 00:12:01] root INFO: resize_long : 1536
[2021/07/28 00:12:01] root INFO: NormalizeImage :
[2021/07/28 00:12:01] root INFO: mean : [0.485, 0.456, 0.406]
[2021/07/28 00:12:01] root INFO: order : hwc
[2021/07/28 00:12:01] root INFO: scale : 1./255.
[2021/07/28 00:12:01] root INFO: std : [0.229, 0.224, 0.225]
[2021/07/28 00:12:01] root INFO: ToCHWImage : None
[2021/07/28 00:12:01] root INFO: KeepKeys :
[2021/07/28 00:12:01] root INFO: keep_keys : ['image', 'shape', 'polys', 'ignore_tags']
[2021/07/28 00:12:01] root INFO: loader :
[2021/07/28 00:12:01] root INFO: batch_size_per_card : 1
[2021/07/28 00:12:01] root INFO: drop_last : False
[2021/07/28 00:12:01] root INFO: num_workers : 1
[2021/07/28 00:12:01] root INFO: shuffle : False
[2021/07/28 00:12:01] root INFO: Global :
[2021/07/28 00:12:01] root INFO: cal_metric_during_train : False
[2021/07/28 00:12:01] root INFO: checkpoints : ./output/sast_origin_icdar/best_accuracy
[2021/07/28 00:12:01] root INFO: debug : False
[2021/07/28 00:12:01] root INFO: distributed : True
[2021/07/28 00:12:01] root INFO: epoch_num : 7000
[2021/07/28 00:12:01] root INFO: eval_batch_step : [0, 1000]
[2021/07/28 00:12:01] root INFO: infer_img : None
[2021/07/28 00:12:01] root INFO: load_static_weights : True
[2021/07/28 00:12:01] root INFO: log_smooth_window : 20
[2021/07/28 00:12:01] root INFO: pretrained_model : ./pretrain_models/ResNet50_vd_ssld_pretrained/
[2021/07/28 00:12:01] root INFO: print_batch_step : 2
[2021/07/28 00:12:01] root INFO: save_epoch_step : 100
[2021/07/28 00:12:01] root INFO: save_inference_dir : None
[2021/07/28 00:12:01] root INFO: save_model_dir : ./output/det_r50_sast/
[2021/07/28 00:12:01] root INFO: save_res_path : ./output/det_r50_sast/predicts_sast.txt
[2021/07/28 00:12:01] root INFO: use_gpu : True
[2021/07/28 00:12:01] root INFO: use_visualdl : False
[2021/07/28 00:12:01] root INFO: Loss :
[2021/07/28 00:12:01] root INFO: name : SASTLoss
[2021/07/28 00:12:01] root INFO: Metric :
[2021/07/28 00:12:01] root INFO: main_indicator : hmean
[2021/07/28 00:12:01] root INFO: name : DetMetric
[2021/07/28 00:12:01] root INFO: Optimizer :
[2021/07/28 00:12:01] root INFO: beta1 : 0.9
[2021/07/28 00:12:01] root INFO: beta2 : 0.999
[2021/07/28 00:12:01] root INFO: lr :
[2021/07/28 00:12:01] root INFO: learning_rate : 0.001
[2021/07/28 00:12:01] root INFO: name : Adam
[2021/07/28 00:12:01] root INFO: regularizer :
[2021/07/28 00:12:01] root INFO: factor : 0
[2021/07/28 00:12:01] root INFO: name : L2
[2021/07/28 00:12:01] root INFO: PostProcess :
[2021/07/28 00:12:01] root INFO: expand_scale : 1.0
[2021/07/28 00:12:01] root INFO: name : SASTPostProcess
[2021/07/28 00:12:01] root INFO: nms_thresh : 0.2
[2021/07/28 00:12:01] root INFO: sample_pts_num : 2
[2021/07/28 00:12:01] root INFO: score_thresh : 0.5
[2021/07/28 00:12:01] root INFO: shrink_ratio_of_width : 0.3
[2021/07/28 00:12:01] root INFO: Train :
[2021/07/28 00:12:01] root INFO: dataset :
[2021/07/28 00:12:01] root INFO: data_dir : ./train_data/paddle_train/text_localization/
[2021/07/28 00:12:01] root INFO: label_file_list : ['./train_data/paddle_train/text_localization/train_tag.txt']
[2021/07/28 00:12:01] root INFO: name : SimpleDataSet
[2021/07/28 00:12:01] root INFO: ratio_list : [1]
[2021/07/28 00:12:01] root INFO: transforms :
[2021/07/28 00:12:01] root INFO: DecodeImage :
[2021/07/28 00:12:01] root INFO: channel_first : False
[2021/07/28 00:12:01] root INFO: img_mode : BGR
[2021/07/28 00:12:01] root INFO: DetLabelEncode : None
[2021/07/28 00:12:01] root INFO: SASTProcessTrain :
[2021/07/28 00:12:01] root INFO: image_shape : [512, 512]
[2021/07/28 00:12:01] root INFO: max_text_size : 512
[2021/07/28 00:12:01] root INFO: min_crop_side_ratio : 0.3
[2021/07/28 00:12:01] root INFO: min_crop_size : 24
[2021/07/28 00:12:01] root INFO: min_text_size : 4
[2021/07/28 00:12:01] root INFO: KeepKeys :
[2021/07/28 00:12:01] root INFO: keep_keys : ['image', 'score_map', 'border_map', 'training_mask', 'tvo_map', 'tco_map']
[2021/07/28 00:12:01] root INFO: loader :
[2021/07/28 00:12:01] root INFO: batch_size_per_card : 8
[2021/07/28 00:12:01] root INFO: drop_last : False
[2021/07/28 00:12:01] root INFO: num_workers : 0
[2021/07/28 00:12:01] root INFO: shuffle : True
[2021/07/28 00:12:01] root INFO: train with paddle 2.0.0 and device CUDAPlace(0)
W0728 00:12:01.457197 1134 nccl_context.cc:142] Socket connect worker 127.0.0.1:41564 failed, try again after 3 seconds.
I0728 00:12:04.457449 1134 nccl_context.cc:189] init nccl context nranks: 4 local rank: 0 gpu id: 0 ring id: 0
W0728 00:12:04.807890 1134 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.2, Runtime API Version: 11.0
W0728 00:12:04.810883 1134 device_context.cc:372] device: 0, cuDNN Version: 8.0.
[2021/07/28 00:12:07] root INFO: Initialize indexs of datasets:['./train_data/paddle_train/text_localization/train_tag.txt']
[2021/07/28 00:12:07] root INFO: Initialize indexs of datasets:['./train_data/paddle_train/text_localization/test_tag.txt']
[2021/07/28 00:12:11] root INFO: resume from ./output/sast_origin_icdar/best_accuracy
[2021/07/28 00:12:11] root INFO: train dataloader has 74 iters, valid dataloader has 540 iters
[2021/07/28 00:12:11] root INFO: During the training process, after the 0th iteration, an evaluation is run every 1000 iterations
[2021/07/28 00:12:11] root INFO: Initialize indexs of datasets:['./train_data/paddle_train/text_localization/train_tag.txt']
Traceback (most recent call last):
File "tools/train.py", line 120, in <module>
main(config, device, logger, vdl_writer)
File "tools/train.py", line 97, in main
eval_class, pre_best_model_dict, logger, vdl_writer)
INFO 2021-07-28 00:12:39,447 launch_utils.py:307] terminate all the procs
ERROR 2021-07-28 00:12:39,447 launch_utils.py:545] ABORT!!! Out of all 4 trainers, the trainer process with rank=[0, 1, 2] was aborted. Please check its log.
INFO 2021-07-28 00:12:42,450 launch_utils.py:307] terminate all the procs
File "/home/PaddleOCR/tools/program.py", line 214, in train
optimizer.step()
File "<decorator-gen-198>", line 2, in step
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/base.py", line 260, in __impl__
return func(*args, **kwargs)
File "<decorator-gen-196>", line 2, in step
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/wrapped_decorator.py", line 25, in __impl__
return wrapped_func(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/framework.py", line 225, in __impl__
return func(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/paddle/optimizer/adam.py", line 367, in step
loss=None, startup_program=None, params_grads=params_grads)
File "/usr/local/lib/python3.7/dist-packages/paddle/optimizer/optimizer.py", line 775, in _apply_optimize
optimize_ops = self._create_optimization_pass(params_grads)
File "/usr/local/lib/python3.7/dist-packages/paddle/optimizer/optimizer.py", line 597, in _create_optimization_pass
[p[0] for p in parameters_and_grads if p[0].trainable])
File "/usr/local/lib/python3.7/dist-packages/paddle/optimizer/adam.py", line 249, in _create_accumulators
self._add_moments_pows(p)
File "/usr/local/lib/python3.7/dist-packages/paddle/optimizer/adam.py", line 216, in _add_moments_pows
self._add_accumulator(self._moment1_acc_str, p, dtype=acc_dtype)
File "/usr/local/lib/python3.7/dist-packages/paddle/optimizer/optimizer.py", line 515, in _add_accumulator
"Optimizer set error, {} should in state dict".format( var_name )
AssertionError: Optimizer set error, conv1_1_weights_moment1_0 should in state dict
terminate called without an active exception
--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0 paddle::framework::SignalHandle(char const*, int)
1 paddle::platform::GetCurrentTraceBackString[abi:cxx11]()
----------------------
Error Message Summary:
----------------------
FatalError: `Process abort signal` is detected by the operating system.
[TimeInfo: *** Aborted at 1627431140 (unix time) try "date -d @1627431140" if you are using GNU date ***]
[SignalInfo: *** SIGABRT (@0x46e) received by PID 1134 (TID 0x7f6eab7fe700) from PID 1134 ***]
Any guidance would be appreciated. Thank you!
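@littletomatodonkey Hello!
Thank you for your reply.
I found the cause of this error: PaddleOCR trained the SAST ICDAR2015 model with the RMSProp optimizer, while the default config file uses Adam, so resuming the previous training run caused an optimizer mismatch.
Thanks.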
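The mismatch can be illustrated with a small check over optimizer state keys. This is only a sketch: the `_moment1_0` suffix comes from the AssertionError in the log, while the `_moment2_0` suffix and the RMSProp-style key names in the example are assumptions about how Paddle names its accumulators. The function `missing_adam_state` is hypothetical.

```python
from typing import Iterable, List

# "_moment1_0" appears in the traceback ("conv1_1_weights_moment1_0");
# "_moment2_0" is an assumed name for Adam's second accumulator.
ADAM_SUFFIXES = ("_moment1_0", "_moment2_0")


def missing_adam_state(opt_state_keys: Iterable[str],
                       param_names: Iterable[str]) -> List[str]:
    """Return the Adam accumulator keys absent from a saved optimizer state."""
    saved = set(opt_state_keys)
    return [
        name + suffix
        for name in param_names
        for suffix in ADAM_SUFFIXES
        if name + suffix not in saved
    ]


if __name__ == "__main__":
    # Hypothetical keys standing in for a state saved by an RMSProp run.
    rmsprop_keys = ["conv1_1_weights_momentum_0", "conv1_1_weights_mean_square_0"]
    print(missing_adam_state(rmsprop_keys, ["conv1_1_weights"]))
```

With a real checkpoint, the keys would presumably come from the saved optimizer file (for example `paddle.load(prefix + ".pdopt")`, assuming that file layout). A non-empty result suggests the optimizer in the config differs from the one that produced the checkpoint. In that case, either make the Optimizer section of the config match, or load the weights through Global.pretrained_model so no optimizer state is restored.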
Hello PaddleOCR team!
I am training the SAST text detection method on my own data. When I use the 'best_accuracy' weights from https://paddleocr.bj.bcebos.com/dygraph_v2.0/en/det_r50_vd_sast_icdar15_v2.0_train.tar for transfer training, I get the following error:
My config file is as follows:
However, training from scratch works without any problem. I am not sure what the cause is, so any guidance would be appreciated. Thanks!