About AssertionError: Optimizer set error

Hello PaddleOCR team!

I am training the SAST text detection method on my own data. When I use `best_accuracy` from https://paddleocr.bj.bcebos.com/dygraph_v2.0/en/det_r50_vd_sast_icdar15_v2.0_train.tar for transfer training, I get the following error:

Traceback (most recent call last):
File "tools/train.py", line 120, in <module>
  main(config, device, logger, vdl_writer)
File "tools/train.py", line 97, in main
  eval_class, pre_best_model_dict, logger, vdl_writer)
INFO 2021-07-28 00:12:39,447 launch_utils.py:307] terminate all the procs
ERROR 2021-07-28 00:12:39,447 launch_utils.py:545] ABORT!!! Out of all 4 trainers, the trainer process with rank=[0, 1, 2] was aborted. Please check its log.
INFO 2021-07-28 00:12:42,450 launch_utils.py:307] terminate all the procs
  File "/home/PaddleOCR/tools/program.py", line 214, in train
    optimizer.step()
  File "<decorator-gen-198>", line 2, in step
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/base.py", line 260, in __impl__
    return func(*args, **kwargs)
  File "<decorator-gen-196>", line 2, in step
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/wrapped_decorator.py", line 25, in __impl__
    return wrapped_func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/framework.py", line 225, in __impl__
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/paddle/optimizer/adam.py", line 367, in step
    loss=None, startup_program=None, params_grads=params_grads)
  File "/usr/local/lib/python3.7/dist-packages/paddle/optimizer/optimizer.py", line 775, in _apply_optimize
    optimize_ops = self._create_optimization_pass(params_grads)
  File "/usr/local/lib/python3.7/dist-packages/paddle/optimizer/optimizer.py", line 597, in _create_optimization_pass
    [p[0] for p in parameters_and_grads if p[0].trainable])
  File "/usr/local/lib/python3.7/dist-packages/paddle/optimizer/adam.py", line 249, in _create_accumulators
    self._add_moments_pows(p)
  File "/usr/local/lib/python3.7/dist-packages/paddle/optimizer/adam.py", line 216, in _add_moments_pows
    self._add_accumulator(self._moment1_acc_str, p, dtype=acc_dtype)
  File "/usr/local/lib/python3.7/dist-packages/paddle/optimizer/optimizer.py", line 515, in _add_accumulator
    "Optimizer set error, {} should in state dict".format( var_name )
AssertionError: Optimizer set error, conv1_1_weights_moment1_0 should in state dict
terminate called without an active exception

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle::framework::SignalHandle(char const*, int)
1   paddle::platform::GetCurrentTraceBackString[abi:cxx11]()

----------------------
Error Message Summary:
----------------------
FatalError: `Process abort signal` is detected by the operating system.
  [TimeInfo: *** Aborted at 1627431140 (unix time) try "date -d @1627431140" if you are using GNU date ***]
  [SignalInfo: *** SIGABRT (@0x46e) received by PID 1134 (TID 0x7f6eab7fe700) from PID 1134 ***]
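
The assertion says the loaded optimizer state has no entry named `conv1_1_weights_moment1_0`. One way to narrow this down is to compare the model's parameter names with the keys of the optimizer state. The following is a minimal sketch rather than PaddleOCR code: `missing_accumulators` is a hypothetical helper, the `_moment1_0` suffix comes from the error message, and the other suffixes are assumptions about how Adam names its accumulators.

```python
from typing import Dict, Iterable, List

# "_moment1_0" is taken from the name in the error message
# (conv1_1_weights_moment1_0); the remaining suffixes are assumptions.
ADAM_SUFFIXES = ("_moment1_0", "_moment2_0", "_beta1_pow_acc_0", "_beta2_pow_acc_0")


def missing_accumulators(param_names: Iterable[str],
                         opt_state: Dict[str, object],
                         suffixes: Iterable[str] = ADAM_SUFFIXES) -> List[str]:
    """Return the accumulator keys expected for param_names that opt_state lacks."""
    suffixes = tuple(suffixes)
    return [name + suffix
            for name in param_names
            for suffix in suffixes
            if name + suffix not in opt_state]


if __name__ == "__main__":
    # Toy data only. With a real run, the inputs could come from something like
    #   opt_state = paddle.load("<checkpoint prefix>.pdopt")
    #   param_names = [p.name for p in model.parameters()]
    # (assumed usage, not verified against this checkpoint).
    toy_params = ["conv1_1_weights"]
    toy_state = {"some_other_weights_moment1_0": None}
    print(missing_accumulators(toy_params, toy_state, suffixes=("_moment1_0",)))
```

If the returned list is non-empty for the real checkpoint, the optimizer state was saved with parameter names that differ from those of the model being trained.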

My configuration file is as follows:

  Global:
    use_gpu: true
    epoch_num: 7000
    log_smooth_window: 20
    print_batch_step: 2
    save_model_dir: ./output/det_r50_sast/
    save_epoch_step: 100
    # evaluation is run every 1000 iterations after the 0th iteration
    eval_batch_step: [0,1000]
    # if pretrained_model is saved in static mode, load_static_weights must set to True
    load_static_weights: True
    cal_metric_during_train: False
    pretrained_model: ./pretrain_models/ResNet50_vd_ssld_pretrained/
    checkpoints: 
    save_inference_dir:
    use_visualdl: False
    infer_img: 
    save_res_path: ./output/det_r50_sast/predicts_sast.txt

  Architecture:
    model_type: det
    algorithm: SAST
    Transform:
    Backbone:
      name: ResNet_SAST
      layers: 50
    Neck:
      name: SASTFPN
      with_cab: True
    Head:
      name: SASTHead

  Loss:
    name: SASTLoss

  Optimizer:
    name: Adam
    beta1: 0.9
    beta2: 0.999
    lr:
    #  name: Cosine
      learning_rate: 0.001
    #  warmup_epoch: 0
    regularizer:
      name: 'L2'
      factor: 0

  PostProcess:
    name: SASTPostProcess
    score_thresh: 0.5
    sample_pts_num: 2
    nms_thresh: 0.2
    expand_scale: 1.0
    shrink_ratio_of_width: 0.3

  Metric:
    name: DetMetric
    main_indicator: hmean

  Train:
    dataset:
      name: SimpleDataSet
      data_dir: ./train_data/paddle_train/text_localization/
      label_file_list:
        - ./train_data/paddle_train/text_localization/train_tag.txt
      ratio_list: [1]
      transforms:
        - DecodeImage: # load image
            img_mode: BGR
            channel_first: False
        - DetLabelEncode: # Class handling label
        - SASTProcessTrain:
            image_shape: [512, 512]
            min_crop_side_ratio: 0.3
            min_crop_size: 24
            min_text_size: 4
            max_text_size: 512
        - KeepKeys:
            keep_keys: ['image', 'score_map', 'border_map', 'training_mask', 'tvo_map', 'tco_map'] # dataloader will return list in this order
    loader:
      shuffle: True
      drop_last: False
      batch_size_per_card: 8
      num_workers: 0

  Eval:
    dataset:
      name: SimpleDataSet
      data_dir: ./train_data/paddle_train/text_localization/
      label_file_list:
        - ./train_data/paddle_train/text_localization/test_tag.txt
      transforms:
        - DecodeImage: # load image
            img_mode: BGR
            channel_first: False
        - DetLabelEncode: # Class handling label
        - DetResizeForTest:
            resize_long: 1536
        - NormalizeImage:
            scale: 1./255.
            mean: [0.485, 0.456, 0.406]
            std: [0.229, 0.224, 0.225]
            order: 'hwc'
        - ToCHWImage:
        - KeepKeys:
            keep_keys: ['image', 'shape', 'polys', 'ignore_tags']
    loader:
      shuffle: False
      drop_last: False
      batch_size_per_card: 1 # must be 1
      num_workers: 1

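However, if I train from scratch, there is no problem at all. I don't know what the cause is, so any pointers would be appreciated. Thanks a lot!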

@littletomatodonkey, hello! Thank you for your reply!

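I may have confused the `checkpoints` field with the `pretrained_model` field. Could you explain the difference between the two and when to use which? According to the official documentation, `checkpoints` is for resuming a previous training run and `pretrained_model` is for transfer training. Is that right?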
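
To make that understanding concrete, here is a minimal sketch of the distinction as I read it. It is not PaddleOCR's loader: `files_to_load` is a hypothetical helper, and it assumes that resuming reads both the weights (`.pdparams`) and the optimizer state (`.pdopt`), that transfer training reads only the weights, and that `checkpoints` takes precedence when both fields are set (the log reports "resume from" the checkpoint path while `pretrained_model` is also set).

```python
from typing import Dict, List, Optional


def files_to_load(global_cfg: Dict[str, Optional[str]]) -> List[str]:
    """List the files a run would read under the assumed semantics."""
    checkpoints = global_cfg.get("checkpoints")
    pretrained = global_cfg.get("pretrained_model")
    if checkpoints:
        # Resume: weights plus optimizer state (assumed).
        return [checkpoints + ".pdparams", checkpoints + ".pdopt"]
    if pretrained:
        # Transfer training: weights only, the optimizer starts fresh (assumed).
        return [pretrained + ".pdparams"]
    return []


if __name__ == "__main__":
    resume_cfg = {
        "checkpoints": "./output/sast_origin_icdar/best_accuracy",
        "pretrained_model": "./pretrain_models/ResNet50_vd_ssld_pretrained/",
    }
    print(files_to_load(resume_cfg))
```

Under these assumptions, only the `checkpoints` path touches the optimizer state, which is where the assertion is raised.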

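So if I want to continue training the SAST ICDAR2015 model on my own data (because I want to check whether the training loss converges faster when starting from the SAST ICDAR2015 model), I should use `checkpoints`, right?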

I used your EAST ICDAR2015 model to continue training on my own data, and I did observe that my training loss converged faster, with no errors.

However, when I do the same with the SAST ICDAR2015 model, I get the error above. Please see the complete log below:

    nohup: ignoring input
    grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
    WARNING 2021-07-28 00:12:00,362 launch.py:316] Not found distinct arguments and compiled with cuda. Default use collective mode
    INFO 2021-07-28 00:12:00,364 launch_utils.py:471] Local start 4 processes. First process distributed environment info (Only For Debug): 
        +=======================================================================================+
        |                        Distributed Envs                      Value                    |
        +---------------------------------------------------------------------------------------+
        |                       PADDLE_TRAINER_ID                        0                      |
        |                 PADDLE_CURRENT_ENDPOINT                 127.0.0.1:38362               |
        |                     PADDLE_TRAINERS_NUM                        4                      |
        |                PADDLE_TRAINER_ENDPOINTS  ... 0.1:48036,127.0.0.1:41564,127.0.0.1:47477|
        |                     FLAGS_selected_gpus                        0                      |
        +=======================================================================================+

    INFO 2021-07-28 00:12:00,364 launch_utils.py:475] details abouts PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
    -----------  Configuration Arguments -----------
    gpus: 0,1,2,3
    heter_worker_num: None
    heter_workers: 
    http_port: None
    ips: 127.0.0.1
    log_dir: log
    nproc_per_node: None
    server_num: None
    servers: 
    training_script: tools/train.py
    training_script_args: ['-c', 'configs/det/det_r50_vd_sast_icdar15.yml', '-o', 'Global.checkpoints=./output/sast_origin_icdar/best_accuracy']
    worker_num: None
    workers: 
    ------------------------------------------------
    launch train in GPU mode
    grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
    [2021/07/28 00:12:01] root INFO: Architecture : 
    [2021/07/28 00:12:01] root INFO:     Backbone : 
    [2021/07/28 00:12:01] root INFO:         layers : 50
    [2021/07/28 00:12:01] root INFO:         name : ResNet_SAST
    [2021/07/28 00:12:01] root INFO:     Head : 
    [2021/07/28 00:12:01] root INFO:         name : SASTHead
    [2021/07/28 00:12:01] root INFO:     Neck : 
    [2021/07/28 00:12:01] root INFO:         name : SASTFPN
    [2021/07/28 00:12:01] root INFO:         with_cab : True
    [2021/07/28 00:12:01] root INFO:     Transform : None
    [2021/07/28 00:12:01] root INFO:     algorithm : SAST
    [2021/07/28 00:12:01] root INFO:     model_type : det
    [2021/07/28 00:12:01] root INFO: Eval : 
    [2021/07/28 00:12:01] root INFO:     dataset : 
    [2021/07/28 00:12:01] root INFO:         data_dir : ./train_data/paddle_train/text_localization/
    [2021/07/28 00:12:01] root INFO:         label_file_list : ['./train_data/paddle_train/text_localization/test_tag.txt']
    [2021/07/28 00:12:01] root INFO:         name : SimpleDataSet
    [2021/07/28 00:12:01] root INFO:         transforms : 
    [2021/07/28 00:12:01] root INFO:             DecodeImage : 
    [2021/07/28 00:12:01] root INFO:                 channel_first : False
    [2021/07/28 00:12:01] root INFO:                 img_mode : BGR
    [2021/07/28 00:12:01] root INFO:             DetLabelEncode : None
    [2021/07/28 00:12:01] root INFO:             DetResizeForTest : 
    [2021/07/28 00:12:01] root INFO:                 resize_long : 1536
    [2021/07/28 00:12:01] root INFO:             NormalizeImage : 
    [2021/07/28 00:12:01] root INFO:                 mean : [0.485, 0.456, 0.406]
    [2021/07/28 00:12:01] root INFO:                 order : hwc
    [2021/07/28 00:12:01] root INFO:                 scale : 1./255.
    [2021/07/28 00:12:01] root INFO:                 std : [0.229, 0.224, 0.225]
    [2021/07/28 00:12:01] root INFO:             ToCHWImage : None
    [2021/07/28 00:12:01] root INFO:             KeepKeys : 
    [2021/07/28 00:12:01] root INFO:                 keep_keys : ['image', 'shape', 'polys', 'ignore_tags']
    [2021/07/28 00:12:01] root INFO:     loader : 
    [2021/07/28 00:12:01] root INFO:         batch_size_per_card : 1
    [2021/07/28 00:12:01] root INFO:         drop_last : False
    [2021/07/28 00:12:01] root INFO:         num_workers : 1
    [2021/07/28 00:12:01] root INFO:         shuffle : False
    [2021/07/28 00:12:01] root INFO: Global : 
    [2021/07/28 00:12:01] root INFO:     cal_metric_during_train : False
    [2021/07/28 00:12:01] root INFO:     checkpoints : ./output/sast_origin_icdar/best_accuracy
    [2021/07/28 00:12:01] root INFO:     debug : False
    [2021/07/28 00:12:01] root INFO:     distributed : True
    [2021/07/28 00:12:01] root INFO:     epoch_num : 7000
    [2021/07/28 00:12:01] root INFO:     eval_batch_step : [0, 1000]
    [2021/07/28 00:12:01] root INFO:     infer_img : None
    [2021/07/28 00:12:01] root INFO:     load_static_weights : True
    [2021/07/28 00:12:01] root INFO:     log_smooth_window : 20
    [2021/07/28 00:12:01] root INFO:     pretrained_model : ./pretrain_models/ResNet50_vd_ssld_pretrained/
    [2021/07/28 00:12:01] root INFO:     print_batch_step : 2
    [2021/07/28 00:12:01] root INFO:     save_epoch_step : 100
    [2021/07/28 00:12:01] root INFO:     save_inference_dir : None
    [2021/07/28 00:12:01] root INFO:     save_model_dir : ./output/det_r50_sast/
    [2021/07/28 00:12:01] root INFO:     save_res_path : ./output/det_r50_sast/predicts_sast.txt
    [2021/07/28 00:12:01] root INFO:     use_gpu : True
    [2021/07/28 00:12:01] root INFO:     use_visualdl : False
    [2021/07/28 00:12:01] root INFO: Loss : 
    [2021/07/28 00:12:01] root INFO:     name : SASTLoss
    [2021/07/28 00:12:01] root INFO: Metric : 
    [2021/07/28 00:12:01] root INFO:     main_indicator : hmean
    [2021/07/28 00:12:01] root INFO:     name : DetMetric
    [2021/07/28 00:12:01] root INFO: Optimizer : 
    [2021/07/28 00:12:01] root INFO:     beta1 : 0.9
    [2021/07/28 00:12:01] root INFO:     beta2 : 0.999
    [2021/07/28 00:12:01] root INFO:     lr : 
    [2021/07/28 00:12:01] root INFO:         learning_rate : 0.001
    [2021/07/28 00:12:01] root INFO:     name : Adam
    [2021/07/28 00:12:01] root INFO:     regularizer : 
    [2021/07/28 00:12:01] root INFO:         factor : 0
    [2021/07/28 00:12:01] root INFO:         name : L2
    [2021/07/28 00:12:01] root INFO: PostProcess : 
    [2021/07/28 00:12:01] root INFO:     expand_scale : 1.0
    [2021/07/28 00:12:01] root INFO:     name : SASTPostProcess
    [2021/07/28 00:12:01] root INFO:     nms_thresh : 0.2
    [2021/07/28 00:12:01] root INFO:     sample_pts_num : 2
    [2021/07/28 00:12:01] root INFO:     score_thresh : 0.5
    [2021/07/28 00:12:01] root INFO:     shrink_ratio_of_width : 0.3
    [2021/07/28 00:12:01] root INFO: Train : 
    [2021/07/28 00:12:01] root INFO:     dataset : 
    [2021/07/28 00:12:01] root INFO:         data_dir : ./train_data/paddle_train/text_localization/
    [2021/07/28 00:12:01] root INFO:         label_file_list : ['./train_data/paddle_train/text_localization/train_tag.txt']
    [2021/07/28 00:12:01] root INFO:         name : SimpleDataSet
    [2021/07/28 00:12:01] root INFO:         ratio_list : [1]
    [2021/07/28 00:12:01] root INFO:         transforms : 
    [2021/07/28 00:12:01] root INFO:             DecodeImage : 
    [2021/07/28 00:12:01] root INFO:                 channel_first : False
    [2021/07/28 00:12:01] root INFO:                 img_mode : BGR
    [2021/07/28 00:12:01] root INFO:             DetLabelEncode : None
    [2021/07/28 00:12:01] root INFO:             SASTProcessTrain : 
    [2021/07/28 00:12:01] root INFO:                 image_shape : [512, 512]
    [2021/07/28 00:12:01] root INFO:                 max_text_size : 512
    [2021/07/28 00:12:01] root INFO:                 min_crop_side_ratio : 0.3
    [2021/07/28 00:12:01] root INFO:                 min_crop_size : 24
    [2021/07/28 00:12:01] root INFO:                 min_text_size : 4
    [2021/07/28 00:12:01] root INFO:             KeepKeys : 
    [2021/07/28 00:12:01] root INFO:                 keep_keys : ['image', 'score_map', 'border_map', 'training_mask', 'tvo_map', 'tco_map']
    [2021/07/28 00:12:01] root INFO:     loader : 
    [2021/07/28 00:12:01] root INFO:         batch_size_per_card : 8
    [2021/07/28 00:12:01] root INFO:         drop_last : False
    [2021/07/28 00:12:01] root INFO:         num_workers : 0
    [2021/07/28 00:12:01] root INFO:         shuffle : True
    [2021/07/28 00:12:01] root INFO: train with paddle 2.0.0 and device CUDAPlace(0)
    W0728 00:12:01.457197  1134 nccl_context.cc:142] Socket connect worker 127.0.0.1:41564 failed, try again after 3 seconds.
    I0728 00:12:04.457449  1134 nccl_context.cc:189] init nccl context nranks: 4 local rank: 0 gpu id: 0 ring id: 0
    W0728 00:12:04.807890  1134 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.2, Runtime API Version: 11.0
    W0728 00:12:04.810883  1134 device_context.cc:372] device: 0, cuDNN Version: 8.0.
    [2021/07/28 00:12:07] root INFO: Initialize indexs of datasets:['./train_data/paddle_train/text_localization/train_tag.txt']
    [2021/07/28 00:12:07] root INFO: Initialize indexs of datasets:['./train_data/paddle_train/text_localization/test_tag.txt']
    [2021/07/28 00:12:11] root INFO: resume from ./output/sast_origin_icdar/best_accuracy
    [2021/07/28 00:12:11] root INFO: train dataloader has 74 iters, valid dataloader has 540 iters
    [2021/07/28 00:12:11] root INFO: During the training process, after the 0th iteration, an evaluation is run every 1000 iterations
    [2021/07/28 00:12:11] root INFO: Initialize indexs of datasets:['./train_data/paddle_train/text_localization/train_tag.txt']
    Traceback (most recent call last):
      File "tools/train.py", line 120, in <module>
        main(config, device, logger, vdl_writer)
      File "tools/train.py", line 97, in main
        eval_class, pre_best_model_dict, logger, vdl_writer)
    INFO 2021-07-28 00:12:39,447 launch_utils.py:307] terminate all the procs
    ERROR 2021-07-28 00:12:39,447 launch_utils.py:545] ABORT!!! Out of all 4 trainers, the trainer process with rank=[0, 1, 2] was aborted. Please check its log.
    INFO 2021-07-28 00:12:42,450 launch_utils.py:307] terminate all the procs
      File "/home/PaddleOCR/tools/program.py", line 214, in train
        optimizer.step()
      File "<decorator-gen-198>", line 2, in step
      File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/base.py", line 260, in __impl__
        return func(*args, **kwargs)
      File "<decorator-gen-196>", line 2, in step
      File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/wrapped_decorator.py", line 25, in __impl__
        return wrapped_func(*args, **kwargs)
      File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/framework.py", line 225, in __impl__
        return func(*args, **kwargs)
      File "/usr/local/lib/python3.7/dist-packages/paddle/optimizer/adam.py", line 367, in step
        loss=None, startup_program=None, params_grads=params_grads)
      File "/usr/local/lib/python3.7/dist-packages/paddle/optimizer/optimizer.py", line 775, in _apply_optimize
        optimize_ops = self._create_optimization_pass(params_grads)
      File "/usr/local/lib/python3.7/dist-packages/paddle/optimizer/optimizer.py", line 597, in _create_optimization_pass
        [p[0] for p in parameters_and_grads if p[0].trainable])
      File "/usr/local/lib/python3.7/dist-packages/paddle/optimizer/adam.py", line 249, in _create_accumulators
        self._add_moments_pows(p)
      File "/usr/local/lib/python3.7/dist-packages/paddle/optimizer/adam.py", line 216, in _add_moments_pows
        self._add_accumulator(self._moment1_acc_str, p, dtype=acc_dtype)
      File "/usr/local/lib/python3.7/dist-packages/paddle/optimizer/optimizer.py", line 515, in _add_accumulator
        "Optimizer set error, {} should in state dict".format( var_name )
    AssertionError: Optimizer set error, conv1_1_weights_moment1_0 should in state dict
    terminate called without an active exception

    --------------------------------------
    C++ Traceback (most recent call last):
    --------------------------------------
    0   paddle::framework::SignalHandle(char const*, int)
    1   paddle::platform::GetCurrentTraceBackString[abi:cxx11]()

    ----------------------
    Error Message Summary:
    ----------------------
    FatalError: `Process abort signal` is detected by the operating system.
      [TimeInfo: *** Aborted at 1627431140 (unix time) try "date -d @1627431140" if you are using GNU date ***]
      [SignalInfo: *** SIGABRT (@0x46e) received by PID 1134 (TID 0x7f6eab7fe700) from PID 1134 ***]

I would appreciate your guidance. Thank you!

PaddlePaddle / PaddleOCR

About AssertionError: Optimizer set error #3443