Open smallhaozigithub opened 2 years ago
Which model and backbone exactly are you using?
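The problem of optimizer parameters not being found when resuming training was fixed in PaddleX 2.1.0; see https://github.com/PaddlePaddle/PaddleX/pull/1197 . Using PaddleX 2.1.0 with PPYOLO + ResNet50_vd_dcn, we could not reproduce the problem.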
You can run pip show paddlex to confirm which PaddleX version you are using.
Oh... PaddleX 2.1.0 with PPYOLO + ResNet50_vd_dcn is exactly what I am using. Let me double-check.
You can also run import paddlex and then print(paddlex.__version__) to check.
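For readers who want to check the version from a script without importing the package, here is a minimal sketch using the standard library. The helper name installed_version is hypothetical, and importlib.metadata needs Python 3.8 or later, so on the Python 3.7 environment shown in this thread the pip show approach still applies.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the installed PaddleX version, or None when paddlex is not installed.
print(installed_version("paddlex"))
```

This reads the installed distribution's metadata, so it reports what pip installed even if importing the package itself would fail.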
Name: paddlex
Version: 2.1.0
Summary: PaddlePaddle End-to-End Development Toolkit
Home-page: https://github.com/PaddlePaddle/PaddleX
Author: paddlex
Author-email: paddlex@baidu.com
License: Apache 2.0
Location: /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages
Requires: chardet, colorama, flask-cors, lap, motmetrics, opencv-python, openpyxl, paddleslim, pycocotools, pyyaml, scikit-learn, scipy, shapely, tqdm, visualdl
Required-by:
paddlex is version 2.1.0.
I get the same behavior on AI Studio and on my local machine.
The error is identical in both.
Could you provide code that reproduces this error?
One moment, please.
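We ran your code as a Python script without any problem, and resuming training on AI Studio after restarting the environment also works. We reproduced the error only when re-running the resume-training cell directly, without restarting the environment.
Moreover, the PPYOLO + ResNet50_vd_dcn model has no convolution layer named conv2d_79, the layer mentioned in the error message (its last convolution layer is numbered 78). We therefore suspect the error is caused by a cache in AI Studio that was not cleared.
Please try restarting the environment manually before resuming training and see whether the error still occurs.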
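One way to picture the reasoning above is the toy sketch below. It assumes layer names come from a process-wide counter that starts at 0 and is never reset while the kernel stays alive. That is an assumption for illustration only: the thread does not show Paddle's naming code, and NameGenerator and build_model are hypothetical names.

```python
from collections import defaultdict


class NameGenerator:
    """Toy stand-in for a process-wide layer-name counter (not Paddle's code)."""

    def __init__(self):
        self._counters = defaultdict(int)

    def generate(self, prefix):
        # Hand out prefix_0, prefix_1, ... and remember how many were issued.
        index = self._counters[prefix]
        self._counters[prefix] += 1
        return f"{prefix}_{index}"


def build_model(namer, num_convs=79):
    """Pretend to build a network with 79 conv layers, numbered 0 to 78."""
    return [namer.generate("conv2d") for _ in range(num_convs)]


shared_namer = NameGenerator()                # lives as long as the kernel does
first_build = build_model(shared_namer)       # first run of the cell
second_build = build_model(shared_namer)      # cell re-run without a restart
fresh_build = build_model(NameGenerator())    # run after a restart

print(first_build[-1])   # last name from the first build
print(second_build[0])   # first name from the re-run
print(fresh_build[-1])   # last name after a fresh start
```

Under this assumption, a checkpoint saved from the first build holds names up to conv2d_78, while a model rebuilt in the same process starts at conv2d_79, so the saved optimizer state cannot be matched by name. A fresh process produces the original names again.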
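I restarted the environment on AI Studio, and I also closed the project and reopened it. I tried both and still get the same error. This is rather awkward...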
Strange. I tested locally: after restarting Jupyter Notebook, training resumes normally. On AI Studio it does not. I will keep looking into it.
Same problem here, also on 2.1. After restarting I still get the same error...
After restarting the kernel, try launching the training a few more times.
I am training on my own dataset with PaddleX's pdx.det.FasterRCNN. The data processing steps all work, but the final step,
num_classes = len(train_dataset.labels) + 1
model = pdx.det.FasterRCNN(num_classes=num_classes)
model.train(
    num_epochs=12,
    train_dataset=train_dataset,
    train_batch_size=2,
    eval_dataset=eval_dataset,
    learning_rate=0.0025,
    lr_decay_epochs=[4, 8],
    save_interval_epochs=1,
    save_dir='output/faster_rcnn_r50_fpn',
    pretrain_weights='IMAGENET',
    use_vdl=True)
fails with this error:
!!! The CPU_NUM is not specified, you should set CPU_NUM in the environment variable list.
CPU_NUM indicates that how many CPUPlace are used in the current task.
And if this parameter are set as N (equal to the number of physical CPU core) the program may be faster.
export CPU_NUM=64 # for example, set CPU_NUM as number of physical CPU core which is 64.
!!! The default number of CPU_NUM=1.
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/math_op_patch.py:416: DeprecationWarning: /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlex/cv/nets/detection/fpn.py:172
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
category=DeprecationWarning)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_336/206520623.py in <module>
11 save_dir='output/faster_rcnn_r50_fpn',
12 pretrain_weights='IMAGENET',
---> 13 use_vdl=True)
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlex/cv/models/faster_rcnn.py in train(self, num_epochs, train_dataset, train_batch_size, eval_dataset, save_interval_epochs, log_interval_steps, save_dir, pretrain_weights, optimizer, learning_rate, warmup_steps, warmup_start_lr, lr_decay_epochs, lr_decay_gamma, metric, use_vdl, early_stop, early_stop_patience, resume_checkpoint, sensitivities_file, eval_metric_loss)
347 self.optimizer = optimizer
348 # 构建训练、验证、测试网络
--> 349 self.build_program()
350 fuse_bn = True
351 if self.with_fpn and self.backbone in [
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlex/cv/models/base.py in build_program(self)
103 paddlex.model_built = True
104 # 构建训练网络
--> 105 self.train_inputs, self.train_outputs = self.build_net(mode='train')
106 self.train_prog = fluid.default_main_program()
107 startup_prog = fluid.default_startup_program()
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlex/cv/models/faster_rcnn.py in build_net(self, mode)
222 inputs = model.generate_inputs()
223 if mode == 'train':
--> 224 model_out = model.build_net(inputs)
225 loss = model_out['loss']
226 self.optimizer.minimize(loss)
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlex/cv/nets/detection/faster_rcnn.py in build_net(self, inputs)
224 body_feats, spatial_scale = self.fpn.get_output(body_feats)
225
--> 226 rois = self.rpn_head.get_proposals(body_feats, im_info, mode=self.mode)
227
228 if self.mode == 'train':
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlex/cv/nets/detection/rpn_head.py in get_proposals(self, fpn_feats, im_info, mode)
608 fpn_feat = fpn_feats[fpn_feat_name]
609 rois_fpn, roi_probs_fpn = self._get_single_proposals(
--> 610 fpn_feat, im_info, lvl, mode)
611 self.fpn_rpn_list.append((self.rpn_cls_score, self.rpn_bbox_pred))
612 rois_list.append(rois_fpn)
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlex/cv/nets/detection/rpn_head.py in _get_single_proposals(self, body_feat, im_info, feat_lvl, mode)
571 nms_thresh=self.train_nms_thresh,
572 min_size=self.train_min_size,
--> 573 eta=self.train_eta)
574 else:
575 rpn_rois_fpn, rpn_roi_prob_fpn = fluid.layers.generate_proposals(
ValueError: too many values to unpack (expected 2)
Why does this happen?
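The last line of the traceback is Python's generic unpacking error: the statement at rpn_head.py line 573 assigns the result of a call to two names, and the call produced more than two values. The sketch below only illustrates that mechanism. generate_proposals_stub is a hypothetical stand-in, not the Paddle function, and the thread does not confirm what the real call returned or why.

```python
def generate_proposals_stub():
    """Hypothetical stand-in that returns three values instead of two."""
    return "rois", "roi_probs", "rois_num"


# Unpacking three returned values into two names raises ValueError.
try:
    rois, roi_probs = generate_proposals_stub()
except ValueError as exc:
    message = str(exc)
    print(message)

# Star-unpacking accepts any extra trailing values without raising.
rois, roi_probs, *extra = generate_proposals_stub()
print(extra)
```

An error of this kind usually means the calling code and the called function disagree about the return signature, which makes comparing the installed versions of the two packages involved a reasonable first check.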
I am running paddlex on AI Studio. The version is paddlex-2.1.0.
Error message: