pgnet training with aistudio

PureHing commented 3 years ago

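Hello, I ran into the problem shown in the log below when training on AI Studio. I used the default config file and loaded the pretrained model, but training did not start. How can I fix this? Also, training on my own machine does run, but the printed results do not reach the accuracy you report.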

aistudio training log

aistudio@jupyter-54944-1860424:~/PaddleOCR$ python3 tools/train.py -c configs/e2e/e2e_r50_vd_pg.yml -o Global.pretrained_model=./pretrain_models/train_step1/best_accuracy Global.load_static_weights=False
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:26: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  def convert_to_list(value, n, name, dtype=np.int):
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/skimage/morphology/_skeletonize.py:241: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  0, 1, 1, 0, 0, 1, 0, 0, 0], dtype=np.bool)
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/skimage/morphology/_skeletonize.py:256: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=np.bool)
[2021/06/11 08:44:46] root INFO: Architecture :
[2021/06/11 08:44:46] root INFO: Backbone :
[2021/06/11 08:44:46] root INFO: layers : 50
[2021/06/11 08:44:46] root INFO: name : ResNet
[2021/06/11 08:44:46] root INFO: Head :
[2021/06/11 08:44:46] root INFO: name : PGHead
[2021/06/11 08:44:46] root INFO: Neck :
[2021/06/11 08:44:46] root INFO: name : PGFPN
[2021/06/11 08:44:46] root INFO: Transform : None
[2021/06/11 08:44:46] root INFO: algorithm : PGNet
[2021/06/11 08:44:46] root INFO: model_type : e2e
[2021/06/11 08:44:46] root INFO: Eval :
[2021/06/11 08:44:46] root INFO: dataset :
[2021/06/11 08:44:46] root INFO: data_dir : /home/aistudio/total_text/test
[2021/06/11 08:44:46] root INFO: label_file_list : ['/home/aistudio/total_text/test/test.txt']
[2021/06/11 08:44:46] root INFO: name : PGDataSet
[2021/06/11 08:44:46] root INFO: transforms :
[2021/06/11 08:44:46] root INFO: DecodeImage :
[2021/06/11 08:44:46] root INFO: channel_first : False
[2021/06/11 08:44:46] root INFO: img_mode : RGB
[2021/06/11 08:44:46] root INFO: E2ELabelEncodeTest : None
[2021/06/11 08:44:46] root INFO: E2EResizeForTest :
[2021/06/11 08:44:46] root INFO: max_side_len : 768
[2021/06/11 08:44:46] root INFO: NormalizeImage :
[2021/06/11 08:44:46] root INFO: mean : [0.485, 0.456, 0.406]
[2021/06/11 08:44:46] root INFO: order : hwc
[2021/06/11 08:44:46] root INFO: scale : 1./255.
[2021/06/11 08:44:46] root INFO: std : [0.229, 0.224, 0.225]
[2021/06/11 08:44:46] root INFO: ToCHWImage : None
[2021/06/11 08:44:46] root INFO: KeepKeys :
[2021/06/11 08:44:46] root INFO: keep_keys : ['image', 'shape', 'polys', 'texts', 'ignore_tags', 'img_id']
[2021/06/11 08:44:46] root INFO: loader :
[2021/06/11 08:44:46] root INFO: batch_size_per_card : 1
[2021/06/11 08:44:46] root INFO: drop_last : False
[2021/06/11 08:44:46] root INFO: num_workers : 1
[2021/06/11 08:44:46] root INFO: shuffle : False
[2021/06/11 08:44:46] root INFO: Global :
[2021/06/11 08:44:46] root INFO: cal_metric_during_train : False
[2021/06/11 08:44:46] root INFO: character_dict_path : ppocr/utils/ic15_dict.txt
[2021/06/11 08:44:46] root INFO: character_type : EN
[2021/06/11 08:44:46] root INFO: checkpoints : None
[2021/06/11 08:44:46] root INFO: debug : False
[2021/06/11 08:44:46] root INFO: distributed : False
[2021/06/11 08:44:46] root INFO: epoch_num : 600
[2021/06/11 08:44:46] root INFO: eval_batch_step : [0, 1000]
[2021/06/11 08:44:46] root INFO: infer_img : None
[2021/06/11 08:44:46] root INFO: load_static_weights : False
[2021/06/11 08:44:46] root INFO: log_smooth_window : 20
[2021/06/11 08:44:46] root INFO: max_text_length : 50
[2021/06/11 08:44:46] root INFO: max_text_nums : 30
[2021/06/11 08:44:46] root INFO: pretrained_model : ./pretrain_models/train_step1/best_accuracy
[2021/06/11 08:44:46] root INFO: print_batch_step : 10
[2021/06/11 08:44:46] root INFO: save_epoch_step : 10
[2021/06/11 08:44:46] root INFO: save_inference_dir : None
[2021/06/11 08:44:46] root INFO: save_model_dir : ./output/pgnet_r50_vd_totaltext/
[2021/06/11 08:44:46] root INFO: save_res_path : ./output/pgnet_r50_vd_totaltext/predicts_pgnet.txt
[2021/06/11 08:44:46] root INFO: tcl_len : 64
[2021/06/11 08:44:46] root INFO: use_gpu : True
[2021/06/11 08:44:46] root INFO: use_visualdl : False
[2021/06/11 08:44:46] root INFO: valid_set : totaltext
[2021/06/11 08:44:46] root INFO: Loss :
[2021/06/11 08:44:46] root INFO: max_text_length : 50
[2021/06/11 08:44:46] root INFO: max_text_nums : 30
[2021/06/11 08:44:46] root INFO: name : PGLoss
[2021/06/11 08:44:46] root INFO: pad_num : 36
[2021/06/11 08:44:46] root INFO: tcl_bs : 64
[2021/06/11 08:44:46] root INFO: Metric :
[2021/06/11 08:44:46] root INFO: character_dict_path : ppocr/utils/ic15_dict.txt
[2021/06/11 08:44:46] root INFO: gt_mat_dir : ./train_data/total_text/gt
[2021/06/11 08:44:46] root INFO: main_indicator : f_score_e2e
[2021/06/11 08:44:46] root INFO: mode : A
[2021/06/11 08:44:46] root INFO: name : E2EMetric
[2021/06/11 08:44:46] root INFO: Optimizer :
[2021/06/11 08:44:46] root INFO: beta1 : 0.9
[2021/06/11 08:44:46] root INFO: beta2 : 0.999
[2021/06/11 08:44:46] root INFO: lr :
[2021/06/11 08:44:46] root INFO: learning_rate : 0.001
[2021/06/11 08:44:46] root INFO: name : Adam
[2021/06/11 08:44:46] root INFO: regularizer :
[2021/06/11 08:44:46] root INFO: factor : 0
[2021/06/11 08:44:46] root INFO: name : L2
[2021/06/11 08:44:46] root INFO: PostProcess :
[2021/06/11 08:44:46] root INFO: mode : fast
[2021/06/11 08:44:46] root INFO: name : PGPostProcess
[2021/06/11 08:44:46] root INFO: score_thresh : 0.5
[2021/06/11 08:44:46] root INFO: Train :
[2021/06/11 08:44:46] root INFO: dataset :
[2021/06/11 08:44:46] root INFO: data_dir : /home/aistudio/total_text/train
[2021/06/11 08:44:46] root INFO: label_file_list : ['/home/aistudio/total_text/train/train.txt']
[2021/06/11 08:44:46] root INFO: name : PGDataSet
[2021/06/11 08:44:46] root INFO: ratio_list : [1.0]
[2021/06/11 08:44:46] root INFO: transforms :
[2021/06/11 08:44:46] root INFO: DecodeImage :
[2021/06/11 08:44:46] root INFO: channel_first : False
[2021/06/11 08:44:46] root INFO: img_mode : BGR
[2021/06/11 08:44:46] root INFO: E2ELabelEncodeTrain : None
[2021/06/11 08:44:46] root INFO: PGProcessTrain :
[2021/06/11 08:44:46] root INFO: batch_size : 14
[2021/06/11 08:44:46] root INFO: max_text_size : 512
[2021/06/11 08:44:46] root INFO: min_crop_size : 24
[2021/06/11 08:44:46] root INFO: min_text_size : 4
[2021/06/11 08:44:46] root INFO: KeepKeys :
[2021/06/11 08:44:46] root INFO: keep_keys : ['images', 'tcl_maps', 'tcl_label_maps', 'border_maps', 'direction_maps', 'training_masks', 'label_list', 'pos_list', 'pos_mask']
[2021/06/11 08:44:46] root INFO: loader :
[2021/06/11 08:44:46] root INFO: batch_size_per_card : 14
[2021/06/11 08:44:46] root INFO: drop_last : True
[2021/06/11 08:44:46] root INFO: num_workers : 2
[2021/06/11 08:44:46] root INFO: shuffle : True
[2021/06/11 08:44:46] root INFO: train with paddle 2.0.2 and device CUDAPlace(0)
[2021/06/11 08:44:46] root INFO: Initialize indexs of datasets:['/home/aistudio/total_text/train/train.txt']
[2021/06/11 08:44:46] root INFO: Initialize indexs of datasets:['/home/aistudio/total_text/test/test.txt']
W0611 08:44:46.740801 627 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.1, Runtime API Version: 10.1
W0611 08:44:46.745793 627 device_context.cc:372] device: 0, cuDNN Version: 7.6.
[2021/06/11 08:44:53] root INFO: load pretrained model from ['./pretrain_models/train_step1/best_accuracy']
[2021/06/11 08:44:53] root INFO: train dataloader has 89 iters
[2021/06/11 08:44:53] root INFO: valid dataloader has 300 iters
[2021/06/11 08:44:53] root INFO: During the training process, after the 0th iteration, an evaluation is run every 1000 iterations
[2021/06/11 08:44:53] root INFO: Initialize indexs of datasets:['/home/aistudio/total_text/train/train.txt']
2021-06-11 08:44:59,061 - ERROR - DataLoader reader thread raised an exception!
Traceback (most recent call last):
  File "tools/train.py", line 125, in <module>
    main(config, device, logger, vdl_writer)
  File "tools/train.py", line 102, in main
    eval_class, pre_best_model_dict, logger, vdl_writer)
  File "/home/aistudio/PaddleOCR/tools/program.py", line 204, in train
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 684, in _get_data
    data = self._data_queue.get(timeout=self._timeout)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/multiprocessing/queues.py", line 105, in get
    raise Empty
_queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 616, in _thread_loop
    batch = self._get_data()
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 700, in _get_data
    "pids: {}".format(len(failed_workers), pids))
RuntimeError: DataLoader 2 workers exit unexpectedly, pids: 728, 729
    for idx, batch in enumerate(train_dataloader):
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 779, in __next__
    data = self._reader.read_next_var_list()
SystemError: (Fatal) Blocking queue is killed because the data reader raises an exception.
  [Hint: Expected killed_ != true, but received killed_:1 == true:1.] (at /paddle/paddle/fluid/operators/reader/blocking_queue.h:158)

WenmuZhou commented 3 years ago

num_workers needs to be set to 0.
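As a rough illustration of that change: the training command in the log already uses `-o` with dotted keys (`Global.pretrained_model=...`), so the loader settings can presumably be overridden the same way. The sketch below only mimics that behaviour on a plain dictionary. `apply_overrides` is a hypothetical helper written for this example, not a PaddleOCR function, and the nesting `Train.loader.num_workers` / `Eval.loader.num_workers` is inferred from the config dump in the log.

```python
def apply_overrides(config, overrides):
    """Apply dotted-key overrides (e.g. 'Train.loader.num_workers') to a nested dict."""
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = config
        for key in parents:
            # Walk down the nesting, creating levels that do not exist yet.
            node = node.setdefault(key, {})
        node[leaf] = value
    return config


# Loader settings copied from the training log above.
config = {
    "Train": {"loader": {"batch_size_per_card": 14, "drop_last": True,
                         "num_workers": 2, "shuffle": True}},
    "Eval": {"loader": {"batch_size_per_card": 1, "drop_last": False,
                        "num_workers": 1, "shuffle": False}},
}

# num_workers=0 loads data in the main process instead of worker subprocesses.
overrides = {
    "Train.loader.num_workers": 0,
    "Eval.loader.num_workers": 0,
}
apply_overrides(config, overrides)

# The same overrides written as arguments that could follow the -o flag.
cli_args = " ".join(f"{key}={value}" for key, value in overrides.items())
print(cli_args)
```

Editing the two `num_workers` entries directly in `configs/e2e/e2e_r50_vd_pg.yml` should have the same effect. Loading in the main process is usually slower, so this trades throughput for stability.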

LDOUBLEV commented 3 years ago

AI Studio has relatively little shared memory; you can set use_shared_memory to False in the config file.
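To see how much shared memory an environment actually offers, one option is to inspect the shared-memory mount from Python. This is a minimal sketch that assumes a Linux system where shared memory is mounted at `/dev/shm`. `shared_memory_gib` is a hypothetical helper name, and the thread gives no threshold for what counts as "too small".

```python
import os
import shutil


def shared_memory_gib(path="/dev/shm"):
    """Return (total, free) size of the mount at `path` in GiB, or None if it is absent."""
    if not os.path.isdir(path):
        return None
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return usage.total / gib, usage.free / gib


sizes = shared_memory_gib()
if sizes is None:
    print("/dev/shm not found on this system")
else:
    total, free = sizes
    print(f"/dev/shm: {total:.2f} GiB total, {free:.2f} GiB free")
```

If the reported size is small compared with what the worker processes need to pass batches around, disabling shared memory as suggested above is a reasonable thing to try. The exact position of the key in the config is not stated in the thread; placing `use_shared_memory: False` under `Train.loader` (and `Eval.loader`) is an assumption.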

PureHing commented 3 years ago

@WenmuZhou @LDOUBLEV OK, thanks.

Accuracy log

[2021/06/14 17:37:16] root INFO: save model in ./output/pgnet_r50_vd_totaltext/latest
[2021/06/14 17:37:18] root INFO: save model in ./output/pgnet_r50_vd_totaltext/iter_epoch_600
[2021/06/14 17:37:18] root INFO: best metric, f_score_e2e: 0.3859576235978396, total_num_gt: 2543, total_num_det: 2271, global_accumulative_recall: 1655.599999999995, hit_str_count: 929, recall: 0.651042076287847, precision: 0.7409951563188005, f_score: 0.6931122441109999, seqerr: 0.43887412418458394, recall_e2e: 0.3653165552497051, precision_e2e: 0.4090708938793483, fps: 15.278823602738289, best_epoch: 589,

Is there any way to improve this training accuracy? It is much lower than the figure in the README.
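To make the logged numbers easier to interpret, the sketch below recomputes several of them from the raw counts on the same log line. The relations used (ratios of the counts, and F-scores as harmonic means) are inferred because they reproduce the logged values. They are not taken from the E2EMetric source code, so treat them as an assumption.

```python
import math

# Values copied from the "best metric" log line above.
logged = {
    "total_num_gt": 2543,
    "total_num_det": 2271,
    "global_accumulative_recall": 1655.599999999995,
    "hit_str_count": 929,
    "recall": 0.651042076287847,
    "precision": 0.7409951563188005,
    "f_score": 0.6931122441109999,
    "seqerr": 0.43887412418458394,
    "recall_e2e": 0.3653165552497051,
    "precision_e2e": 0.4090708938793483,
    "f_score_e2e": 0.3859576235978396,
}


def harmonic_mean(p, r):
    """F-score of a precision/recall pair; 0.0 when both are zero."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


# Assumed relations between the logged quantities.
derived = {
    "recall": logged["global_accumulative_recall"] / logged["total_num_gt"],
    "seqerr": 1 - logged["hit_str_count"] / logged["global_accumulative_recall"],
    "recall_e2e": logged["hit_str_count"] / logged["total_num_gt"],
    "precision_e2e": logged["hit_str_count"] / logged["total_num_det"],
    "f_score": harmonic_mean(logged["precision"], logged["recall"]),
    "f_score_e2e": harmonic_mean(logged["precision_e2e"], logged["recall_e2e"]),
}

for name, value in derived.items():
    match = math.isclose(value, logged[name], rel_tol=1e-6)
    print(f"{name:14s} derived={value:.6f} logged={logged[name]:.6f} match={match}")
```

Under this reading, detection alone reaches an F-score of about 0.69, and a `seqerr` of about 0.44 means that roughly 44% of the matched text is transcribed incorrectly. Both detection and recognition would therefore contribute to the low `f_score_e2e`.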

PureHing commented 3 years ago

Hello, could you please take a look?

PureHing commented 3 years ago

@LDOUBLEV Hello, could you please take a look?

Evezerest commented 3 years ago

Maybe try adjusting things like the data augmentation and the optimizer?
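The reply does not say which optimizer settings to change. One adjustment that is often tried is replacing the constant learning rate from the log (0.001) with a warmup followed by cosine decay. The sketch below is plain Python and independent of any PaddleOCR API. The base rate, 600 epochs, and 89 iterations per epoch come from the log, while the two-epoch warmup is an arbitrary illustrative choice.

```python
import math

BASE_LR = 0.001          # Optimizer.lr.learning_rate in the log
ITERS_PER_EPOCH = 89     # "train dataloader has 89 iters"
EPOCHS = 600             # Global.epoch_num
WARMUP_EPOCHS = 2        # illustrative assumption, not from the thread

TOTAL_STEPS = EPOCHS * ITERS_PER_EPOCH
WARMUP_STEPS = WARMUP_EPOCHS * ITERS_PER_EPOCH


def lr_at(step):
    """Learning rate at a given global step: linear warmup, then cosine decay to 0."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * BASE_LR * (1.0 + math.cos(math.pi * progress))


# Print the learning rate at a few points across training.
for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(f"step {step:6d}: lr = {lr_at(step):.6f}")
```

Whether such a schedule closes the gap to the README accuracy is not established in this thread. It is only one concrete form the suggested tuning could take.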

paddle-bot-old[bot] commented 2 years ago

Since you haven't replied for more than 3 months, we have closed this issue/pr. If the problem is not solved or there is a follow-up one, please reopen it at any time and we will continue to follow up. It is recommended to pull and try the latest code first.

PaddlePaddle / PaddleOCR

pgnet training with aistudio #3086