vinceyzw closed this issue 1 year ago.
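What GPU model does your server have? Which PaddleOCR version are you using?
Apart from changing the code paths, did you modify anything else?
Could you package the AI Studio code, download it, and try training it on your server?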
The GPU is a T4, the PaddleOCR version is release/2.1, and I only changed the data paths.
I also just tried packaging the AI Studio code, downloading it, and training it on my own server, and it still doesn't work. @LDOUBLEV
It still feels like an environment problem! My conda environment is as follows:
_libgcc_mutex 0.1 main defaults
appdirs 1.4.4
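Because the suspicion here is an environment mismatch, one way to narrow it down is to diff the package lists of the two machines. The following is a minimal sketch, not part of PaddleOCR: `parse_conda_list` and `diff_conda_lists` are hypothetical helper names, and the parser assumes the whitespace-separated `name version build channel` layout that `conda list` prints, with `#` header lines.

```python
from typing import Dict, List, Tuple


def parse_conda_list(text: str) -> Dict[str, str]:
    """Map package name -> version from `conda list`-style text.

    Assumption: each non-comment line is `name version [build] [channel]`,
    separated by whitespace; lines starting with '#' are headers.
    """
    packages: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and conda's header/comment lines
        fields = line.split()
        if len(fields) >= 2:
            packages[fields[0]] = fields[1]
    return packages


def diff_conda_lists(a_text: str, b_text: str) -> List[Tuple[str, str, str]]:
    """Return (name, version_in_a, version_in_b) for every package that differs.

    A package present on only one side is reported with version '-' on the other.
    """
    a, b = parse_conda_list(a_text), parse_conda_list(b_text)
    diffs: List[Tuple[str, str, str]] = []
    for name in sorted(set(a) | set(b)):
        va, vb = a.get(name, "-"), b.get(name, "-")
        if va != vb:
            diffs.append((name, va, vb))
    return diffs
```

Feeding it the listing from the server and the listing from AI Studio would print only the packages whose versions differ, which is usually a much shorter list to reason about than two full environments.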
@LDOUBLEV, could you please take another look?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
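Background: I am taking part in the PaddlePaddle Chinese Lightweight Text Recognition Competition and using the config file rec_chinese_lite_train_v2.0.yml. On AI Studio it trains normally, reaching an acc of about 0.39 by around epoch 24. On my own server, however, with a PaddlePaddle environment set up there, I have trained many times, and after 500 epochs the acc is only about 0.005. I also tried rec_chinese_common_train_v2.0.yml on my own server, and that config trains normally.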
Server environment: CUDA 10.1, paddlepaddle-gpu 2.0.2
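Since rec_chinese_common_train_v2.0.yml trains on the same server while rec_chinese_lite_train_v2.0.yml does not, one possible follow-up is to compare the two configs setting by setting. A minimal sketch, assuming both YAML files have already been loaded into nested dicts (for example with PyYAML's `yaml.safe_load`); `flatten`, `diff_configs` and `MISSING` are hypothetical names, not PaddleOCR APIs.

```python
from typing import Any, Dict, List, Tuple

MISSING = "<missing>"  # placeholder for a key present in only one config


def flatten(cfg: Any, prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into dotted keys.

    {'Neck': {'hidden_size': 48}} -> {'Neck.hidden_size': 48}
    Lists (such as a `transforms` list) are kept as single leaf values.
    """
    if not isinstance(cfg, dict):
        return {prefix: cfg}
    flat: Dict[str, Any] = {}
    for key, value in cfg.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(value, path))
    return flat


def diff_configs(a: Dict[str, Any], b: Dict[str, Any]) -> List[Tuple[str, Any, Any]]:
    """Return (dotted_key, value_in_a, value_in_b) for every setting that differs."""
    fa, fb = flatten(a), flatten(b)
    return [
        (key, fa.get(key, MISSING), fb.get(key, MISSING))
        for key in sorted(set(fa) | set(fb))
        if fa.get(key, MISSING) != fb.get(key, MISSING)
    ]
```

Flattening to dotted keys such as `Architecture.Neck.hidden_size` keeps the output readable even for deeply nested sections; list-valued entries are compared as a whole, so a difference anywhere inside a transform list shows up as one changed key.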
Below is the log from one fresh training run on my own server. It is currently at epoch 43, and acc is still only 0.0039:
[2021/05/26 18:21:44] root INFO: Architecture :
[2021/05/26 18:21:44] root INFO: Backbone :
[2021/05/26 18:21:44] root INFO: model_name : small
[2021/05/26 18:21:44] root INFO: name : MobileNetV3
[2021/05/26 18:21:44] root INFO: scale : 0.5
[2021/05/26 18:21:44] root INFO: small_stride : [1, 2, 2, 2]
[2021/05/26 18:21:44] root INFO: Head :
[2021/05/26 18:21:44] root INFO: fc_decay : 1e-05
[2021/05/26 18:21:44] root INFO: name : CTCHead
[2021/05/26 18:21:44] root INFO: Neck :
[2021/05/26 18:21:44] root INFO: encoder_type : rnn
[2021/05/26 18:21:44] root INFO: hidden_size : 48
[2021/05/26 18:21:44] root INFO: name : SequenceEncoder
[2021/05/26 18:21:44] root INFO: Transform : None
[2021/05/26 18:21:44] root INFO: algorithm : CRNN
[2021/05/26 18:21:44] root INFO: model_type : rec
[2021/05/26 18:21:44] root INFO: Eval :
[2021/05/26 18:21:44] root INFO: dataset :
[2021/05/26 18:21:44] root INFO: data_dir : /ssd/lost+found/data/ppocr/训练数据集/TrainImages/
[2021/05/26 18:21:44] root INFO: label_file_list : ['/ssd/lost+found/data/ppocr/训练数据集/LabelTrain.txt']
[2021/05/26 18:21:44] root INFO: name : SimpleDataSet
[2021/05/26 18:21:44] root INFO: transforms :
[2021/05/26 18:21:44] root INFO: DecodeImage :
[2021/05/26 18:21:44] root INFO: channel_first : False
[2021/05/26 18:21:44] root INFO: img_mode : BGR
[2021/05/26 18:21:44] root INFO: CTCLabelEncode : None
[2021/05/26 18:21:44] root INFO: RecResizeImg :
[2021/05/26 18:21:44] root INFO: image_shape : [3, 32, 320]
[2021/05/26 18:21:44] root INFO: KeepKeys :
[2021/05/26 18:21:44] root INFO: keep_keys : ['image', 'label', 'length']
[2021/05/26 18:21:44] root INFO: loader :
[2021/05/26 18:21:44] root INFO: batch_size_per_card : 256
[2021/05/26 18:21:44] root INFO: drop_last : False
[2021/05/26 18:21:44] root INFO: num_workers : 0
[2021/05/26 18:21:44] root INFO: shuffle : False
[2021/05/26 18:21:44] root INFO: Global :
[2021/05/26 18:21:44] root INFO: cal_metric_during_train : True
[2021/05/26 18:21:44] root INFO: character_dict_path : ppocr/utils/ppocr_keys_v1.txt
[2021/05/26 18:21:44] root INFO: character_type : ch
[2021/05/26 18:21:44] root INFO: checkpoints : None
[2021/05/26 18:21:44] root INFO: debug : False
[2021/05/26 18:21:44] root INFO: distributed : False
[2021/05/26 18:21:44] root INFO: epoch_num : 500
[2021/05/26 18:21:44] root INFO: eval_batch_step : [0, 2000]
[2021/05/26 18:21:44] root INFO: infer_img : doc/imgs_words/ch/word_1.jpg
[2021/05/26 18:21:44] root INFO: infer_mode : False
[2021/05/26 18:21:44] root INFO: log_smooth_window : 20
[2021/05/26 18:21:44] root INFO: max_text_length : 25
[2021/05/26 18:21:44] root INFO: pretrained_model : None
[2021/05/26 18:21:44] root INFO: print_batch_step : 10
[2021/05/26 18:21:44] root INFO: save_epoch_step : 3
[2021/05/26 18:21:44] root INFO: save_inference_dir : None
[2021/05/26 18:21:44] root INFO: save_model_dir : ./output/rec_chinese_lite_v2.0
[2021/05/26 18:21:44] root INFO: save_res_path : ./output/rec/predicts_chinese_lite_v2.0.txt
[2021/05/26 18:21:44] root INFO: use_gpu : True
[2021/05/26 18:21:44] root INFO: use_space_char : True
[2021/05/26 18:21:44] root INFO: use_visualdl : False
[2021/05/26 18:21:44] root INFO: Loss :
[2021/05/26 18:21:44] root INFO: name : CTCLoss
[2021/05/26 18:21:44] root INFO: Metric :
[2021/05/26 18:21:44] root INFO: main_indicator : acc
[2021/05/26 18:21:44] root INFO: name : RecMetric
[2021/05/26 18:21:44] root INFO: Optimizer :
[2021/05/26 18:21:44] root INFO: beta1 : 0.9
[2021/05/26 18:21:44] root INFO: beta2 : 0.999
[2021/05/26 18:21:44] root INFO: lr :
[2021/05/26 18:21:44] root INFO: learning_rate : 0.001
[2021/05/26 18:21:44] root INFO: name : Cosine
[2021/05/26 18:21:44] root INFO: name : Adam
[2021/05/26 18:21:44] root INFO: regularizer :
[2021/05/26 18:21:44] root INFO: factor : 1e-05
[2021/05/26 18:21:44] root INFO: name : L2
[2021/05/26 18:21:44] root INFO: PostProcess :
[2021/05/26 18:21:44] root INFO: name : CTCLabelDecode
[2021/05/26 18:21:44] root INFO: Train :
[2021/05/26 18:21:44] root INFO: dataset :
[2021/05/26 18:21:44] root INFO: data_dir : /ssd/lost+found/data/ppocr/训练数据集/TrainImages/
[2021/05/26 18:21:44] root INFO: label_file_list : ['/ssd/lost+found/data/ppocr/训练数据集/LabelTrain.txt']
[2021/05/26 18:21:44] root INFO: name : SimpleDataSet
[2021/05/26 18:21:44] root INFO: transforms :
[2021/05/26 18:21:44] root INFO: DecodeImage :
[2021/05/26 18:21:44] root INFO: channel_first : False
[2021/05/26 18:21:44] root INFO: img_mode : BGR
[2021/05/26 18:21:44] root INFO: RecAug : None
[2021/05/26 18:21:44] root INFO: CTCLabelEncode : None
[2021/05/26 18:21:44] root INFO: RecResizeImg :
[2021/05/26 18:21:44] root INFO: image_shape : [3, 32, 320]
[2021/05/26 18:21:44] root INFO: KeepKeys :
[2021/05/26 18:21:44] root INFO: keep_keys : ['image', 'label', 'length']
[2021/05/26 18:21:44] root INFO: loader :
[2021/05/26 18:21:44] root INFO: batch_size_per_card : 256
[2021/05/26 18:21:44] root INFO: drop_last : True
[2021/05/26 18:21:44] root INFO: num_workers : 0
[2021/05/26 18:21:44] root INFO: shuffle : True
[2021/05/26 18:21:44] root INFO: train with paddle 2.0.2 and device CUDAPlace(0)
[2021/05/26 18:21:44] root INFO: Initialize indexs of datasets:['/ssd/lost+found/data/ppocr/训练数据集/LabelTrain.txt']
[2021/05/26 18:21:45] root INFO: Initialize indexs of datasets:['/ssd/lost+found/data/ppocr/训练数据集/LabelTrain.txt']
W0526 18:21:45.201925 29486 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 10.2, Runtime API Version: 10.1
W0526 18:21:45.215572 29486 device_context.cc:372] device: 0, cuDNN Version: 7.6.
/root/anaconda3/envs/pp/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:26: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  def convert_to_list(value, n, name, dtype=np.int):
/root/anaconda3/envs/pp/lib/python3.7/site-packages/skimage/morphology/skeletonize.py:241: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  0, 1, 1, 0, 0, 1, 0, 0, 0], dtype=np.bool)
/root/anaconda3/envs/pp/lib/python3.7/site-packages/skimage/morphology/_skeletonize.py:256: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=np.bool)
2021-05-26 18:21:53,201 - INFO - If regularizer of a Parameter has been set by 'paddle.ParamAttr' or 'static.WeightNormParamAttr' already. The weight_decay[L2Decay, regularization_coeff=0.000010] in Optimizer will not take effect, and it will only be applied to other Parameters!
[2021/05/26 18:21:53] root INFO: train from scratch
[2021/05/26 18:21:53] root INFO: train dataloader has 390 iters
[2021/05/26 18:21:53] root INFO: valid dataloader has 391 iters
[2021/05/26 18:21:53] root INFO: During the training process, after the 0th iteration, an evaluation is run every 2000 iterations
[2021/05/26 18:21:53] root INFO: Initialize indexs of datasets:['/ssd/lost+found/data/ppocr/训练数据集/LabelTrain.txt']
[2021/05/26 18:22:30] root INFO: epoch: [1/500], iter: 10, lr: 0.001000, loss: 613.501282, acc: 0.000000, norm_edit_dis: 0.000000, reader_cost: 1.13828 s, batch_cost: 2.78723 s, samples: 2816, ips: 101.03228
[2021/05/26 18:23:00] root INFO: epoch: [1/500], iter: 20, lr: 0.001000, loss: 491.363159, acc: 0.000000, norm_edit_dis: 0.000000, reader_cost: 0.80013 s, batch_cost: 2.13014 s, samples: 2560, ips: 120.17969
[2021/05/26 18:23:30] root INFO: epoch: [1/500], iter: 30, lr: 0.001000, loss: 249.973083, acc: 0.000000, norm_edit_dis: 0.000000, reader_cost: 0.81321 s, batch_cost: 2.21799 s, samples: 2560, ips: 115.41979
[2021/05/26 18:24:02] root INFO: epoch: [1/500], iter: 40, lr: 0.001000, loss: 86.003624, acc: 0.000000, norm_edit_dis: 0.000000, reader_cost: 0.95516 s, batch_cost: 2.33469 s, samples: 2560, ips: 109.65040
[2021/05/26 18:24:33] root INFO: epoch: [1/500], iter: 50, lr: 0.001000, loss: 48.453705, acc: 0.000000, norm_edit_dis: 0.000000, reader_cost: 0.89787 s, batch_cost: 2.22190 s, samples: 2560, ips: 115.21647
[2021/05/26 18:25:03] root INFO: epoch: [1/500], iter: 60, lr: 0.001000, loss: 44.564922, acc: 0.000000, norm_edit_dis: 0.000000, reader_cost: 0.91991 s, batch_cost: 2.25347 s, samples: 2560, ips: 113.60247
……………………
[2021/05/27 08:59:56] root INFO: epoch: [43/500], iter: 16580, lr: 0.000982, loss: 34.535393, acc: 0.003906, norm_edit_dis: 0.023424, reader_cost: 0.80111 s, batch_cost: 2.13079 s, samples: 2560, ips: 120.14346
[2021/05/27 09:00:26] root INFO: epoch: [43/500], iter: 16590, lr: 0.000982, loss: 33.818520, acc: 0.005859, norm_edit_dis: 0.022587, reader_cost: 0.90223 s, batch_cost: 2.18736 s, samples: 2560, ips: 117.03613
[2021/05/27 09:00:55] root INFO: epoch: [43/500], iter: 16600, lr: 0.000982, loss: 34.028244, acc: 0.003906, norm_edit_dis: 0.023617, reader_cost: 0.77060 s, batch_cost: 2.06542 s, samples: 2560, ips: 123.94556
[2021/05/27 09:01:25] root INFO: epoch: [43/500], iter: 16610, lr: 0.000982, loss: 34.419975, acc: 0.003906, norm_edit_dis: 0.022949, reader_cost: 0.88083 s, batch_cost: 2.18148 s, samples: 2560, ips: 117.35151
[2021/05/27 09:01:55] root INFO: epoch: [43/500], iter: 16620, lr: 0.000982, loss: 34.603676, acc: 0.003906, norm_edit_dis: 0.022959, reader_cost: 0.91563 s, batch_cost: 2.20875 s, samples: 2560, ips: 115.90263
[2021/05/27 09:02:26] root INFO: epoch: [43/500], iter: 16630, lr: 0.000982, loss: 35.354607, acc: 0.005859, norm_edit_dis: 0.022959, reader_cost: 0.89960 s, batch_cost: 2.21324 s, samples: 2560, ips: 115.66758
[2021/05/27 09:02:55] root INFO: epoch: [43/500], iter: 16640, lr: 0.000982, loss: 34.920311, acc: 0.007812, norm_edit_dis: 0.023994, reader_cost: 0.81060 s, batch_cost: 2.10501 s, samples: 2560, ips: 121.61464
[2021/05/27 09:03:26] root INFO: epoch: [43/500], iter: 16650, lr: 0.000982, loss: 34.196205, acc: 0.007812, norm_edit_dis: 0.023608, reader_cost: 1.02603 s, batch_cost: 2.33794 s, samples: 2560, ips: 109.49800
[2021/05/27 09:03:57] root INFO: epoch: [43/500], iter: 16660, lr: 0.000982, loss: 34.853157, acc: 0.003906, norm_edit_dis: 0.021517, reader_cost: 0.93489 s, batch_cost: 2.25111 s, samples: 2560, ips: 113.72191
[2021/05/27 09:04:28] root INFO: epoch: [43/500], iter: 16670, lr: 0.000982, loss: 34.970486, acc: 0.003906, norm_edit_dis: 0.021613, reader_cost: 0.95125 s, batch_cost: 2.24483 s, samples: 2560, ips: 114.03992
[2021/05/27 09:04:58] root INFO: epoch: [43/500], iter: 16680, lr: 0.000982, loss: 34.865288, acc: 0.003906, norm_edit_dis: 0.021822, reader_cost: 0.97825 s, batch_cost: 2.26213 s, samples: 2560, ips: 113.16753
[2021/05/27 09:05:31] root INFO: epoch: [43/500], iter: 16690, lr: 0.000982, loss: 34.931526, acc: 0.003906, norm_edit_dis: 0.023963, reader_cost: 1.03104 s, batch_cost: 2.39247 s, samples: 2560, ips: 107.00237

I have already tried five or six times and have also reinstalled the Paddle environment, but the model never starts learning. I would be grateful for any help. Thanks!
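With a log this long, it can help to pull out the acc trend per epoch instead of reading it by eye. A minimal sketch: the regular expression is written against the training-line format shown in the log above (`epoch: [x/y], iter: n, lr: ..., loss: ..., acc: ...`) and is an assumption, not an official PaddleOCR log parser; `TrainStep`, `parse_train_log` and `max_acc_by_epoch` are hypothetical names.

```python
import re
from typing import Dict, List, NamedTuple


class TrainStep(NamedTuple):
    epoch: int
    iteration: int
    loss: float
    acc: float


# Assumed format, taken from the training lines in this log, e.g.
# "epoch: [43/500], iter: 16580, lr: 0.000982, loss: 34.535393, acc: 0.003906"
STEP_RE = re.compile(
    r"epoch: \[(\d+)/\d+\], iter: (\d+), lr: [^,]+, "
    r"loss: ([\d.]+), acc: ([\d.]+)"
)


def parse_train_log(text: str) -> List[TrainStep]:
    """Extract every training step; works whether or not entries are on separate lines."""
    return [
        TrainStep(int(m[1]), int(m[2]), float(m[3]), float(m[4]))
        for m in STEP_RE.finditer(text)
    ]


def max_acc_by_epoch(steps: List[TrainStep]) -> Dict[int, float]:
    """Highest acc printed on training lines in each epoch (not the periodic eval result)."""
    best: Dict[int, float] = {}
    for step in steps:
        best[step.epoch] = max(best.get(step.epoch, 0.0), step.acc)
    return best
```

Applied to the excerpt above, the epoch-43 lines would peak at 0.007812, which matches the values visible in the log; running it over the full log file would show whether acc moves at all between epochs.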