vinceyzw commented 3 years ago

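Background: I am taking part in the PaddlePaddle Chinese lightweight text recognition competition, using the config file rec_chinese_lite_train_v2.0.yml. On AI Studio it trains normally and reaches an acc of about 0.39 at around epoch 24. On my own server, where I set up the PaddlePaddle environment myself, I have trained many times, and after 500 epochs acc is only about 0.005. I also tried rec_chinese_common_train_v2.0.yml on the same server, and that one trains normally.

Server environment: CUDA 10.1, paddlepaddle-gpu 2.0.2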

Below is the log from one re-run on my own server. It has now reached epoch 43, and acc is still only 0.0039.

[2021/05/26 18:21:44] root INFO: Architecture :
[2021/05/26 18:21:44] root INFO:     Backbone :
[2021/05/26 18:21:44] root INFO:         model_name : small
[2021/05/26 18:21:44] root INFO:         name : MobileNetV3
[2021/05/26 18:21:44] root INFO:         scale : 0.5
[2021/05/26 18:21:44] root INFO:         small_stride : [1, 2, 2, 2]
[2021/05/26 18:21:44] root INFO:     Head :
[2021/05/26 18:21:44] root INFO:         fc_decay : 1e-05
[2021/05/26 18:21:44] root INFO:         name : CTCHead
[2021/05/26 18:21:44] root INFO:     Neck :
[2021/05/26 18:21:44] root INFO:         encoder_type : rnn
[2021/05/26 18:21:44] root INFO:         hidden_size : 48
[2021/05/26 18:21:44] root INFO:         name : SequenceEncoder
[2021/05/26 18:21:44] root INFO:     Transform : None
[2021/05/26 18:21:44] root INFO:     algorithm : CRNN
[2021/05/26 18:21:44] root INFO:     model_type : rec
[2021/05/26 18:21:44] root INFO: Eval :
[2021/05/26 18:21:44] root INFO:     dataset :
[2021/05/26 18:21:44] root INFO:         data_dir : /ssd/lost+found/data/ppocr/训练数据集/TrainImages/
[2021/05/26 18:21:44] root INFO:         label_file_list : ['/ssd/lost+found/data/ppocr/训练数据集/LabelTrain.txt']
[2021/05/26 18:21:44] root INFO:         name : SimpleDataSet
[2021/05/26 18:21:44] root INFO:         transforms :
[2021/05/26 18:21:44] root INFO:             DecodeImage :
[2021/05/26 18:21:44] root INFO:                 channel_first : False
[2021/05/26 18:21:44] root INFO:                 img_mode : BGR
[2021/05/26 18:21:44] root INFO:             CTCLabelEncode : None
[2021/05/26 18:21:44] root INFO:             RecResizeImg :
[2021/05/26 18:21:44] root INFO:                 image_shape : [3, 32, 320]
[2021/05/26 18:21:44] root INFO:             KeepKeys :
[2021/05/26 18:21:44] root INFO:                 keep_keys : ['image', 'label', 'length']
[2021/05/26 18:21:44] root INFO:     loader :
[2021/05/26 18:21:44] root INFO:         batch_size_per_card : 256
[2021/05/26 18:21:44] root INFO:         drop_last : False
[2021/05/26 18:21:44] root INFO:         num_workers : 0
[2021/05/26 18:21:44] root INFO:         shuffle : False
[2021/05/26 18:21:44] root INFO: Global :
[2021/05/26 18:21:44] root INFO:     cal_metric_during_train : True
[2021/05/26 18:21:44] root INFO:     character_dict_path : ppocr/utils/ppocr_keys_v1.txt
[2021/05/26 18:21:44] root INFO:     character_type : ch
[2021/05/26 18:21:44] root INFO:     checkpoints : None
[2021/05/26 18:21:44] root INFO:     debug : False
[2021/05/26 18:21:44] root INFO:     distributed : False
[2021/05/26 18:21:44] root INFO:     epoch_num : 500
[2021/05/26 18:21:44] root INFO:     eval_batch_step : [0, 2000]
[2021/05/26 18:21:44] root INFO:     infer_img : doc/imgs_words/ch/word_1.jpg
[2021/05/26 18:21:44] root INFO:     infer_mode : False
[2021/05/26 18:21:44] root INFO:     log_smooth_window : 20
[2021/05/26 18:21:44] root INFO:     max_text_length : 25
[2021/05/26 18:21:44] root INFO:     pretrained_model : None
[2021/05/26 18:21:44] root INFO:     print_batch_step : 10
[2021/05/26 18:21:44] root INFO:     save_epoch_step : 3
[2021/05/26 18:21:44] root INFO:     save_inference_dir : None
[2021/05/26 18:21:44] root INFO:     save_model_dir : ./output/rec_chinese_lite_v2.0
[2021/05/26 18:21:44] root INFO:     save_res_path : ./output/rec/predicts_chinese_lite_v2.0.txt
[2021/05/26 18:21:44] root INFO:     use_gpu : True
[2021/05/26 18:21:44] root INFO:     use_space_char : True
[2021/05/26 18:21:44] root INFO:     use_visualdl : False
[2021/05/26 18:21:44] root INFO: Loss :
[2021/05/26 18:21:44] root INFO:     name : CTCLoss
[2021/05/26 18:21:44] root INFO: Metric :
[2021/05/26 18:21:44] root INFO:     main_indicator : acc
[2021/05/26 18:21:44] root INFO:     name : RecMetric
[2021/05/26 18:21:44] root INFO: Optimizer :
[2021/05/26 18:21:44] root INFO:     beta1 : 0.9
[2021/05/26 18:21:44] root INFO:     beta2 : 0.999
[2021/05/26 18:21:44] root INFO:     lr :
[2021/05/26 18:21:44] root INFO:         learning_rate : 0.001
[2021/05/26 18:21:44] root INFO:         name : Cosine
[2021/05/26 18:21:44] root INFO:     name : Adam
[2021/05/26 18:21:44] root INFO:     regularizer :
[2021/05/26 18:21:44] root INFO:         factor : 1e-05
[2021/05/26 18:21:44] root INFO:         name : L2
[2021/05/26 18:21:44] root INFO: PostProcess :
[2021/05/26 18:21:44] root INFO:     name : CTCLabelDecode
[2021/05/26 18:21:44] root INFO: Train :
[2021/05/26 18:21:44] root INFO:     dataset :
[2021/05/26 18:21:44] root INFO:         data_dir : /ssd/lost+found/data/ppocr/训练数据集/TrainImages/
[2021/05/26 18:21:44] root INFO:         label_file_list : ['/ssd/lost+found/data/ppocr/训练数据集/LabelTrain.txt']
[2021/05/26 18:21:44] root INFO:         name : SimpleDataSet
[2021/05/26 18:21:44] root INFO:         transforms :
[2021/05/26 18:21:44] root INFO:             DecodeImage :
[2021/05/26 18:21:44] root INFO:                 channel_first : False
[2021/05/26 18:21:44] root INFO:                 img_mode : BGR
[2021/05/26 18:21:44] root INFO:             RecAug : None
[2021/05/26 18:21:44] root INFO:             CTCLabelEncode : None
[2021/05/26 18:21:44] root INFO:             RecResizeImg :
[2021/05/26 18:21:44] root INFO:                 image_shape : [3, 32, 320]
[2021/05/26 18:21:44] root INFO:             KeepKeys :
[2021/05/26 18:21:44] root INFO:                 keep_keys : ['image', 'label', 'length']
[2021/05/26 18:21:44] root INFO:     loader :
[2021/05/26 18:21:44] root INFO:         batch_size_per_card : 256
[2021/05/26 18:21:44] root INFO:         drop_last : True
[2021/05/26 18:21:44] root INFO:         num_workers : 0
[2021/05/26 18:21:44] root INFO:         shuffle : True
[2021/05/26 18:21:44] root INFO: train with paddle 2.0.2 and device CUDAPlace(0)
[2021/05/26 18:21:44] root INFO: Initialize indexs of datasets:['/ssd/lost+found/data/ppocr/训练数据集/LabelTrain.txt']
[2021/05/26 18:21:45] root INFO: Initialize indexs of datasets:['/ssd/lost+found/data/ppocr/训练数据集/LabelTrain.txt']
W0526 18:21:45.201925 29486 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 10.2, Runtime API Version: 10.1
W0526 18:21:45.215572 29486 device_context.cc:372] device: 0, cuDNN Version: 7.6.
/root/anaconda3/envs/pp/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:26: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  def convert_to_list(value, n, name, dtype=np.int):
/root/anaconda3/envs/pp/lib/python3.7/site-packages/skimage/morphology/skeletonize.py:241: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  0, 1, 1, 0, 0, 1, 0, 0, 0], dtype=np.bool)
/root/anaconda3/envs/pp/lib/python3.7/site-packages/skimage/morphology/_skeletonize.py:256: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=np.bool)
2021-05-26 18:21:53,201 - INFO - If regularizer of a Parameter has been set by 'paddle.ParamAttr' or 'static.WeightNormParamAttr' already. The weight_decay[L2Decay, regularization_coeff=0.000010] in Optimizer will not take effect, and it will only be applied to other Parameters!
[2021/05/26 18:21:53] root INFO: train from scratch
[2021/05/26 18:21:53] root INFO: train dataloader has 390 iters
[2021/05/26 18:21:53] root INFO: valid dataloader has 391 iters
[2021/05/26 18:21:53] root INFO: During the training process, after the 0th iteration, an evaluation is run every 2000 iterations
[2021/05/26 18:21:53] root INFO: Initialize indexs of datasets:['/ssd/lost+found/data/ppocr/训练数据集/LabelTrain.txt']
[2021/05/26 18:22:30] root INFO: epoch: [1/500], iter: 10, lr: 0.001000, loss: 613.501282, acc: 0.000000, norm_edit_dis: 0.000000, reader_cost: 1.13828 s, batch_cost: 2.78723 s, samples: 2816, ips: 101.03228
[2021/05/26 18:23:00] root INFO: epoch: [1/500], iter: 20, lr: 0.001000, loss: 491.363159, acc: 0.000000, norm_edit_dis: 0.000000, reader_cost: 0.80013 s, batch_cost: 2.13014 s, samples: 2560, ips: 120.17969
[2021/05/26 18:23:30] root INFO: epoch: [1/500], iter: 30, lr: 0.001000, loss: 249.973083, acc: 0.000000, norm_edit_dis: 0.000000, reader_cost: 0.81321 s, batch_cost: 2.21799 s, samples: 2560, ips: 115.41979
[2021/05/26 18:24:02] root INFO: epoch: [1/500], iter: 40, lr: 0.001000, loss: 86.003624, acc: 0.000000, norm_edit_dis: 0.000000, reader_cost: 0.95516 s, batch_cost: 2.33469 s, samples: 2560, ips: 109.65040
[2021/05/26 18:24:33] root INFO: epoch: [1/500], iter: 50, lr: 0.001000, loss: 48.453705, acc: 0.000000, norm_edit_dis: 0.000000, reader_cost: 0.89787 s, batch_cost: 2.22190 s, samples: 2560, ips: 115.21647
[2021/05/26 18:25:03] root INFO: epoch: [1/500], iter: 60, lr: 0.001000, loss: 44.564922, acc: 0.000000, norm_edit_dis: 0.000000, reader_cost: 0.91991 s, batch_cost: 2.25347 s, samples: 2560, ips: 113.60247
……………………
[2021/05/27 08:59:56] root INFO: epoch: [43/500], iter: 16580, lr: 0.000982, loss: 34.535393, acc: 0.003906, norm_edit_dis: 0.023424, reader_cost: 0.80111 s, batch_cost: 2.13079 s, samples: 2560, ips: 120.14346
[2021/05/27 09:00:26] root INFO: epoch: [43/500], iter: 16590, lr: 0.000982, loss: 33.818520, acc: 0.005859, norm_edit_dis: 0.022587, reader_cost: 0.90223 s, batch_cost: 2.18736 s, samples: 2560, ips: 117.03613
[2021/05/27 09:00:55] root INFO: epoch: [43/500], iter: 16600, lr: 0.000982, loss: 34.028244, acc: 0.003906, norm_edit_dis: 0.023617, reader_cost: 0.77060 s, batch_cost: 2.06542 s, samples: 2560, ips: 123.94556
[2021/05/27 09:01:25] root INFO: epoch: [43/500], iter: 16610, lr: 0.000982, loss: 34.419975, acc: 0.003906, norm_edit_dis: 0.022949, reader_cost: 0.88083 s, batch_cost: 2.18148 s, samples: 2560, ips: 117.35151
[2021/05/27 09:01:55] root INFO: epoch: [43/500], iter: 16620, lr: 0.000982, loss: 34.603676, acc: 0.003906, norm_edit_dis: 0.022959, reader_cost: 0.91563 s, batch_cost: 2.20875 s, samples: 2560, ips: 115.90263
[2021/05/27 09:02:26] root INFO: epoch: [43/500], iter: 16630, lr: 0.000982, loss: 35.354607, acc: 0.005859, norm_edit_dis: 0.022959, reader_cost: 0.89960 s, batch_cost: 2.21324 s, samples: 2560, ips: 115.66758
[2021/05/27 09:02:55] root INFO: epoch: [43/500], iter: 16640, lr: 0.000982, loss: 34.920311, acc: 0.007812, norm_edit_dis: 0.023994, reader_cost: 0.81060 s, batch_cost: 2.10501 s, samples: 2560, ips: 121.61464
[2021/05/27 09:03:26] root INFO: epoch: [43/500], iter: 16650, lr: 0.000982, loss: 34.196205, acc: 0.007812, norm_edit_dis: 0.023608, reader_cost: 1.02603 s, batch_cost: 2.33794 s, samples: 2560, ips: 109.49800
[2021/05/27 09:03:57] root INFO: epoch: [43/500], iter: 16660, lr: 0.000982, loss: 34.853157, acc: 0.003906, norm_edit_dis: 0.021517, reader_cost: 0.93489 s, batch_cost: 2.25111 s, samples: 2560, ips: 113.72191
[2021/05/27 09:04:28] root INFO: epoch: [43/500], iter: 16670, lr: 0.000982, loss: 34.970486, acc: 0.003906, norm_edit_dis: 0.021613, reader_cost: 0.95125 s, batch_cost: 2.24483 s, samples: 2560, ips: 114.03992
[2021/05/27 09:04:58] root INFO: epoch: [43/500], iter: 16680, lr: 0.000982, loss: 34.865288, acc: 0.003906, norm_edit_dis: 0.021822, reader_cost: 0.97825 s, batch_cost: 2.26213 s, samples: 2560, ips: 113.16753
[2021/05/27 09:05:31] root INFO: epoch: [43/500], iter: 16690, lr: 0.000982, loss: 34.931526, acc: 0.003906, norm_edit_dis: 0.023963, reader_cost: 1.03104 s, batch_cost: 2.39247 s, samples: 2560, ips: 107.00237
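To compare a log like this against the AI Studio run, it can help to pull the per-iteration numbers out of the text. The following is a minimal sketch using only the Python standard library; the names `parse_train_log` and `acc_in_samples` are made up for illustration and are not part of PaddleOCR. It also converts the logged acc into an approximate number of correct samples per batch: with `batch_size_per_card : 256`, an acc of 0.003906 is about 1/256, which would be consistent with roughly one fully correct sample per batch (an interpretation, not something the log states directly).

```python
import re

# Matches the per-iteration lines from the log, e.g.
# "epoch: [43/500], iter: 16580, lr: 0.000982, loss: 34.535393, acc: 0.003906, ..."
LINE_RE = re.compile(
    r"epoch: \[(?P<epoch>\d+)/\d+\], iter: (?P<iter>\d+), "
    r"lr: (?P<lr>[\d.]+), loss: (?P<loss>[\d.]+), acc: (?P<acc>[\d.]+)"
)


def parse_train_log(text):
    """Return a list of dicts (epoch, iter, lr, loss, acc) found in `text`."""
    records = []
    for m in LINE_RE.finditer(text):
        records.append({
            "epoch": int(m["epoch"]),
            "iter": int(m["iter"]),
            "lr": float(m["lr"]),
            "loss": float(m["loss"]),
            "acc": float(m["acc"]),
        })
    return records


def acc_in_samples(acc, batch_size=256):
    """Convert a logged accuracy into approximate correct samples per batch."""
    return acc * batch_size


# Two lines copied from the log in this issue (trailing fields shortened).
sample = (
    "[2021/05/26 18:22:30] root INFO: epoch: [1/500], iter: 10, lr: 0.001000, "
    "loss: 613.501282, acc: 0.000000, norm_edit_dis: 0.000000\n"
    "[2021/05/27 08:59:56] root INFO: epoch: [43/500], iter: 16580, lr: 0.000982, "
    "loss: 34.535393, acc: 0.003906, norm_edit_dis: 0.023424\n"
)

for rec in parse_train_log(sample):
    print(rec["epoch"], rec["iter"], rec["loss"],
          round(acc_in_samples(rec["acc"]), 2))
```

Running the same parser over both the server log and the AI Studio log would show at which iteration the two loss curves start to diverge.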

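I have tried five or six times and reinstalled the Paddle environment, but training never gets off the ground. I would appreciate any help. Thanks!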

LDOUBLEV commented 3 years ago

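What GPU model does your server have, and which PaddleOCR version are you using?

Apart from the paths in the code, what else did you change?

Could you package and download the code from AI Studio and try training with it on your server?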
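For answering version questions like these on both machines, a small report script can be run in each environment and the outputs compared. This is only a sketch using the Python standard library; `collect_versions` and `environment_report` are illustrative names, not PaddleOCR or Paddle functions, and the script does not detect the GPU model (that still has to be read from the driver tools).

```python
import platform
import sys

try:  # Python 3.8+
    from importlib import metadata
except ImportError:  # Python 3.7 with the importlib-metadata backport installed
    import importlib_metadata as metadata


def collect_versions(package_names):
    """Return {name: installed version, or None if the package is missing}."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def environment_report(package_names):
    """Build a plain-text report of the interpreter, OS, and package versions."""
    lines = [
        "python: " + sys.version.split()[0],
        "platform: " + platform.platform(),
    ]
    for name, version in collect_versions(package_names).items():
        lines.append("{}: {}".format(name, version or "not installed"))
    return "\n".join(lines)


# Package names taken from this thread; extend the list as needed.
print(environment_report(["paddlepaddle-gpu", "numpy", "opencv-python"]))
```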

vinceyzw commented 3 years ago

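> What GPU model does your server have, and which PaddleOCR version are you using?
>
> Apart from the paths in the code, what else did you change?
>
> Could you package and download the code from AI Studio and try training with it on your server?

The GPU is a T4. PaddleOCR version: release/2.1. I only changed the data paths.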

vinceyzw commented 3 years ago

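I just tried it: training with the code packaged and downloaded from AI Studio does not work on my server either. @LDOUBLEV

It still looks like an environment problem to me. My conda environment is as follows: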

# packages in environment at /root/anaconda3/envs/pp:
#
# Name                    Version            Build             Channel
_libgcc_mutex             0.1                main              defaults
appdirs                   1.4.4
astor                     0.8.1
Babel                     2.9.1
bce-python-sdk            0.8.60
ca-certificates           2021.4.13          h06a4308_1        defaults
certifi                   2020.12.5          py37h06a4308_0    defaults
cfgv                      3.3.0
chardet                   4.0.0
click                     8.0.1
cycler                    0.10.0
decorator                 4.4.2
distlib                   0.3.1
filelock                  3.0.12
flake8                    3.9.2
Flask                     2.0.1
Flask-Babel               2.0.0
future                    0.18.2
gast                      0.4.0
identify                  2.2.6
idna                      2.10
imageio                   2.9.0
imgaug                    0.4.0
importlib-metadata        4.0.1
itsdangerous              2.0.1
Jinja2                    3.0.1
kiwisolver                1.3.1
ld_impl_linux-64          2.33.1             h53a641e_7        defaults
libffi                    3.3                he6710b0_2        defaults
libgcc-ng                 9.1.0              hdf63c60_0        defaults
libstdcxx-ng              9.1.0              hdf63c60_0        defaults
lmdb                      1.2.1
MarkupSafe                2.0.1
matplotlib                3.4.2
mccabe                    0.6.1
ncurses                   6.2                he6710b0_1        defaults
networkx                  2.5.1
nodeenv                   1.6.0
numpy                     1.20.3
opencv-contrib-python     4.2.0.32
opencv-python             4.5.2.52
openssl                   1.1.1k             h27cfd23_0        defaults
paddlepaddle-gpu          2.0.2.post101
pandas                    1.2.4
Pillow                    8.2.0
pip                       21.1.1             py37h06a4308_0    defaults
pre-commit                2.13.0
protobuf                  3.17.1
pyclipper                 1.2.1
pycodestyle               2.7.0
pycryptodome              3.10.1
pyflakes                  2.3.1
pyparsing                 2.4.7
python                    3.7.10             hdb3f193_0        defaults
python-dateutil           2.8.1
python-Levenshtein        0.12.2
pytz                      2021.1
PyWavelets                1.1.1
PyYAML                    5.4.1
readline                  8.1                h27cfd23_0        defaults
requests                  2.25.1
scikit-image              0.17.2
scipy                     1.6.3
setuptools                52.0.0             py37h06a4308_0    defaults
Shapely                   1.7.1
shellcheck-py             0.7.2.1
six                       1.16.0
sqlite                    3.35.4             hdfb4753_0        defaults
tifffile                  2021.4.8
tk                        8.6.10             hbc83047_0        defaults
toml                      0.10.2
tqdm                      4.61.0
typing-extensions         3.10.0.0
urllib3                   1.26.4
virtualenv                20.4.7
visualdl                  2.2.0
Werkzeug                  2.0.1
wheel                     0.36.2             pyhd3eb1b0_0      defaults
xz                        5.2.5              h7b6447c_0        defaults
zipp                      3.4.1
zlib                      1.2.11             0                 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
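If the problem really is the environment, one way to narrow it down is to diff this package list against the one from the machine where training works. Below is a minimal sketch that parses `conda list` text and reports packages whose versions differ. `parse_conda_list` and `diff_envs` are illustrative helper names, and the "other" environment in the example is hypothetical, since the AI Studio package list does not appear in this thread.

```python
def parse_conda_list(text):
    """Parse `conda list` output into {lowercased package name: version}."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the "# Name Version ..." header
        parts = line.split()
        if len(parts) >= 2:
            packages[parts[0].lower()] = parts[1]
    return packages


def diff_envs(a, b):
    """Return {name: (version_in_a, version_in_b)} for packages that differ."""
    diff = {}
    for name in sorted(set(a) | set(b)):
        if a.get(name) != b.get(name):
            diff[name] = (a.get(name), b.get(name))
    return diff


# Excerpt of the server environment listed in this issue.
server_text = """\
# Name Version Build Channel
numpy 1.20.3
opencv-python 4.5.2.52
paddlepaddle-gpu 2.0.2.post101
pip 21.1.1 py37h06a4308_0 defaults
"""

# HYPOTHETICAL values for the working machine, for illustration only.
other_env = {
    "numpy": "1.19.5",
    "opencv-python": "4.5.2.52",
    "paddlepaddle-gpu": "2.0.2.post101",
}

for name, (here, there) in diff_envs(parse_conda_list(server_text), other_env).items():
    print("{}: server={} other={}".format(name, here, there))
```

Running `conda list` (or `pip freeze`, with a slightly different parser) on both machines and feeding the outputs through the same diff keeps the comparison mechanical rather than eyeballed.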

@LDOUBLEV could you please take another look?

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

PaddlePaddle / PaddleOCR

Training an OCR model: with the same config file, training works on AI Studio, but on my own server accuracy (acc) stays around 0.001 #2931
