Open wuzuowuyou opened 4 years ago
The package versions are as follows:
_libgcc_mutex 0.1 main
anyconfig 0.9.10
blas 1.0 mkl
ca-certificates 2019.11.27 0
certifi 2019.11.28 py37_0
cffi 1.13.2 py37h2e261b9_0
Click 7.0
cycler 0.10.0
editdistance 0.5.3
gevent 1.4.0
ipython 7.11.1 py37h39e3cac_0
ipython_genutils 0.2.0 py37_0
itsdangerous 1.1.0
Jinja2 2.10.3
kiwisolver 1.1.0
libedit 3.1.20181209 hc058e9b_0
libffi 3.2.1 hd88cf55_4
libgcc-ng 9.1.0 hdf63c60_0
libgfortran-ng 7.3.0 hdf63c60_0
libpng 1.6.37 hbc83047_0
libstdcxx-ng 9.1.0 hdf63c60_0
libtiff 4.1.0 h2733197_0
MarkupSafe 1.1.1
mkl-service 2.3.0 py37he904b0f_0
mkl_fft 1.0.15 py37ha843d7b_0
mkl_random 1.1.0 py37hd6b4f25_0
munch 2.5.0
networkx 2.4
numpy 1.18.1
numpy-base 1.17.4 py37hde5b4d6_0
olefile 0.46 py_0
opencv-python 4.1.2.30
parso 0.5.2 py_0
pexpect 4.7.0 py37_0
pickleshare 0.7.5 py37_0
pillow 7.0.0 py37hb39fc2d_0
pip 19.3.1 py37_0
prompt_toolkit 3.0.2 py_0
protobuf 3.11.2
pyclipper 1.1.0.post3
pygments 2.5.2 py_0
pyparsing 2.4.6
python-dateutil 2.8.1
scikit-image 0.16.2
Shapely 1.6.4.post2
sortedcontainers 2.1.0
tensorboardX 2.0
torchvision 0.4.2 py37_cu100 pytorch
tqdm 4.41.1
wcwidth 0.1.7 py37_0
Werkzeug 0.16.0
xz 5.2.4 h14c3975_4
zlib 1.2.11 h7b6447c_3
zstd 1.3.7 h0b5b093_0
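@wuzuowuyou I think you can troubleshoot along the following lines:
1. Check that the ground truth is correct. (The labels are correct: I used the labels from your netdisk, and the images are the originals.)
2. Check whether training can fit a single image, i.e. train and test on the same image and see whether the model fits it. (It cannot.)
3. Check whether the CUDA or PyTorch version is the problem. (pytorch 1.3.1 py3.7_cuda10.0.130_cudnn7.6.3_0 pytorch)
The hardware is currently a GTX 1080 with CUDA 10 and PyTorch 1.3; the exact versions are listed above. Training on just a few images still does not converge: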
[INFO] [2020-01-21 16:04:34,219] step: 980, epoch: 980, loss: 5.026607, lr: 0.001526
[INFO] [2020-01-21 16:04:34,221] bce_loss: 0.587644
[INFO] [2020-01-21 16:04:34,221] thresh_loss: 0.911169
[INFO] [2020-01-21 16:04:34,222] l1_loss: 0.117722
[INFO] [2020-01-21 16:04:34,244] Training epoch 981
[INFO] [2020-01-21 16:04:35,283] Training epoch 982
[INFO] [2020-01-21 16:04:36,300] Training epoch 983
[INFO] [2020-01-21 16:04:37,314] Training epoch 984
[INFO] [2020-01-21 16:04:38,304] Training epoch 985
[INFO] [2020-01-21 16:04:39,337] Training epoch 986
[INFO] [2020-01-21 16:04:40,342] Training epoch 987
[INFO] [2020-01-21 16:04:41,328] Training epoch 988
[INFO] [2020-01-21 16:04:42,338] Training epoch 989
[INFO] [2020-01-21 16:04:43,341] Training epoch 990
[INFO] [2020-01-21 16:04:44,345] Training epoch 991
[INFO] [2020-01-21 16:04:45,365] Training epoch 992
[INFO] [2020-01-21 16:04:46,377] Training epoch 993
[INFO] [2020-01-21 16:04:47,380] Training epoch 994
[INFO] [2020-01-21 16:04:48,377] Training epoch 995
[INFO] [2020-01-21 16:04:49,394] Training epoch 996
[INFO] [2020-01-21 16:04:50,407] Training epoch 997
[INFO] [2020-01-21 16:04:51,419] Training epoch 998
[INFO] [2020-01-21 16:04:52,433] Training epoch 999
[INFO] [2020-01-21 16:04:53,451] Training epoch 1000
[INFO] [2020-01-21 16:04:54,537] step: 1000, epoch: 1000, loss: 4.633596, lr: 0.001401
The predictions also come out completely empty. The dataset is ic15.
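For anyone comparing runs in this thread, it can help to pull the step/loss pairs out of the log instead of eyeballing them. The following is only a rough sketch: the regular expression is written against the log lines quoted in this issue, and the helper names `parse_steps` and `mean_loss` are made up for illustration, not part of the repository.

```python
import re

# Matches summary lines of the form quoted in this thread, e.g.
# [INFO] [...] step: 980, epoch: 980, loss: 5.026607, lr: 0.001526
# Lines such as "bce_loss: ..." are skipped because they lack the "step:" prefix.
STEP_RE = re.compile(
    r"step:\s*(\d+),\s*epoch:\s*(\d+),\s*loss:\s*([0-9.]+),\s*lr:\s*([0-9.]+)"
)


def parse_steps(log_text):
    """Return a list of (step, epoch, loss, lr) tuples found in the log text."""
    return [
        (int(step), int(epoch), float(loss), float(lr))
        for step, epoch, loss, lr in STEP_RE.findall(log_text)
    ]


def mean_loss(records):
    """Average the total loss over the parsed records."""
    return sum(record[2] for record in records) / len(records)


# Three lines copied from the log above.
log_text = """
[INFO] [2020-01-21 16:04:34,219] step: 980, epoch: 980, loss: 5.026607, lr: 0.001526
[INFO] [2020-01-21 16:04:34,221] bce_loss: 0.587644
[INFO] [2020-01-21 16:04:54,537] step: 1000, epoch: 1000, loss: 4.633596, lr: 0.001401
"""

records = parse_steps(log_text)
for record in records:
    print(record)
print("mean loss:", mean_loss(records))
```

Running the same parser over a whole log file makes it easier to tell a slow downward trend from a loss that is only bouncing around.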
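Has this been solved? I ran into a similar problem.
I can't get it to converge either.
I'm hitting the same problem. Has it been solved?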
The learning rate also has to be reduced by a factor of 4 to match, and the change has to be made in training/learning.py.
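The arithmetic behind that suggestion can be sketched as follows. This assumes the usual rule of thumb of scaling the learning rate in proportion to the batch size. The base learning rate 0.007 and the batch size 4 come from the config posted in this issue; the reference batch size of 16 is an assumption that is only implied by the "factor of 4", and `scale_lr` is a hypothetical helper, not a function from the repository.

```python
def scale_lr(base_lr, base_batch_size, new_batch_size):
    """Scale the learning rate in proportion to the batch size."""
    return base_lr * new_batch_size / base_batch_size


# 0.007 and batch size 4 are from the posted config.
# 16 is an ASSUMED reference batch size (implied by "factor of 4").
new_lr = scale_lr(base_lr=0.007, base_batch_size=16, new_batch_size=4)
print(new_lr)
```

Whether the scaled value actually fixes convergence here is not confirmed anywhere in this thread; it is only the number the comment above points to.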
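Hi OP, I ran into the same problem. After 328 epochs my loss converged to about 1.2, and bce_loss, thresh_loss, and l1_loss all converged well during training. But at test time nothing is detected at all, whereas the model from GitHub detects plenty of text. Very frustrating. My labels are not wrong.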
[INFO] [2020-05-15 08:29:36,398] step: 85050, epoch: 328, loss: 1.249597, lr: 0.005253
[INFO] [2020-05-15 08:29:36,500] bce_loss: 0.149779
[INFO] [2020-05-15 08:29:36,599] thresh_loss: 0.132163
[INFO] [2020-05-15 08:29:36,601] l1_loss: 0.036854
OP, did you solve this? If so, how?
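One thing worth knowing when reading these numbers: the logged total appears to be a weighted sum of the three parts. The weights below (5 for bce_loss, 1 for thresh_loss, 10 for l1_loss) are an assumption obtained by fitting the values printed in this thread, not something confirmed from the code, and `combined_loss` is a hypothetical helper. A minimal check against three log entries quoted in this issue:

```python
# ASSUMED weights, inferred from the logged numbers in this thread.
BCE_WEIGHT = 5.0
L1_WEIGHT = 10.0


def combined_loss(bce_loss, thresh_loss, l1_loss):
    """Weighted sum that reproduces the logged total under the assumed weights."""
    return BCE_WEIGHT * bce_loss + thresh_loss + L1_WEIGHT * l1_loss


# (logged total, bce_loss, thresh_loss, l1_loss), copied from logs in this issue.
logged = [
    (1.249597, 0.149779, 0.132163, 0.036854),  # step 85050, epoch 328
    (5.026607, 0.587644, 0.911169, 0.117722),  # step 980, epoch 980
    (4.161959, 0.530029, 0.627000, 0.088481),  # step 13840, epoch 55
]

for total, bce, thresh, l1 in logged:
    rebuilt = combined_loss(bce, thresh, l1)
    print("logged %.6f  rebuilt %.6f" % (total, rebuilt))
```

If the assumed weights are right, a total of about 1.25 is mostly the weighted bce term (5 x 0.149779 is roughly 0.75), so the total on its own says little about how good the predicted maps are.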
Your loss is only a little over 1, which looks fine to me. Try lowering the threshold at test time.
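To illustrate why the threshold matters, here is a toy sketch. The probability map and the threshold values are made up for illustration and are not the repository's defaults; `binarize` and `count_foreground` are hypothetical helpers. The point is only that a model whose probabilities stay low yields an empty mask at a high threshold and a non-empty one at a lower threshold.

```python
def binarize(prob_map, thresh):
    """Return a 0/1 mask marking pixels whose probability exceeds thresh."""
    return [[1 if p > thresh else 0 for p in row] for row in prob_map]


def count_foreground(prob_map, thresh):
    """Count the pixels that survive binarization at the given threshold."""
    return sum(sum(row) for row in binarize(prob_map, thresh))


# Toy probability map standing in for the network output (made-up values).
prob_map = [
    [0.05, 0.10, 0.08, 0.02],
    [0.12, 0.28, 0.25, 0.04],
    [0.09, 0.26, 0.22, 0.03],
    [0.01, 0.06, 0.07, 0.02],
]

for thresh in (0.3, 0.2, 0.1):
    print("thresh", thresh, "foreground pixels:", count_foreground(prob_map, thresh))
```

If lowering the threshold brings detections back, the model is under-confident rather than untrained. If the mask stays empty at every threshold, the problem is more likely in the weights or in how the checkpoint is loaded.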
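OP, did you solve this? My ic15 training doesn't converge either, and the results are a mess. The loss stays around 3.3 and won't go down. PS: no pretraining.
I've run into this too = =
Following this issue; I don't know how it was solved.
Has anyone solved this? How are the results without pretraining?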
Hello author, and thank you for open-sourcing this. Because my card is a GTX 1080 with 8 GB of memory, the main changes I made in ic15_resnet50_deform_thre.yaml were batch_size (4) and num_workers (4), so that training fits on an 8 GB card. As follows:

    train:
        class: TrainSettings
        data_loader:
            class: DataLoader
            dataset: ^train_data
            batch_size: 4
            num_workers: 4
        checkpoint:
            class: Checkpoint
            start_epoch: 0
            start_iter: 0
            resume: null
        model_saver:
            class: ModelSaver
            dir_path: model
            save_interval: 1000
            signal_path: save
        scheduler:
            class: OptimizerScheduler
            optimizer: "SGD"
            optimizer_args:
                lr: 0.007
                momentum: 0.9
                weight_decay: 0.0001
            learning_rate:
                class: DecayLearningRate
                epochs: 1200
        epochs: 1200
The training command is:

    CUDA_VISIBLE_DEVICES=0 python train.py /data_1/DB/20200120/DB-master/experiments/seg_detector/ic15_resnet50_deform_thre.yaml --num_gpus 1 --resume /data_1/0project/DB/20200120/0121_2/DB-master/myfile/download/models/pre-trained-model-synthtext
After several hours of training it still does not converge. The log is as follows:
[INFO] [2020-01-21 14:33:49,926] step: 13840, epoch: 55, loss: 4.161959, lr: 0.006711
[INFO] [2020-01-21 14:33:49,927] bce_loss: 0.530029
[INFO] [2020-01-21 14:33:49,927] thresh_loss: 0.627000
[INFO] [2020-01-21 14:33:49,928] l1_loss: 0.088481
[INFO] [2020-01-21 14:34:03,791] step: 13860, epoch: 55, loss: 3.328142, lr: 0.006711
[INFO] [2020-01-21 14:34:03,792] bce_loss: 0.436986
[INFO] [2020-01-21 14:34:03,792] thresh_loss: 0.378803
[INFO] [2020-01-21 14:34:03,793] l1_loss: 0.076441
[INFO] [2020-01-21 14:34:17,788] step: 13880, epoch: 55, loss: 7.257392, lr: 0.006711
[INFO] [2020-01-21 14:34:17,789] bce_loss: 1.107980
[INFO] [2020-01-21 14:34:17,789] thresh_loss: 0.748738
[INFO] [2020-01-21 14:34:17,789] l1_loss: 0.096875
[INFO] [2020-01-21 14:34:31,381] step: 13900, epoch: 55, loss: 3.780116, lr: 0.006711
[INFO] [2020-01-21 14:34:31,382] bce_loss: 0.469818
[INFO] [2020-01-21 14:34:31,382] thresh_loss: 0.512659
[INFO] [2020-01-21 14:34:31,383] l1_loss: 0.091837
[INFO] [2020-01-21 14:34:44,875] step: 13920, epoch: 55, loss: 4.576564, lr: 0.006711
[INFO] [2020-01-21 14:34:44,876] bce_loss: 0.556754
[INFO] [2020-01-21 14:34:44,877] thresh_loss: 0.768575
[INFO] [2020-01-21 14:34:44,877] l1_loss: 0.102422
[INFO] [2020-01-21 14:34:58,382] step: 13940, epoch: 55, loss: 4.240158, lr: 0.006711
[INFO] [2020-01-21 14:34:58,383] bce_loss: 0.516656
[INFO] [2020-01-21 14:34:58,383] thresh_loss: 0.712895
[INFO] [2020-01-21 14:34:58,384] l1_loss: 0.094398
[INFO] [2020-01-21 14:35:12,208] step: 13960, epoch: 55, loss: 4.540379, lr: 0.006711
[INFO] [2020-01-21 14:35:12,209] bce_loss: 0.552384
[INFO] [2020-01-21 14:35:12,210] thresh_loss: 0.881431
[INFO] [2020-01-21 14:35:12,210] l1_loss: 0.089703
[INFO] [2020-01-21 14:35:25,857] step: 13980, epoch: 55, loss: 3.018045, lr: 0.006711
[INFO] [2020-01-21 14:35:25,857] bce_loss: 0.397264
[INFO] [2020-01-21 14:35:25,858] thresh_loss: 0.280025
[INFO] [2020-01-21 14:35:25,858] l1_loss: 0.075170
[INFO] [2020-01-21 14:35:38,653] Training epoch 56 0
[INFO] [2020-01-21 14:35:40,025] step: 14000, epoch: 56, loss: 3.810111, lr: 0.006706
[INFO] [2020-01-21 14:35:40,026] bce_loss: 0.476753
[INFO] [2020-01-21 14:35:40,026] thresh_loss: 0.509933
[INFO] [2020-01-21 14:35:40,027] l1_loss: 0.091641
[INFO] [2020-01-21 14:35:54,017] step: 14020, epoch: 56, loss: 3.557333, lr: 0.006706
[INFO] [2020-01-21 14:35:54,018] bce_loss: 0.459844
[INFO] [2020-01-21 14:35:54,018] thresh_loss: 0.417887
[INFO] [2020-01-21 14:35:54,019] l1_loss: 0.084023
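Is the training time too short? But I trained overnight, and the next day the loss was still around 4. Is the non-convergence caused by the batch size being too small? I haven't tested that, since it is hard to experiment without a multi-GPU machine. I have been stuck on this for many days and have read nearly every issue here. Still stuck, and looking forward to your reply. Thanks!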
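As a side note on the lr column in these logs: the printed values are consistent with a polynomial decay driven by the epoch number, starting from the configured lr of 0.007 over 1200 epochs. The exponent 0.9 and the `+ 1` in the denominator below are assumptions chosen because they reproduce the logged values, not details confirmed from the code, and `decayed_lr` is a hypothetical helper.

```python
def decayed_lr(base_lr, epoch, total_epochs, power=0.9):
    """Polynomial decay of the learning rate by epoch.

    power=0.9 and the "+ 1" in the denominator are ASSUMED; they were picked
    because they match the lr values printed in the logs of this thread.
    """
    return base_lr * (1.0 - epoch / float(total_epochs + 1)) ** power


# Epochs for which an lr value is printed somewhere in this issue.
for epoch in (55, 56, 328, 980, 1000):
    print("epoch %4d  lr %.6f" % (epoch, decayed_lr(0.007, epoch, 1200)))
```

Under this schedule the learning rate at epoch 55 is still close to the base value, so changing `lr` under `optimizer_args` in the config moves every logged lr proportionally.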