strange loss curve - GitHub Issues

argman commented 7 years ago

Thanks for the clean and elegant code! I tried to run training from scratch (using the vgg_16 model pretrained on ImageNet), but the training process looks weird.

Total loss: [image: qq 20170907234749]

And the corresponding curves for the other losses: [image: qq 20170907234947]

The loss quickly converged to a bit above 10. I tested the model, but no text boxes are detected. How can I diagnose this?

BowieHsu commented 7 years ago

@argman Have you converted the checkpoints from the VGG16 FC-reduced caffemodel? I used the converted checkpoints and trained from scratch on ICDAR2015, and it shows good results. The loss should converge to about 2.0. See #4 to download my checkpoints.

argman commented 7 years ago

@BowieHsu, thanks! I will try it and post my results here.

argman commented 7 years ago

@BowieHsu, btw, can you share your trained model? I am using tf-1.3, so I need to check whether changes in tf are affecting the result.

argman commented 7 years ago

@BowieHsu, after 6 hours of training on 4 GPUs, the loss curve is: [image: snp20170911184828296]

argman commented 7 years ago

@BowieHsu, thanks for your model, I can get meaningful results now! The model is really hard to train...

BowieHsu commented 7 years ago

Haha, that's really good news.

JiasiWang commented 7 years ago

@BowieHsu, hi, I used the converted checkpoints and trained from scratch on ICDAR2015, but I got a bad result. I set the learning rate in the json file like this: "max_steps": 90000, "base_lr": 1e-4, "lr_breakpoints": [10000, 20000, 60000, 75000, 90000], "lr_decay": [0.64, 0.8, 1.0, 0.1, 0.01]. I guess the base_lr may be too small, or something else is wrong. Could you please share your training strategy and your good results? Thank you so much!
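For reference, the schedule implied by this config can be worked out by hand. The sketch below assumes that each entry of lr_decay is a multiplier on base_lr that applies until the matching entry of lr_breakpoints is reached. That reading is a guess from the key names and has not been checked against seglink's solver, and `lr_at_step` is a hypothetical helper.

```python
from bisect import bisect_right

# Values copied from the config quoted in this comment.
BASE_LR = 1e-4
LR_BREAKPOINTS = [10000, 20000, 60000, 75000, 90000]
LR_DECAY = [0.64, 0.8, 1.0, 0.1, 0.01]


def lr_at_step(step, base_lr=BASE_LR, breakpoints=LR_BREAKPOINTS, decay=LR_DECAY):
    """Learning rate at a given training step (hypothetical helper).

    ASSUMPTION: decay[i] scales base_lr for all steps below breakpoints[i].
    """
    # Number of breakpoints <= step, i.e. index of the first breakpoint > step.
    idx = bisect_right(breakpoints, step)
    # Steps at or past the last breakpoint reuse the last multiplier.
    idx = min(idx, len(decay) - 1)
    return base_lr * decay[idx]


if __name__ == "__main__":
    for step in (0, 10000, 20000, 60000, 75000):
        print(step, lr_at_step(step))
```

Under that assumption, the peak rate with this config is 1e-4 (steps 20000 to 59999), and the first 10000 steps run at 6.4e-5.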

BowieHsu commented 7 years ago

@JiasiWang Hi Wang, I also trained the model with the default pretrain.json, which shows good results. What is your batch size? You could also check the loss values using TensorBoard.

JiasiWang commented 7 years ago

@BowieHsu, I did not change the batch size; it is 32. I only changed base_lr to 1e-4. I will check it, thanks.

BowieHsu commented 7 years ago

@JiasiWang Yep, the default learning rate should be 5e-4.

BowieHsu commented 7 years ago

@JiasiWang By the way, the ICDAR2015 seglink model should be pretrained on the SynthText dataset first and then finetuned on the ICDAR2015 training set if you want to reach 75% Hmean.

JiasiWang commented 7 years ago

@BowieHsu Yeah, I know that the seglink model needs to be pretrained on the SynthText dataset; without pretraining I only get 58% Hmean. After that, I also pretrained the model as described in the paper and then finetuned it. I used the default json file for both steps, but the loss does not seem to converge in the finetuning step.

Godricly commented 7 years ago

May I ask how to use your model? I am not familiar with TensorFlow. I tried to load it in TensorFlow 1.4, but I got the following error. I searched for a while, but no solution worked for me.

I tried the following solutions:

changed seglink/solver.py to use

model_loader.restore(sess, './data/VGG_ILSVRC_16_layers_ssd/VGG_ILSVRC_16_layers_ssd.ckpt.data-00000-of-00001')

set up a folder named VGG_ILSVRC_16_layers_ssd and passed its path in the json
set the finetune_model value to VGG_ILSVRC_16_layers_ssd.ckpt, which is a copy of VGG_ILSVRC_16_layers_ssd.ckpt.data-00000-of-00001

Error log:

seglink/data/VGG_ILSVRC_16_layers_ssd/VGG_ILSVRC_16_layers_ssd.ckpt.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?

BowieHsu commented 7 years ago

try "model_loader.restore(sess, './data/VGG_ILSVRC_16_layers_ssd/VGG_ILSVRC_16_layers_ssd.ckpt')" @Godricly
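This works because a TensorFlow checkpoint in the V2 format is a set of files sharing one prefix (an .index file plus one or more .data-NNNNN-of-NNNNN shards), and Saver.restore expects that prefix rather than any single file. A small sketch of a helper that recovers the prefix from whichever file name is at hand; `checkpoint_prefix` is a hypothetical name, not part of seglink or TensorFlow:

```python
import re

# Suffixes that sit on disk next to a checkpoint prefix:
# ".data-00000-of-00001" shards, ".index", and the optional ".meta" graph file.
_CKPT_SUFFIX = re.compile(r"\.(data-\d{5}-of-\d{5}|index|meta)$")


def checkpoint_prefix(path):
    """Strip a per-file checkpoint suffix, leaving the prefix for Saver.restore."""
    return _CKPT_SUFFIX.sub("", path)


if __name__ == "__main__":
    shard = ("./data/VGG_ILSVRC_16_layers_ssd/"
             "VGG_ILSVRC_16_layers_ssd.ckpt.data-00000-of-00001")
    print(checkpoint_prefix(shard))
```

A path that is already a bare prefix is returned unchanged, so the helper is safe to apply to any value taken from the json config.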

Godricly commented 7 years ago

Many thanks! That saved my ass. :+1:

BowieHsu commented 7 years ago

@Godricly You're welcome, my friend.

happycoding1996 commented 6 years ago

@BowieHsu How can I use your pretrained model to skip the pretrain step? Does exp/sgd/checkpoint hold the models saved during pretraining? When I put your model in there, it says the format is wrong.

BowieHsu commented 6 years ago

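@tianzhuotao The pretrain json file is for training the model on the SynthText dataset. If you don't want to train that model and instead want to train directly on ICDAR2015:

1. Change checkpoint_path in exp/sgd/finetune_ic15.json to the location where you put the VGG model.
2. Run ./manager train exp/sgd finetune_ic15 and that's it.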

happycoding1996 commented 6 years ago

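@BowieHsu The finetune json file only has a finetune_model entry. It seems a checkpoint file needs to exist in exp/sgd, but I don't have one because I did not run pretraining. Your model also seems to contain only 3 files. How can I solve this?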

BowieHsu commented 6 years ago

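You can see two lines in the finetune.json file: "resume": "finetune", "finetune_model": "../exp/sgd/checkpoint". Replace /exp/sgd/checkpoint there with the location of the converted checkpoint I provided. Watch the log output: if TensorFlow finds the checkpoint but still reports an error, that is because resume is set to finetune and some variables do not exist in the VGG model, so you may also need to change "resume": "finetune" to "resume": "vgg16". Give it a try first.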
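A minimal sketch of that edit done programmatically. It assumes the config is plain JSON without comments and uses only the two keys quoted in this comment; `point_config_at_vgg` and the checkpoint path are hypothetical placeholders, not part of seglink.

```python
import json


def point_config_at_vgg(config_text, ckpt_prefix, resume="vgg16"):
    """Return config JSON text with "finetune_model" and "resume" updated.

    ASSUMPTION: the file is plain JSON and uses the two keys quoted above.
    """
    config = json.loads(config_text)
    config["finetune_model"] = ckpt_prefix  # location of the converted checkpoint
    config["resume"] = resume               # "vgg16" is the value suggested above
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    original = '{"resume": "finetune", "finetune_model": "../exp/sgd/checkpoint"}'
    # Placeholder path: substitute wherever the converted checkpoint actually lives.
    print(point_config_at_vgg(original, "../path/to/VGG_ILSVRC_16_layers_ssd.ckpt"))
```

All other keys in the file pass through untouched, so the same function can be applied to a full config read from disk.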

happycoding1996 commented 6 years ago

@BowieHsu Thank you so much, bless you! I also solved a few other problems (GPU stuff...) and it is finally running.

BowieHsu commented 6 years ago

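@tianzhuotao Keep an eye on the training loss. If you finetune directly from the VGG model, you need to adjust the learning rate. Just tune the hyperparameters gradually; of course you will also need to hack the code to fit your actual task. Good luck.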

happycoding1996 commented 6 years ago

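@BowieHsu Thanks! I am currently using the default parameters, but training is very slow: 6% after 7 hours, which feels really slow qwq. Roughly how long did your training take? My cluster allocation is a 16-core CPU, 1 GPU, 32 GB of RAM and a 10 GB disk.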
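As a rough check on those numbers, 6% in 7 hours extrapolates to about 117 hours in total. The sketch below assumes throughput stays constant for the rest of the run, which may not hold; `estimate_total_hours` is a hypothetical helper.

```python
def estimate_total_hours(elapsed_hours, fraction_done):
    """Linear extrapolation of total run time from progress so far.

    ASSUMPTION: training speed stays constant for the rest of the run.
    """
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_hours / fraction_done


if __name__ == "__main__":
    total = estimate_total_hours(7, 0.06)
    print(f"total: {total:.1f} h, remaining: {total - 7:.1f} h")
```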

19931991 commented 6 years ago

Hi, I have also been working on multi-oriented text detection recently. Could we connect on QQ to discuss?

@tianzhuotao @BowieHsu

13230380356 commented 6 years ago

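Hi, convert_caffemodel_to_ckpt.py contains import model_vgg16. How do I install this model_vgg16, and where should it go? Also, running run.sh fails with a caffe error. People online say it is a Python version issue and that I need to switch to python2.7, but your description says you use python3. Could you help clear this up?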

ZimingLu commented 6 years ago

@13230380356 I just solved the pretrain problem. See the tips I just wrote in #13 for details.

HardSoft2023 commented 6 years ago

try "model_loader.restore(sess, './data/VGG_ILSVRC_16_layers_ssd/VGG_ILSVRC_16_layers_ssd.ckpt')" @Godricly

Everything is OK until:

2018-11-23 04:53:37,597 [INFO ] Restoring parameters from ../premodel/ILVSR_VGG_16_FC_REDUCED/VGG_ILSVRC_16_layers_ssd.ckpt
Segmentation fault (core dumped)

How do I debug a Segmentation fault (core dumped)? Every comment is welcome.

Shualite commented 5 years ago

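@BowieHsu @JiasiWang I built the tf files from the 40 GB SynthText dataset and pretrained for 90000 steps. The default "finetune_model": "../exp/sgd/checkpoint" in finetune_ic15.json did not work, so I changed it to "finetune_model": "../exp/sgd/checkpoint-90000". I then trained for another 10000 steps. The result on the ic15 test set is only:

Recall 59.56%, Precision 63.47%, Hmean 61.45%

Why doesn't it reach 75%? Hoping for a reply, thanks a lot!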
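Hmean in the ICDAR evaluation is the harmonic mean of recall and precision, so the three numbers reported in this comment can be cross-checked against each other. A minimal sketch (`hmean` is a hypothetical helper, not taken from the evaluation script):

```python
def hmean(recall, precision):
    """Harmonic mean of recall and precision, in the units of the inputs."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


if __name__ == "__main__":
    # Recall and precision as reported above, in percent.
    print(round(hmean(59.56, 63.47), 2))
```

The reported 61.45% is consistent with the reported recall and precision, so the gap to 75% comes from the detections themselves rather than from how the score was computed.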

Shualite commented 5 years ago

After changing the batch size to 32, Hmean is still around 61%.

Shualite commented 5 years ago

Testing with the pretrained model, without finetuning, gives an Hmean of 49%.

gzpyunduan commented 3 years ago

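"Testing with the pretrained model, without finetuning, gives an Hmean of 49%."

I get exactly the same result as you, and so far I don't know how to improve it.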

bgshih / seglink

strange loss curve #3