ZhengPeng7 / GCoNet_plus

[TPAMI'23] GCoNet+: A Stronger Group Collaborative Co-Salient Object Detector.
https://huggingface.co/spaces/ZhengPeng7/GCoNet_plus_demo
MIT License

Why does an epoch stop early and go straight to the next epoch? #5

Closed IceHowe closed 2 years ago

IceHowe commented 2 years ago

Why does an epoch stop early and jump straight to the next epoch? [screenshot of the training log] I hadn't noticed this when training before. It appeared after I switched to multi-GPU training, so I assumed my multi-GPU changes were the problem, but I re-downloaded the code, trained on a single GPU, and still saw it: each epoch ends early and a new epoch starts. Is this correct, or does the original code behave this way?

ZhengPeng7 commented 2 years ago

Hi, that's caused by the different lengths of the data loaders of the DUTS_class and COCO-SEG/COCO9213 datasets. 291 is the number of image classes in the DUTS_class dataset. So you don't need to worry about it; the loading runs over both whole datasets. If you set only DUTS_class as the training set, you will see it range from 0/291 to 280/291.
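For anyone hitting the same confusion, here is a minimal, self-contained illustration (not the repo's actual training loop) of why the counter can stop short when two loaders of different lengths are iterated together; the loader lengths below are only placeholders:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Two loaders of different lengths, standing in for the DUTS_class loader
# (291 class-level batches) and a second, shorter training-set loader.
loader_a = DataLoader(TensorDataset(torch.zeros(291, 1)), batch_size=1)
loader_b = DataLoader(TensorDataset(torch.zeros(280, 1)), batch_size=1)

steps = 0
for batch_a, batch_b in zip(loader_a, loader_b):
    steps += 1

# Prints "280/291": the epoch ends as soon as the shorter loader is exhausted,
# even though the progress bar total comes from the longer one.
print(f"{steps}/{len(loader_a)}")
```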

(Your English was perfectly clear; writing it again in Chinese is more than polite. Feel free to keep commenting if you have more questions.)

IceHowe commented 2 years ago

Thanks, I had considered that but haven't had time to test it yet. Another question: no matter how I train, I can't reach the results reported in the paper; my results are below. [screenshots of the evaluation results] The first was trained on a single GPU, the second on two GPUs with DataParallel, for 2000 epochs; I also tried 350 and 500 epochs before, with similar results, all worse than the paper. The only thing I changed was the batch size: a 2080 Ti doesn't have enough memory, so the largest I can set is 6, while the paper uses 26. I later set it to 12 with DataParallel, and the results are still much worse.

ZhengPeng7 commented 2 years ago

With a 2080 Ti your batch size is less than a quarter of the original, so the performance will inevitably take a big hit... Ordinary deep-learning training is already sensitive to batch size, and the "Co" part of CoSOD in particular needs more samples to mine the consensus online, so I'd say reaching this performance with bs=6 is actually a reasonable level. As for DataParallel, it is not equivalent to enlarging the batch size in the first place; distributed DDP comes closer to that. Still, I recommend trying a larger GPU, something like the V100 used in the paper. I once counted the number of samples per class in DUTS_class: bs=32 covers about 70% of the classes, so setting it much larger is pointless (I tried an effectively unlimited batch size on an 80 GB A100 and saw no improvement), but setting it too small will definitely hurt training.
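For reference, a minimal self-contained DDP sketch in generic PyTorch (not this repo's train script; the tiny model and data are placeholders). Unlike `DataParallel`, each process here works on its own batch and gradients are all-reduced during `backward`, so the effective batch size is roughly `batch_size * world_size`:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # Launch with: torchrun --nproc_per_node=2 ddp_sketch.py
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model and data standing in for the real co-saliency model/dataset.
    model = DDP(nn.Linear(8, 1).cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    dataset = TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
    sampler = DistributedSampler(dataset)      # each rank sees a distinct shard
    loader = DataLoader(dataset, batch_size=6, sampler=sampler)

    for epoch in range(2):
        sampler.set_epoch(epoch)               # reshuffle the shards every epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = nn.functional.mse_loss(model(x), y)
            loss.backward()                    # gradients are averaged across ranks here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```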

IceHowe commented 2 years ago

OK, thanks a lot. I've restarted the multi-GPU training with DDP and will check the results later. I really don't have any better GPUs, so I'll have to make do with what I have for now. Thanks again for the patient answers.

ZhengPeng7 commented 2 years ago

No problem. Alternatively, you can accumulate the loss over several iterations before each backward pass to effectively increase the batch size; that's the simplest change to make. A random example I found: link.
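For reference, a minimal gradient-accumulation sketch in generic PyTorch (the toy model, loss, and loader below are placeholders, not this repo's code): gradients from several small batches are accumulated before a single optimizer step, which roughly emulates a larger batch size without extra GPU memory.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for the real model, loss, and data loader.
model = nn.Linear(8, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loader = DataLoader(TensorDataset(torch.randn(48, 8), torch.randn(48, 1)), batch_size=6)

accum_steps = 4                  # batch_size 6 * 4 steps ~ effective batch size 24

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = criterion(model(x), y) / accum_steps   # scale so the summed grads match one big batch
    loss.backward()                               # gradients accumulate in .grad across iterations
    if (step + 1) % accum_steps == 0:
        optimizer.step()                          # update once every `accum_steps` small batches
        optimizer.zero_grad()
```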

IceHowe commented 2 years ago


OK, thanks. Once this DDP run finishes, I'll add that and see how it works.