ZhengPeng7 / GCoNet_plus

[TPAMI'23] GCoNet+: A Stronger Group Collaborative Co-Salient Object Detector.
https://huggingface.co/spaces/ZhengPeng7/GCoNet_plus_demo
MIT License

Why does an epoch stop early and go straight to the next epoch? #5

Closed IceHowe closed 2 years ago

IceHowe commented 2 years ago

Why does an epoch stop early and jump straight to the next epoch? [screenshot of the training log] I hadn't noticed this when training before. It appeared after I switched to multi-GPU training, so I assumed my multi-GPU changes were the problem, but I re-downloaded the code, trained on a single GPU, and still saw it: each epoch ends early and a new epoch starts. Is this correct, or does the original code behave this way?

ZhengPeng7 commented 2 years ago

Hi, that's caused by the different lengths of the data loaders of the DUTS_class and COCO-SEG/COCO9213 datasets. 291 is the number of image classes in the DUTS_class dataset. So you don't need to worry about it; the loading runs over both whole datasets. If you set only DUTS_class as the training set, you will see it range from 0/291 to 280/291.
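For anyone hitting the same confusion, here is a minimal, self-contained illustration (not the repo's actual training loop) of why the counter can stop short when two loaders of different lengths are iterated together; the loader lengths below are only placeholders:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Two loaders of different lengths, standing in for the DUTS_class loader
# (291 class-level batches) and a second, shorter training-set loader.
loader_a = DataLoader(TensorDataset(torch.zeros(291, 1)), batch_size=1)
loader_b = DataLoader(TensorDataset(torch.zeros(280, 1)), batch_size=1)

steps = 0
for batch_a, batch_b in zip(loader_a, loader_b):
    steps += 1

# Prints "280/291": the epoch ends as soon as the shorter loader is exhausted,
# even though the progress bar total comes from the longer one.
print(f"{steps}/{len(loader_a)}")
```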

(Your English was perfectly clear; writing it again in Chinese is more than polite. Feel free to keep commenting if you have more questions.)

IceHowe commented 2 years ago

Thanks, I had considered that but haven't had time to test it yet. Another question: no matter how I train, I can't reach the results reported in the paper; my results are below. [screenshots of the evaluation results] The first was trained on a single GPU, the second on two GPUs with DataParallel, for 2000 epochs; I also tried 350 and 500 epochs before, with similar results, all worse than the paper. The only thing I changed was the batch size: a 2080 Ti doesn't have enough memory, so the largest I can set is 6, while the paper uses 26. I later set it to 12 with DataParallel, and the results are still much worse.

ZhengPeng7 commented 2 years ago

With a 2080 Ti your batch size is less than a quarter of the original, so the performance will inevitably take a big hit... Ordinary deep-learning training is already sensitive to batch size, and the "Co" part of CoSOD in particular needs more samples to mine the consensus online, so I'd say reaching this performance with bs=6 is actually a reasonable level. As for DataParallel, it is not equivalent to enlarging the batch size in the first place; distributed DDP comes closer to that. Still, I recommend trying a larger GPU, something like the V100 used in the paper. I once counted the number of samples per class in DUTS_class: bs=32 covers about 70% of the classes, so setting it much larger is pointless (I tried an effectively unlimited batch size on an 80 GB A100 and saw no improvement), but setting it too small will definitely hurt training.
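For reference, a minimal self-contained DDP sketch in generic PyTorch (not this repo's train script; the tiny model and data are placeholders). Unlike `DataParallel`, each process here works on its own batch and gradients are all-reduced during `backward`, so the effective batch size is roughly `batch_size * world_size`:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # Launch with: torchrun --nproc_per_node=2 ddp_sketch.py
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model and data standing in for the real co-saliency model/dataset.
    model = DDP(nn.Linear(8, 1).cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    dataset = TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
    sampler = DistributedSampler(dataset)      # each rank sees a distinct shard
    loader = DataLoader(dataset, batch_size=6, sampler=sampler)

    for epoch in range(2):
        sampler.set_epoch(epoch)               # reshuffle the shards every epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = nn.functional.mse_loss(model(x), y)
            loss.backward()                    # gradients are averaged across ranks here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```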

IceHowe commented 2 years ago

OK, thanks a lot. I've restarted the multi-GPU training with DDP and will check the results later. I really don't have any better GPUs, so I'll have to make do with what I have for now. Thanks again for the patient answers.

ZhengPeng7 commented 2 years ago

No problem. Alternatively, you can accumulate the loss over several iterations before each backward pass to effectively increase the batch size; that's the simplest change to make. A random example I found: link.
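For reference, a minimal gradient-accumulation sketch in generic PyTorch (the toy model, loss, and loader below are placeholders, not this repo's code): gradients from several small batches are accumulated before a single optimizer step, which roughly emulates a larger batch size without extra GPU memory.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for the real model, loss, and data loader.
model = nn.Linear(8, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loader = DataLoader(TensorDataset(torch.randn(48, 8), torch.randn(48, 1)), batch_size=6)

accum_steps = 4                  # batch_size 6 * 4 steps ~ effective batch size 24

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = criterion(model(x), y) / accum_steps   # scale so the summed grads match one big batch
    loss.backward()                               # gradients accumulate in .grad across iterations
    if (step + 1) % accum_steps == 0:
        optimizer.step()                          # update once every `accum_steps` small batches
        optimizer.zero_grad()
```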

IceHowe commented 2 years ago


OK, thanks. Once this DDP run finishes, I'll add that and see how it works.