yumath closed this issue 7 months ago.
In short: training was configured for 500k iters, but it stopped after only 135k. Is this normal?
Do I need to increase dataset_enlarge_ratio?
https://github.com/XPixelGroup/HAT/issues/26#issuecomment-1288154862 suggests setting a small batch size (https://github.com/XPixelGroup/HAT/blob/1b22ba0aff82d9d041f5bfa763f82649e6c23d99/options/train/train_HAT_SRx2_from_scratch.yml#L26). Should I set batch_size to 1?
OK, I found the bug. https://github.com/XPixelGroup/HAT#how-to-train says to launch hat/train.py with distributed training, but I had not, so opt['world_size'] was set to 1. That caused a mismatch between iters and epochs, resulting in insufficient training.
The options file specifies 500k iters. With a batch size of 4, basicsr converts the total iter count into a number of epochs for training: https://github.com/XPixelGroup/BasicSR/blob/033cd6896d898fdd3dcda32e3102a792efa1b8f4/basicsr/train.py#L48
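The conversion at that line of train.py can be sketched roughly like this (a simplified reconstruction, not the exact BasicSR code; the function name `epochs_for` is mine):

```python
import math

def epochs_for(total_iters, num_images, enlarge_ratio, batch_per_gpu, world_size):
    # Iterations per epoch shrink as world_size (GPU count) grows, so the same
    # total_iters maps to a different epoch count for single- vs multi-GPU runs.
    iters_per_epoch = math.ceil(num_images * enlarge_ratio / (batch_per_gpu * world_size))
    total_epochs = math.ceil(total_iters / iters_per_epoch)
    return iters_per_epoch, total_epochs

# DF2K numbers from the training statistics below, with world_size = 1
# (i.e. the non-distributed launch in my case):
print(epochs_for(500_000, 144_147, 1, 4, 1))  # → (36037, 14)
```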
Taking the DF2K dataset as an example:
Training statistics:
    Number of train images: 144147
    Dataset enlarge ratio: 1
    Batch size per gpu: 4
    World size (gpu number): 1
    Require iter number per epoch: 36037
    Total epochs: 14; iters: 500000.
But after 14 epochs, the iter count had only reached 135,100, with eta: 3 days, far short of 500k iters:
[train..][epoch: 14, iter: 135,000, lr:(2.000e-04,)] [eta: 3 days, 16:37:09, time (data): 0.863 (0.002)] l_pix: 1.3308e-02
[train..][epoch: 14, iter: 135,100, lr:(2.000e-04,)] [eta: 3 days, 16:35:37, time (data): 0.861 (0.002)] l_pix: 1.2130e-02
End of training. Time consumed: xxx
Save the latest model.
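A quick sanity check on the numbers in the log above shows how large the mismatch is (plain arithmetic on the logged values, nothing from the code):

```python
expected_per_epoch = 36037   # "Require iter number per epoch" from the statistics
observed_iters = 135_100     # final iter count after 14 epochs

print(observed_iters / 14)                   # ≈ 9650 iters actually run per epoch
print(observed_iters / expected_per_epoch)   # ≈ 3.75 epochs' worth of the expected work
```

So each epoch only ran roughly a quarter of the expected iterations, which is why 14 epochs ended at ~135k instead of ~500k.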
In my case, did training actually complete, or did it stop short of 500k iters? The results I am getting now also differ greatly from those reported in the paper.