yumath closed this issue 7 months ago.
In short: training was configured for 500k iters, but it stopped after only 135k. Is this normal?
Do I need to increase dataset_enlarge_ratio?
https://github.com/XPixelGroup/HAT/issues/26#issuecomment-1288154862 suggests setting a small batch size (https://github.com/XPixelGroup/HAT/blob/1b22ba0aff82d9d041f5bfa763f82649e6c23d99/options/train/train_HAT_SRx2_from_scratch.yml#L26). Should I set batch_size to 1?
OK, I found the bug. https://github.com/XPixelGroup/HAT#how-to-train says to launch hat/train.py with distributed training, but I had not, so opt['world_size'] was set to 1. That caused a mismatch between iters and epochs, resulting in insufficient training.
The options file specifies 500k iters. With a batch size of 4, basicsr converts the total iter count into a number of epochs for training: https://github.com/XPixelGroup/BasicSR/blob/033cd6896d898fdd3dcda32e3102a792efa1b8f4/basicsr/train.py#L48
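The conversion at that line of train.py can be sketched roughly like this (a simplified reconstruction, not the exact BasicSR code; the function name `epochs_for` is mine):

```python
import math

def epochs_for(total_iters, num_images, enlarge_ratio, batch_per_gpu, world_size):
    # Iterations per epoch shrink as world_size (GPU count) grows, so the same
    # total_iters maps to a different epoch count for single- vs multi-GPU runs.
    iters_per_epoch = math.ceil(num_images * enlarge_ratio / (batch_per_gpu * world_size))
    total_epochs = math.ceil(total_iters / iters_per_epoch)
    return iters_per_epoch, total_epochs

# DF2K numbers from the training statistics below, with world_size = 1
# (i.e. the non-distributed launch in my case):
print(epochs_for(500_000, 144_147, 1, 4, 1))  # → (36037, 14)
```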
Taking the DF2K dataset as an example:
Training statistics:
    Number of train images: 144147
    Dataset enlarge ratio: 1
    Batch size per gpu: 4
    World size (gpu number): 1
    Require iter number per epoch: 36037
    Total epochs: 14; iters: 500000.
But after 14 epochs, the iter count had only reached 135,100, with eta: 3 days, far short of 500k iters:
[train..][epoch: 14, iter: 135,000, lr:(2.000e-04,)] [eta: 3 days, 16:37:09, time (data): 0.863 (0.002)] l_pix: 1.3308e-02
[train..][epoch: 14, iter: 135,100, lr:(2.000e-04,)] [eta: 3 days, 16:35:37, time (data): 0.861 (0.002)] l_pix: 1.2130e-02
End of training. Time consumed: xxx
Save the latest model.
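A quick sanity check on the numbers in the log above shows how large the mismatch is (plain arithmetic on the logged values, nothing from the code):

```python
expected_per_epoch = 36037   # "Require iter number per epoch" from the statistics
observed_iters = 135_100     # final iter count after 14 epochs

print(observed_iters / 14)                   # ≈ 9650 iters actually run per epoch
print(observed_iters / expected_per_epoch)   # ≈ 3.75 epochs' worth of the expected work
```

So each epoch only ran roughly a quarter of the expected iterations, which is why 14 epochs ended at ~135k instead of ~500k.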
In my case, did training actually complete, or did it stop short of 500k iters? The results I am getting now also differ greatly from those reported in the paper.