XPixelGroup / HAT

CVPR 2023: Activating More Pixels in Image Super-Resolution Transformer · arXiv: HAT: Hybrid Attention Transformer for Image Restoration

Training iteration count? #114

Closed · yumath closed 7 months ago

yumath commented 7 months ago

The options file sets 500k iterations. With a batch size of 4, BasicSR converts the total iteration count into an epoch count for training: https://github.com/XPixelGroup/BasicSR/blob/033cd6896d898fdd3dcda32e3102a792efa1b8f4/basicsr/train.py#L48
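For readers unfamiliar with that line, here is a minimal sketch of the conversion it performs (variable names follow the BasicSR option keys, simplified for illustration; the values are plugged in from the statistics below):

```python
import math

# Sketch of the epoch/iteration bookkeeping in basicsr/train.py (linked above).
num_train_images = 144147        # DF2K
dataset_enlarge_ratio = 1
batch_size_per_gpu = 4
world_size = 1                   # number of GPUs (processes)
total_iter = 500000

num_iter_per_epoch = math.ceil(
    num_train_images * dataset_enlarge_ratio / (batch_size_per_gpu * world_size))
total_epochs = math.ceil(total_iter / num_iter_per_epoch)
print(num_iter_per_epoch, total_epochs)  # 36037 14
```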

Taking the DF2K dataset as an example:

Training statistics:
    Number of train images: 144147
    Dataset enlarge ratio: 1
    Batch size per gpu: 4
    World size (gpu number): 1
    Require iter number per epoch: 36037
    Total epochs: 14; iters: 500000.
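(These figures match the sketch above: ceil(144147 × 1 / (4 × 1)) = 36037 iterations per epoch, and ceil(500000 / 36037) = 14 epochs, so 14 full epochs should come to about 14 × 36037 ≈ 504,518 iterations.)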

But after training for 14 epochs, the iteration count had only reached 135,100, and the eta still showed 3 days; nowhere near 500k iterations:

[train..][epoch: 14, iter: 135,000, lr:(2.000e-04,)] [eta: 3 days, 16:37:09, time (data): 0.863 (0.002)] l_pix: 1.3308e-02
[train..][epoch: 14, iter: 135,100, lr:(2.000e-04,)] [eta: 3 days, 16:35:37, time (data): 0.861 (0.002)] l_pix: 1.2130e-02
End of training. Time consumed: xxx
Save the latest model.

So in my case, is training actually finished, or did it stop short of 500k iterations? The results I am getting also differ greatly from those reported in the paper.

yumath commented 7 months ago

Put simply: training was configured for 500k iterations but ended after only 135k. Is this normal?

yumath commented 7 months ago

Do I need to increase dataset_enlarge_ratio?
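For context, dataset_enlarge_ratio virtually repeats the training set, so each epoch covers proportionally more iterations (see the formula in the sketch above). In a BasicSR-style options file it sits under the training dataset; a sketch, where the value 4 is purely illustrative:

```yaml
datasets:
  train:
    # virtually repeat the dataset: iterations per epoch scale by this factor
    # (4 is an illustrative value, not a recommendation)
    dataset_enlarge_ratio: 4
```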

yumath commented 7 months ago

https://github.com/XPixelGroup/HAT/issues/26#issuecomment-1288154862 suggests setting a small batch size: https://github.com/XPixelGroup/HAT/blob/1b22ba0aff82d9d041f5bfa763f82649e6c23d99/options/train/train_HAT_SRx2_from_scratch.yml#L26. Should I set batch_size to 1?

yumath commented 7 months ago

OK, I found the bug. https://github.com/XPixelGroup/HAT#how-to-train says to launch hat/train.py with a distributed call, but I had not done so. As a result, opt['world_size'] was set to 1, causing a mismatch between the iteration and epoch counts and hence insufficient training.
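For reference, a sketch of the kind of distributed launch the README describes, assuming a 4-GPU machine (the flags follow the standard torch.distributed.launch pattern used by BasicSR; check the README for the exact command):

```bash
# Illustrative 4-GPU launch; adjust GPU ids, nproc_per_node, and the options file.
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python -m torch.distributed.launch --nproc_per_node=4 --master_port=4321 \
    hat/train.py -opt options/train/train_HAT_SRx2_from_scratch.yml --launcher pytorch
```

With --launcher pytorch, opt['world_size'] is derived from the launched process group rather than defaulting to 1, so the epoch/iteration bookkeeping matches the actual multi-GPU setup.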