About implementation details

yumath commented 1 year ago

Hi, @MingSun-Tse @yulunzhang ,we are all interesting in your this work, but I also meet reproduce problems. We using this code and training settings, but can't reproduce same results in paper or released checkpoint. same questions about implementation details below: https://github.com/MingSun-Tse/ASSL/issues/3#issue-1261458753 and https://github.com/MingSun-Tse/ASSL/issues/3#issuecomment-1202097388 and https://github.com/MingSun-Tse/ASSL/issues/3#issuecomment-1503240059

So, does something wrong in default training settings? Please disclose more training details about default ASSL like https://github.com/MingSun-Tse/ASSL/issues/3#issuecomment-1148625155, thanks very much!

MingSun-Tse commented 1 year ago

Hi @yumath , thanks for your interest in our work!

What are your training settings (batch size, patch size, #epochs, learning rate) of both the regularized training and finetuning stages?

yumath commented 1 year ago

We using training settings same as your setting: https://github.com/MingSun-Tse/ASSL#run in README.md

MingSun-Tse commented 1 year ago

Have you done sufficient finetuning after the pruning?
Could you post the training and finetuning hyper-params here?

yumath commented 1 year ago

Thanks for your reply. Yes, I have done 5000epoch finetune after pruning, and hyper-params below:

--model LEDSR --scale 2 --patch_size 96 \
    --ext sep --data_url data/ASSL/ --data_train DF2K --data_test Set5 --data_range 1-3550 \
    --chop --save_results --n_resblocks 16 --n_feats 256 \
    --method ASSL --wn --stage_pr [0-1000:0.80859375] --skip_layers *mean*,*tail* \
    --same_pruned_wg_layers model.head.0,model.body.16,*body.2 --reg_upper_limit 0.5 \
    --reg_granularity_prune 0.0001 --update_reg_interval 20 --stabilize_reg_interval 43150 \
    --pre_train ckpt/LEDSRx2_B16C256_8u128bs.pt --same_pruned_wg_criterion reg \
    --save ASSL_pruning/LEDSR_F256R16BIX2_DF2K_ASSL0.80859375_RGP0.0001_RUL0.5_Pretrain

MingSun-Tse commented 11 months ago

Hi @yumath , Okay, I see. The problem may be that, this script is not for the "sufficient finetuning" I meant. Although this script has a part of finetuning, as you may note, it is very short and the batch size & patch size are small. For the best performance, a heavier finetuning (according to the works of network pruning in classification) is recommended -- so after you have the pruned weights (using the scripts in README), try to use the following scripts to do the heavier finetuning:

2X: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main.py --model LEDSR --n_resblocks 16 --n_feats 256 --scale 2 --patch_size 128 --batch_size 256 --ext bin --lr 8e-4 --chop --save_results --pre_train Experiments//log/model/model_latest.pt --n_GPUs 8 --project
3X: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main.py --model LEDSR --n_resblocks 16 --n_feats 256 --scale 3 --patch_size 192 --batch_size 256 --ext bin --lr 8e-4 --chop --save_results --pre_train Experiments//log/model/model_latest.pt --n_GPUs 8 --project
4X: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main.py --model LEDSR --n_resblocks 16 --n_feats 256 --scale 4 --patch_size 256 --batch_size 256 --ext bin --lr 8e-4 --chop --save_results --pre_train Experiments//log/model/model_latest.pt --n_GPUs 8 --project

Let me know if you have more questions. Tx!

yumath commented 11 months ago

@MingSun-Tse Thank you very much, I'll try it as soon as possible!

yumath commented 11 months ago

Hi @MingSun-Tse , thanks for your reply, I have tried heavier finetuning as your suggestion, and reproduced same results as in your paper. But I wonder, what is the meaning of network pruning? I can just reproduce the SOTA result in your paper by heavier finetune a checkpoint trained from scratch.	PSNR x2	Set5	Set14	B100	Urban100	Manga109
reported in your GASSL TPAMI 2023 w/o --self-ensemble	38.08	33.75	32.24	32.29	38.92
Heavier fine-tune a B16C49 which trained from scratch	38.07	33.71	32.24	32.26	38.95

MingSun-Tse commented 10 months ago

Hi @MingSun-Tse , thanks for your reply, I have tried heavier finetuning as your suggestion, and reproduced same results as in your paper. But I wonder, what is the meaning of network pruning? I can just reproduce the SOTA result in your paper by heavier finetune a checkpoint trained from scratch.

PSNR x2 Set5 Set14 B100 Urban100 Manga109 reported in your GASSL TPAMI 2023 w/o --self-ensemble 38.08 33.75 32.24 32.29 38.92 Heavier fine-tune a B16C49 which trained from scratch 38.07 33.71 32.24 32.26 38.95

Hi @yumath , thanks for the further feedback!

One potential problem with the comparison you presented is, GASSL uses non-uniform layerwise pruning ratio (i.e., its #channels is not C49), which actually has lower FLOPs/Params than the C49 model (see Tab. 3 in the TPAMI paper). So the comparison you presented may not be fair.
The scale x2 is quite small (too easy), so different methods show quite close performance. It might be better to also have x3 and x4 results.
Regarding the meaning of filter pruning, there is an ongoing discussion in the pruning community. I guess what you find is similar to the argument in this work.
- I personally believe (also agreed by pretty many researchers in the area, as far as I know), for structured pruning, given abundant finetuning epochs, there should be no significant difference between pruning vs. training from scratch, if these two schemes result in the network of the same architecture - i.e., their searching spaces are the same (esp. when the training strategy of scratch training turns better and better now).
- For SR, the finetuning is usually pretty heavy, so it may not be surprising when you finetune for many epochs, pruning and scratch training present similar performance.
- These said, pruning is meaningful in the sense that it provides better initial weights, which shortens the finetuning stage. Saving time is also a kind of value (when you rent AWS or GCP for training, more training time would cost you more money). This value now is more evident with the rise of foundation models.
- Pruning also has the value when the resulting architecture is unknown in advance, such as our GASSL TPAMI paper, as also argued in this work (they argue pruning can be seen as a kind of NAS to determine the final architecture).

MingSun-Tse / ASSL

About implementation details #8