Sense-X / UniFormer

[ICLR2022] official implementation of UniFormer
Apache License 2.0

a basic question about pretrained model #87

Closed: go-ahead-maker closed this issue 1 year ago

go-ahead-maker commented 1 year ago

Hi authors, nice work! I have a basic question about the checkpoint file / pretrained weights. For image classification, the saved checkpoint is a dict:

```python
utils.save_on_master({
    'model': model_without_ddp.state_dict(),
    'optimizer': optimizer.state_dict(),
    'lr_scheduler': lr_scheduler.state_dict(),
    'epoch': epoch,
    'model_ema': get_state_dict(model_ema),
    'scaler': loss_scaler.state_dict(),
    'args': args,
    'max_accuracy': max_accuracy,
}, checkpoint_path)
```

When model_ema is enabled, the checkpoint therefore contains both model and model_ema. Which of the two is used when loading the model into a downstream task? I checked the load_checkpoint function used in UniFormer (which follows Swin), and it seems to choose model. So is model_ema never used?

Andy1621 commented 1 year ago

Yes. Actually, we follow the code of DeiT and do not test model_ema in our codebase. For the downstream tasks, we only use model, not model_ema.
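That loading behavior can be sketched as follows. This is a toy stand-in (a tiny nn.Sequential and a simulated checkpoint dict), not the actual load_checkpoint code from UniFormer or Swin; it only illustrates that the 'model' entry is selected and 'model_ema' is ignored:

```python
import torch
import torch.nn as nn

# A tiny stand-in for the downstream backbone (illustrative only).
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

# Simulate a classification checkpoint containing both 'model' and 'model_ema'.
checkpoint = {
    'model': model.state_dict(),
    'model_ema': model.state_dict(),
    'epoch': 299,
}

# Downstream loading picks the 'model' entry; 'model_ema' is simply not read.
missing, unexpected = model.load_state_dict(checkpoint['model'], strict=False)
print(list(missing), list(unexpected))  # both empty when all keys match
```

With strict=False, any mismatched keys are reported in the returned missing/unexpected lists instead of raising, which is why downstream loaders commonly use it when the classification head differs from the pretraining head.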

In my later experiments, model_ema also did not help for the current models. It may be more suitable for lightweight models (FLOPs < 1G), where it is a common training technique.

go-ahead-maker commented 1 year ago

Thanks for your valuable reply! May I ask one more question: if I enable model_ema during training, will it affect the training of the original model, or does EMA copy the original model's parameters and update the copy independently? From reading the EMA code, it seems to first copy the original model and then update the copy after each backward pass of the original model. So I suppose model_ema does not affect the original model's weights. Thank you again for your patience with this basic issue!

Andy1621 commented 1 year ago

Sorry for the late reply. Yes, model_ema does not affect the model parameters. It works like ensembling models via a weighted average of their parameters, so it does not change the original model's performance. However, the EMA model usually performs better during training, which is why EMA models are often used as teachers in contrastive learning. In my experiments, though it worked better in the middle epochs, it achieved similar results in the end.
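The update rule being described is a simple exponential moving average of the weights. A minimal pure-Python sketch (the decay value and names here are illustrative, not the DeiT/timm implementation, which does the same update on torch tensors) shows that updating the EMA copy only reads the original parameters and never writes back into them:

```python
DECAY = 0.999  # illustrative decay; real recipes use values close to 1

def ema_update(ema_params, model_params, decay=DECAY):
    # ema <- decay * ema + (1 - decay) * model; model_params are only read.
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, model_params)]

# The EMA starts as a copy of the model's parameters...
model_params = [0.5, -1.2]
ema_params = list(model_params)

# ...then, after each optimizer step, the EMA copy is updated independently.
model_params = [0.6, -1.1]  # pretend an optimizer step changed the model
ema_params = ema_update(ema_params, model_params)

print(model_params)  # untouched by the EMA update: [0.6, -1.1]
print(ema_params)    # slowly tracks the model's weights
```

Because the EMA weights are a decayed average over many training steps, they behave like an ensemble of recent model snapshots, which matches the "ensembling via weighted parameters" intuition above.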

go-ahead-maker commented 1 year ago

Much appreciated for your detailed explanations, they do help me a lot. Looking forward to your future works~