facebookresearch / long_seq_mae

Code release of the research paper "Exploring Long-Sequence Masked Autoencoders"

May you release the code to finetune on ADE20K? #1

Open Wallace-222 opened 2 years ago

Wallace-222 commented 2 years ago

Hello authors, this is really inspiring work, and it is very kind of you to release the code alongside the paper. Could you please also release the code for fine-tuning on ADE20K? Although your paper states that these experiments simply follow MAE, I am unable to find the corresponding code in the official MAE repository. Thanks a lot for your attention. Best wishes.

ronghanghu commented 2 years ago

Hi @Wallace-222, sorry that we don't have a code release for the ADE20K segmentation model yet. For this experiment, we follow the implementation in the MAE paper. I think one can adapt the Swin-Transformer semantic segmentation repo (https://github.com/SwinTransformer/Swin-Transformer-Semantic-Segmentation) and replace the backbone with ViT for this experiment.
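For illustration, a minimal sketch of such an adaptation (not the authors' code; the timm model name, image size, and feature handling are assumptions) could wrap a ViT so its patch tokens come out as a 2D feature map that a UperNet-style decode head can consume:

```python
# Minimal sketch, assuming a recent timm where forward_features() returns the
# full token sequence. Model name, image size, and feature handling are
# illustrative only, not the repository's implementation.
import torch
import timm


class ViTSegBackbone(torch.nn.Module):
    def __init__(self, name="vit_base_patch16_224", img_size=512, patch_size=16):
        super().__init__()
        # In practice you would load the long_seq_mae pre-trained checkpoint here.
        self.vit = timm.create_model(name, pretrained=False, img_size=img_size)
        self.patch_size = patch_size

    def forward(self, x):
        B, _, H, W = x.shape
        tokens = self.vit.forward_features(x)  # (B, prefix_tokens + N, C)
        # Drop the class/prefix tokens, keeping only the patch tokens.
        tokens = tokens[:, getattr(self.vit, "num_prefix_tokens", 1):]
        h, w = H // self.patch_size, W // self.patch_size
        return tokens.transpose(1, 2).reshape(B, -1, h, w)  # (B, C, h, w)


# A real UperNet setup taps several intermediate blocks for multi-scale features;
# this sketch returns only the final feature map for brevity.
# feats = ViTSegBackbone()(torch.randn(1, 3, 512, 512))  # -> (1, 768, 32, 32)
```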

youngwanLEE commented 1 year ago

@ronghanghu Hi, in the original MAE paper and its official code, the hyper-parameter information for ADE20K fine-tuning (e.g., learning rate, weight decay) is not provided. It would be helpful to share these hyper-parameters with the community :).

ronghanghu commented 1 year ago

Hi @youngwanLEE, we follow the same settings as in BEiT, MAE, and ConvNeXt for the ADE20K experiments and sweep the hyperparameters. Our final hyperparameters are as follows:

Hope these are helpful!
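Since the swept values themselves are not reproduced here, a hedged sketch of the layer-wise learning-rate decay that these BEiT/MAE/ConvNeXt-style fine-tuning recipes rely on (with placeholder numbers, assuming a timm/MAE-style ViT module layout) might look like:

```python
# Hedged sketch, NOT the released recipe: base_lr / layer_decay / weight_decay
# below are placeholders, not the authors' final swept values. Assumes a ViT
# with .blocks, patch_embed, pos_embed, and cls_token attributes.
import torch


def param_groups_layer_decay(vit, base_lr=1e-4, weight_decay=0.05, layer_decay=0.65):
    """Build AdamW param groups where earlier ViT blocks get smaller learning rates."""
    num_layers = len(vit.blocks) + 1  # +1 so the embeddings sit below block 0
    groups = {}
    for name, p in vit.named_parameters():
        if not p.requires_grad:
            continue
        # Depth assignment: embeddings -> 0, blocks.i -> i + 1, head/final norm -> num_layers.
        if name.startswith(("patch_embed", "pos_embed", "cls_token")):
            layer_id = 0
        elif name.startswith("blocks."):
            layer_id = int(name.split(".")[1]) + 1
        else:
            layer_id = num_layers
        scale = layer_decay ** (num_layers - layer_id)
        # No weight decay on 1-D params (biases, norms) or the embedding tokens.
        wd = 0.0 if (p.ndim == 1 or name in ("pos_embed", "cls_token")) else weight_decay
        key = (layer_id, wd)
        groups.setdefault(key, {"params": [], "lr": base_lr * scale, "weight_decay": wd})
        groups[key]["params"].append(p)
    return list(groups.values())


# optimizer = torch.optim.AdamW(param_groups_layer_decay(backbone))
```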

youngwanLEE commented 1 year ago

@ronghanghu It would be very helpful to the community :)

Many thanks!

ggjy commented 1 year ago

@ronghanghu Hi ronghang, thanks for sharing this great work. Table 2(c) of the main paper indicates that models with different patch sizes (e.g., 8, 16, 24) are fine-tuned on COCO/ADE20K with the same transfer input size, so these three models have different GPU memory usage and different FLOPs but can still achieve similar results. Do I understand this correctly?

ronghanghu commented 1 year ago

Hi @ggjy, during COCO (and similarly ADE20K) fine-tuning, all three pre-trained models in Table 2(c) are fine-tuned with the same ViT patch size of 16 and the same image size of 1024, so they have the same GPU memory usage and FLOPs during fine-tuning. This is mentioned under "Setups" in Sec. 4.1 (page 4, right column).

Notably, the Mask R-CNN detection backbone always uses a ViT with patch size 16 and image size 1024 × 1024, and hence a fixed sequence length of L = 4096 during detection fine-tuning for all pre-trained models. When there is a mismatch in sizes, the ViT position embeddings are bicubic-interpolated to L = 4096 following the practice in [30]. The same is applied to patch embedding layers, where the weights are treated as 2D convolution filters and bicubic-interpolated when needed [29].
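For illustration, a minimal sketch of this interpolation (an assumption about the details, not the repository's code; the checkpoint key names and grid sizes are only examples) could be:

```python
# Sketch only: pos_embed is assumed to be (1, 1 + G*G, C) with a leading class
# token, and the patch-embedding weight a (embed_dim, 3, p, p) conv kernel.
import torch
import torch.nn.functional as F


def interpolate_pos_embed(pos_embed, old_grid, new_grid):
    """Bicubic-resize the 2D grid of patch position embeddings."""
    cls_tok, grid_tok = pos_embed[:, :1], pos_embed[:, 1:]
    C = grid_tok.shape[-1]
    grid_tok = grid_tok.reshape(1, old_grid, old_grid, C).permute(0, 3, 1, 2)
    grid_tok = F.interpolate(grid_tok, size=(new_grid, new_grid),
                             mode="bicubic", align_corners=False)
    grid_tok = grid_tok.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, C)
    return torch.cat([cls_tok, grid_tok], dim=1)


def interpolate_patch_embed(weight, new_patch):
    """Treat the patch-embedding weights as 2D conv filters and resize them."""
    return F.interpolate(weight.float(), size=(new_patch, new_patch),
                         mode="bicubic", align_corners=False)


# e.g. a model pre-trained with patch size 8 at 224 px (28 x 28 grid) adapted to
# patch size 16 at 1024 px (64 x 64 grid = 4096 tokens). Key names are hypothetical:
# new_pos = interpolate_pos_embed(ckpt["pos_embed"], old_grid=28, new_grid=64)
# new_w   = interpolate_patch_embed(ckpt["patch_embed.proj.weight"], new_patch=16)
```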

ggjy commented 1 year ago

Got it! Thanks very much for your quick reply.