cvlab-stonybrook / SelfMedMAE

Code for ISBI 2023 paper "Self Pre-training with Masked Autoencoders for Medical Image Classification and Segmentation"
Apache License 2.0

About MAE ViT backbone and UNETR decoder #7

Closed: kimsekeun closed this issue 6 months ago

kimsekeun commented 6 months ago

Thank you for your paper.

1) When you apply MAE with a ViT: the MAE architecture has both an encoder and a decoder. Did you use the MAE encoder? And did you use a mask ratio of 0 so that the token embeddings keep the size needed to reconstruct the original input? With a nonzero mask ratio, the set of token embeddings gets smaller, as you know.

2) Did you freeze the MAE encoder, which consists of the patch embedding, transformer blocks, and normalization layers?

3) Finally, is the architecture you used the MAE encoder (taken before the mask tokens are merged back in the MAE decoder, which is not used) + the UNETR decoder?

Thanks.

kimsekeun commented 6 months ago
  1. Did you freeze the MAE encoder and train only the UNETR decoder, or did you just initialize the encoder with the MAE pre-trained weights?

BannyStone commented 6 months ago

Hi,

  1. We used an MAE encoder that processes only the visible patches, rather than all patches, during pre-training (see the sketch after this list).
  2. In both the pre-training and fine-tuning stages, the encoder is not frozen.
  3. ViT encoder (MAE pre-trained) + UNETR decoder is the architecture.
  4. We fine-tuned the whole network, including the pre-trained encoder and the UNETR decoder.
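
For a rough picture of point 1, here is a minimal PyTorch-style sketch of MAE random masking, where only the visible tokens are passed to the encoder. This is illustrative only, not the exact code in this repo:

```python
# Minimal sketch of MAE-style random masking (illustrative; not the exact
# SelfMedMAE code). `tokens` are patch embeddings of shape (B, N, D).
import torch

def random_masking(tokens, mask_ratio=0.75):
    B, N, D = tokens.shape
    len_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N, device=tokens.device)    # one score per token
    ids_shuffle = torch.argsort(noise, dim=1)         # random permutation
    ids_restore = torch.argsort(ids_shuffle, dim=1)   # inverse permutation

    ids_keep = ids_shuffle[:, :len_keep]
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    # 0 = kept (visible), 1 = masked, back in the original token order
    mask = torch.ones(B, N, device=tokens.device)
    mask[:, :len_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)
    return visible, mask, ids_restore

# During pre-training only `visible` goes through the ViT encoder; the small
# MAE decoder later re-inserts mask tokens (using ids_restore) and
# reconstructs the masked patches.
```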

Hope I have understood and answered all your questions well. Please follow up if you have further questions.

Best, Lei

kimsekeun commented 6 months ago

Thanks.

To summarize,

  1. MAE pre-training: the MAE encoder (ViT) and the MAE decoder (ViT) are used to pre-train the MAE.
  2. Downstream segmentation task: the MAE decoder (ViT) is not used; instead, the MAE encoder (ViT) with visible patches and the UNETR decoder are used for the downstream task.

BannyStone commented 6 months ago

For the downstream segmentation task, the MAE pre-trained encoder weights are used to initialize the weights of a ViT. Then, in the fine-tuning stage, all the patches are used and there is no masking. This works because ViT is a flexible architecture that can handle any set of patches with the same network.
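
To make that concrete, here is a simplified sketch of the initialization step, assuming a MONAI-style UNETR whose ViT encoder sits under `.vit` and a hypothetical checkpoint file; the names and values are illustrative, not the exact ones in this repo:

```python
# Illustrative only: assumes MONAI's UNETR (ViT encoder under `.vit`) and a
# hypothetical MAE checkpoint file; not the exact code in this repository.
import torch
from monai.networks.nets import UNETR

seg_model = UNETR(in_channels=1, out_channels=14, img_size=(96, 96, 96))

ckpt = torch.load("mae_pretrained.pth", map_location="cpu")

# Keep only the encoder weights; the MAE decoder is discarded after
# pre-training. The "state_dict" key is a guess and depends on how the
# checkpoint was saved.
encoder_state = {k: v for k, v in ckpt["state_dict"].items()
                 if not k.startswith("decoder")}
seg_model.vit.load_state_dict(encoder_state, strict=False)

# No masking at this stage: every patch is fed to the ViT, and the whole
# network (pre-trained encoder + UNETR decoder) is fine-tuned end to end.
optimizer = torch.optim.AdamW(seg_model.parameters(), lr=1e-4)
```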

kimsekeun commented 6 months ago

Great, thanks for the clarification.