Hi,
I hope I have understood and answered all your questions. Please follow up if you have any further questions.
Best, Lei
Thanks.
To summarize:
For the downstream segmentation task, the MAE pre-trained encoder weights are used to initialize the ViT encoder. Then, in the fine-tuning stage, all patches are used and no masking is applied. This works because the ViT is a flexible architecture that can process any set of patches with the same network.
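For concreteness, here is a minimal sketch of that weight transfer, assuming a PyTorch MAE checkpoint and MONAI's UNETR. The checkpoint path, the `"model"` key, the key-name filter, and the channel/image-size settings are placeholders, not the repo's actual values:

```python
import torch
from monai.networks.nets import UNETR  # UNETR = ViT encoder + conv decoder

# Placeholder checkpoint; key names follow the reference MAE repo and
# usually need remapping to whatever ViT implementation you fine-tune.
ckpt = torch.load("mae_pretrained.pth", map_location="cpu")["model"]

# Keep only encoder weights; the MAE decoder and mask token are discarded.
encoder_sd = {k: v for k, v in ckpt.items()
              if not k.startswith("decoder") and k != "mask_token"}

model = UNETR(in_channels=1, out_channels=14, img_size=(96, 96, 96))

# Initialize the ViT backbone; strict=False since decoder/head keys differ.
missing, unexpected = model.vit.load_state_dict(encoder_sd, strict=False)
print(f"missing: {len(missing)}, unexpected: {len(unexpected)}")

# Fine-tuning then feeds the full patch sequence; no masking is applied.
```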
Great, thanks for the clarification.
Thank you for your paper.
1) When you apply the MAE ViT: the MAE architecture consists of an encoder and a decoder. Did you use only the MAE encoder? And did you use a mask ratio of 0 so that the token embeddings keep the size needed to reconstruct the original input? As you know, with a nonzero mask ratio the encoder's token sequence gets smaller.
2) Did you freeze the MAE encoder, which consists of the patch embedding, transformer blocks, and normalization layers?
3) Finally, is the architecture you used the MAE encoder (taken before the mask tokens are merged back in the MAE decoder, which is not used) + the UNETR decoder?
Thanks.
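As a toy illustration of the point raised in question 1: the number of tokens the MAE encoder actually processes shrinks with the mask ratio, which is why fine-tuning without masking restores the full sequence. The patch count below (196, i.e. a 14x14 grid from a 224x224 image with 16x16 patches) is just an example:

```python
def visible_tokens(num_patches: int, mask_ratio: float) -> int:
    # The MAE encoder only sees the unmasked patches.
    return int(num_patches * (1 - mask_ratio))

print(visible_tokens(196, 0.75))  # 49  -> pre-training with 75% masking
print(visible_tokens(196, 0.0))   # 196 -> fine-tuning, no masking
```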