Open 5g4s opened 1 year ago
1) Random masking of the input image with a moderately large masked patch size (e.g., 32) makes a powerful pre-text task. 2) Predicting RGB values of raw pixels by direct regression performs no worse than the patch classification approaches with complex designs. 3) The prediction head can be as light as a linear layer, with no worse performance than heavier ones.
https://openaccess.thecvf.com/content/CVPR2022/papers/Xie_SimMIM_A_Simple_Framework_for_Masked_Image_Modeling_CVPR_2022_paper.pdf