Why not use the masked transformers directly in the first two stages?

Alpha-VL / ConvMAE

ConvMAE: Masked Convolution Meets Masked Autoencoders

Open xwan0527 opened 4 months ago

xwan0527 commented 4 months ago

Why do the first two stages use convolutions instead of masked transformer blocks? Since upsampling is already used to obtain the mask matrix for those stages, it seems that masked transformers could be applied there as well.
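
To make the question concrete, here is a minimal sketch of the kind of mask upsampling referred to above. It is an illustration in plain Python, not the repository's code: the function name `upsample_mask`, the 2x2 example mask, and the factors 2 and 4 are all assumptions chosen for the example.

```python
def upsample_mask(mask, factor):
    """Nearest-neighbour upsampling of a 2-D binary mask (list of lists).

    Each entry of the coarse mask is repeated `factor` times along both
    axes, so one coarse cell covers a factor x factor block of fine cells.
    """
    out = []
    for row in mask:
        # Repeat every value `factor` times along the width.
        wide = [v for v in row for _ in range(factor)]
        # Repeat the widened row `factor` times along the height.
        out.extend([list(wide) for _ in range(factor)])
    return out


# Hypothetical coarse mask: 1 = masked, 0 = visible.
coarse = [[1, 0],
          [0, 1]]

fine = upsample_mask(coarse, 2)   # 4x4 mask for a finer-resolution stage
finer = upsample_mask(coarse, 4)  # 8x8 mask for an even finer stage

for row in fine:
    print(row)
```

With the mask available at each resolution like this, the masked and visible positions are known at every stage, which is why it seems a masked transformer block could in principle consume them too.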