facebookresearch / hiera

Hiera: A fast, powerful, and simple hierarchical vision transformer.
Apache License 2.0

Is it possible to replace the Hiera backbone with other hierarchical models like Swin Transformer or CAFormer? #33

Closed Andy-Ko-0620 closed 3 days ago

Andy-Ko-0620 commented 4 weeks ago

I am researching fast and memory-efficient self-supervised pre-training that is compatible with different vision transformer architectures. In the third section of your paper (3.1. Preparing MViTv2), you state "We choose MViTv2 as our base architecture, as its small 3 × 3 kernels are affected the least by the separate-and-pad trick described in Fig. 4d, though we likely could have chosen a different transformer and obtained a similar end result." Given that, is it possible to replace the Hiera backbone with other hierarchical models like Swin Transformer or CAFormer?

dbolya commented 4 weeks ago

Hi, by "though we likely could have chosen a different transformer and obtained a similar end result", we didn't mean you could use any hierarchical transformer with MAE out of the box. What we meant is that if you follow the process from the methodology section of the paper (start with an off-the-shelf hierarchical transformer, then remove or simplify components that are redundant when training with MAE), you would likely end up with the same resulting architecture--i.e., the Hiera architecture.

Notice how we start with MViT, but Hiera as an architecture has very little in common with the MViT architecture itself. The same thing would happen if we started with a different transformer like Swin. Basically, Hiera is pretty much as basic a hierarchical architecture as you can get, meaning its power comes mostly from the training method (i.e., MAE), just like the original ViT.

That isn't to say you couldn't use Swin / CAFormer with MAE. It's just that you'd have to make concessions / workarounds like we had to for MViTv2 (in our case, "separate and pad"), which slow down training.
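To make the constraint concrete, here is a minimal PyTorch sketch (not this repo's actual code; the grid and mask-unit sizes are purely illustrative) of why MAE-style sparsity is cheap when every operation stays inside a "mask unit": masked units can simply be dropped, and the kept ones remain self-contained. Anything that reaches across unit boundaries (overlapping pooling convs, shifted windows) breaks this, which is what forces workarounds like separate-and-pad.

```python
import torch

# Illustrative sizes: a 28x28 token grid split into 7x7-token mask units.
B, H, W, C = 2, 28, 28, 96
mu = 7
tokens = torch.randn(B, H, W, C)

# Group tokens into mask units: (B, num_units, tokens_per_unit, C).
units = (
    tokens.reshape(B, H // mu, mu, W // mu, mu, C)
    .permute(0, 1, 3, 2, 4, 5)
    .reshape(B, (H // mu) * (W // mu), mu * mu, C)
)

# MAE-style masking: keep a random 25% of the mask units and drop the rest.
num_units = units.shape[1]
keep = torch.rand(B, num_units).argsort(dim=1)[:, : num_units // 4]
kept = torch.gather(units, 1, keep[..., None, None].expand(-1, -1, mu * mu, C))

# Local attention within each kept unit needs no padding or special handling,
# because no operation reaches across unit boundaries.
print(kept.shape)  # (B, num_units // 4, mu * mu, C)
```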

Andy-Ko-0620 commented 4 weeks ago

Thank you for your reply. Taking Swin Transformer as an example, there are methods like UM-MAE/MCMAE that enable sparse MAE training. If I used their MAE and conducted the process of removing/simplifying redundant components, I would likely end up with the same resulting architecture. Is my understanding correct?

dbolya commented 4 weeks ago

Yes, that is correct. For Swin specifically, you'd probably get rid of shifting in the process, but there is a design choice of where to stop using window attn. We did window attn in stages 1 and 2, but presumably you could also do it in stage 3, since stage 4 is still global. That's something we didn't explore.
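For what it's worth, here is a hedged sketch of that design choice (plain PyTorch, not Hiera's or Swin's actual modules; constant width and no pooling between stages, just to show where the local/global switch sits):

```python
import torch
import torch.nn as nn


class StageAttention(nn.Module):
    """Self-attention over either local windows or all tokens."""

    def __init__(self, dim, heads, window=None):
        super().__init__()
        self.window = window  # None -> global attention
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):  # x: (B, N, C)
        B, N, C = x.shape
        if self.window is not None:
            # Fold windows into the batch dim so attention stays local.
            x = x.reshape(B * (N // self.window), self.window, C)
        x, _ = self.attn(x, x, x, need_weights=False)
        return x.reshape(B, N, C)


# Window attn in stages 1-2, global attn in stages 3-4 (the choice discussed
# above); a variant could also keep stage 3 windowed, since stage 4 is global.
stage_windows = [49, 49, None, None]
blocks = nn.ModuleList(StageAttention(dim=96, heads=3, window=w) for w in stage_windows)

x = torch.randn(2, 49 * 4, 96)  # (B, N, C); N must be divisible by the window size
for blk in blocks:
    x = blk(x)
print(x.shape)  # (2, 196, 96)
```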

Andy-Ko-0620 commented 3 weeks ago

Thank you for getting back to me. It sounds like you've already done some experiments with Swin.

If I understand correctly, you used window attention in stages 1 and 2, and global attention in stages 3 and 4. Given that the throughput of Swin-S is 436.9 im/s while that of ViT-S is 940.4 im/s, does this mean that replacing window attention with global attention can speed up inference without an accuracy drop when using MAE?

By the way, did you use UM-MAE/MCMAE for the Swin experiment, or did you use other methods to enable sparse MAE training?

dbolya commented 3 weeks ago

To clarify: we haven't run any experiments with Swin. I was talking hypothetically.

As for the speed difference between Swin-S and ViT-S, there are more differences between the two architectures than shifted window attn. While shifted window attn isn't as fast as normal window attn, the real reason Swin is so much slower is that it simply has more layers / features overall. Swin-S has 24 layers, while ViT-S has only 12. Note that Swin-L reverses this trend and has far fewer features / less computation than ViT-L (in fact, Swin-L is more comparable to our Hiera-B+ in terms of compute).
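If you want to check the size mismatch yourself, a quick comparison with timm works (this assumes a standard timm install; the model names below are timm's defaults and are not tied to this repo):

```python
import timm

for name in ["swin_small_patch4_window7_224", "vit_small_patch16_224"]:
    model = timm.create_model(name, pretrained=False)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M params")

# Swin-S also stacks 2 + 2 + 18 + 2 = 24 transformer blocks vs. ViT-S's 12.
```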

Put simply, the Swin authors didn't match up the different sizes of Swin very well against their ViT counterparts.