NVlabs / A-ViT

Official PyTorch implementation of A-ViT: Adaptive Tokens for Efficient Vision Transformer (CVPR 2022)
Apache License 2.0

FLOPs Reduction & Calculation #12

Open oryany12 opened 10 months ago

oryany12 commented 10 months ago

I am particularly interested in understanding the mechanisms behind the reduction in GFLOPS achieved by the adaptive computation framework, and I have some questions I could not resolve myself:

My understanding is that GFLOPS are primarily tied to the number of computations performed, including matrix multiplications. However, I am unclear about how the reduction in the number of tokens affects the GFLOPS calculation, especially since the dimensions of the input tensors remain the same and the halted tokens are simply set to 0, as in act_vision_transformer.py:

    class VisionTransformer(nn.Module):
        def forward_features_act_token(self, x):
            # halted tokens are assigned a value of 0
            out.data = out.data * mask_token.float().view(bs, self.total_token_cnt, 1)

    class Block_ACT(nn.Module):
        def forward_act(self, x, mask=None):
            # matrix multiplications still run at the original, unreduced dimensions
            x = x + self.drop_path(self.attn(self.norm1(x*(1-mask).view(bs, token, 1))*(1-mask).view(bs, token, 1), mask=mask))
            x = x + self.drop_path(self.mlp(self.norm2(x*(1-mask).view(bs, token, 1))*(1-mask).view(bs, token, 1)))

Could you kindly provide insights into how the adaptive computation framework in A-ViT leads to GFLOPS reduction, even when the token dimensions are unchanged?
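For context on why token count matters: the reported FLOPs are typically computed as a function of the number of tokens that are still active in each block, not of the padded tensor shape. Below is my own rough sketch of a per-block FLOP count for a ViT-style block; the constants and the function name are illustrative assumptions (softmax, LayerNorm, and biases are ignored), not the repo's actual counter:

```python
# Hypothetical sketch: approximate FLOPs of one transformer block as a
# function of the number of active tokens N (embed dim D, MLP ratio 4).
# Not the repo's measurement code -- purely to show the scaling.
def block_flops(n_tokens, dim=768, mlp_ratio=4):
    attn_proj = 4 * n_tokens * dim * dim          # qkv + output projections: linear in N
    attn_matmul = 2 * n_tokens * n_tokens * dim   # QK^T and attn @ V: quadratic in N
    mlp = 2 * n_tokens * dim * (mlp_ratio * dim)  # two MLP matmuls: linear in N
    return attn_proj + attn_matmul + mlp

full = block_flops(197)   # all tokens active
half = block_flops(98)    # roughly half the tokens halted
```

Under this accounting, halting half the tokens in a block saves more than half of its attention matmul FLOPs (quadratic term) and half of its projection/MLP FLOPs, even though the masked implementation above still multiplies zeroed rows.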

Could you elaborate on how you measure your GFLOPS?

Thanks in Advance.

YuanYeshang commented 7 months ago

Actually, the author simply omits the halted tokens at inference time. But that raises another question: omitting tokens leads to different tensor shapes across samples, so they cannot be packed into one batch. It is therefore impossible to run inference with a batch size of 64 or the like.
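To make the shape problem concrete, here is a minimal sketch (my own code, not the repo's) of dropping halted tokens at inference. The function name and shapes are assumptions for illustration; it only works with batch size 1, because each sample retains a different number of tokens:

```python
import torch

def prune_halted_tokens(x, mask):
    # x: (1, N, D) token embeddings; mask: (1, N) with 1 = halted, 0 = active.
    # Gather the indices of still-active tokens and keep only those rows.
    keep = (mask[0] == 0).nonzero(as_tuple=True)[0]
    return x[:, keep, :]  # shape (1, N_active, D) -- N_active varies per sample

x = torch.randn(1, 197, 768)
mask = torch.zeros(1, 197)
mask[0, 100:] = 1                     # pretend tokens 100..196 have halted
out = prune_halted_tokens(x, mask)    # shape (1, 100, 768)
# Subsequent matmuls now operate on 100 tokens instead of 197,
# which is where the measured FLOP reduction comes from.
```

With batch size > 1, two samples halting different numbers of tokens would produce ragged shapes that cannot be stacked into one tensor, which is exactly why the training code zeroes tokens with a mask instead of removing them.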