Open oryany12 opened 10 months ago
Actually, the author simply omits the halted tokens at inference time. But that raises another question: omitting tokens leads to different tensor shapes across samples, which makes it impossible to stack them into one batch — so you cannot run inference with a batch size of 64 or so.
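To illustrate the ragged-shape problem (a toy sketch, not A-ViT's actual code — the function and mask layout here are hypothetical):

```python
import numpy as np

def drop_halted_tokens(tokens, halted_mask):
    """Remove halted tokens from one sample's token sequence.
    tokens: (num_tokens, dim); halted_mask: (num_tokens,) bool."""
    return tokens[~halted_mask]

dim = 8
# Two samples where a different number of tokens has halted.
sample_a = drop_halted_tokens(np.zeros((16, dim)),
                              np.array([True] * 4 + [False] * 12))
sample_b = drop_halted_tokens(np.zeros((16, dim)),
                              np.array([True] * 9 + [False] * 7))

print(sample_a.shape)  # (12, 8)
print(sample_b.shape)  # (7, 8)
# The shapes differ, so np.stack([sample_a, sample_b]) raises an error:
# the two samples can no longer share one dense batch tensor.
```

This is why, in a batched setting, implementations often keep the full tensor shape and only mask the halted tokens, which brings us back to the FLOPs question below.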
I am particularly interested in understanding the mechanisms behind the reduction in GFLOPS achieved by the adaptive computation framework, and I have some questions I could not resolve on my own:
My understanding is that GFLOPS are primarily tied to the number of computations performed, including matrix multiplications. However, I am unclear about how reducing the number of tokens affects the GFLOPS calculation when the dimensions of the input tensors remain the same and the halted tokens are simply set to 0, as in the act_vision_transformer.py file:
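For context on why token count matters even when the dense tensors are padded: a common convention for adaptive-compute papers is to report FLOPs analytically, counting only the tokens that are still active at each layer, rather than profiling the padded matmuls (where zeroed tokens still cost multiply-adds). A rough per-block count might look like this — a sketch under assumed conventions, not the script the A-ViT authors necessarily used:

```python
def vit_block_flops(num_tokens, dim, mlp_ratio=4):
    """Approximate multiply-accumulate count for one transformer block,
    counting only the active tokens (illustrative analytic convention)."""
    n, d = num_tokens, dim
    attn_proj = 4 * n * d * d          # Q, K, V and output projections
    attn_matmul = 2 * n * n * d        # QK^T and attention-weighted sum of V
    mlp = 2 * n * d * (mlp_ratio * d)  # two MLP linear layers
    return attn_proj + attn_matmul + mlp

full = vit_block_flops(197, 384)     # all 197 tokens active (DeiT-S-like sizes)
reduced = vit_block_flops(120, 384)  # only 120 tokens still computing
print(reduced / full)                # below 1: fewer active tokens, fewer FLOPs
```

Under this convention the reduction shows up immediately, because every term scales with the active-token count `n` (and attention quadratically so) — even though a masked dense forward pass would still execute the full-shape matmuls.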
Could you kindly provide insights into how the adaptive computation framework in A-ViT leads to a GFLOPS reduction, even when the token dimensions are unchanged?
Could you elaborate on how you measure your GFLOPS?
Thanks in advance.