Open oryany12 opened 10 months ago
Actually, the author simply omits the halted tokens at inference time. But that raises another question: omitting tokens leads to different tensor shapes across samples, which makes it impossible to stack them into one batch — so you cannot run inference with a batch size of 64 or so.
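To illustrate the ragged-shape problem (a toy sketch, not A-ViT's actual code — the function and mask layout here are hypothetical):

```python
import numpy as np

def drop_halted_tokens(tokens, halted_mask):
    """Remove halted tokens from one sample's token sequence.
    tokens: (num_tokens, dim); halted_mask: (num_tokens,) bool."""
    return tokens[~halted_mask]

dim = 8
# Two samples where a different number of tokens has halted.
sample_a = drop_halted_tokens(np.zeros((16, dim)),
                              np.array([True] * 4 + [False] * 12))
sample_b = drop_halted_tokens(np.zeros((16, dim)),
                              np.array([True] * 9 + [False] * 7))

print(sample_a.shape)  # (12, 8)
print(sample_b.shape)  # (7, 8)
# The shapes differ, so np.stack([sample_a, sample_b]) raises an error:
# the two samples can no longer share one dense batch tensor.
```

This is why, in a batched setting, implementations often keep the full tensor shape and only mask the halted tokens, which brings us back to the FLOPs question below.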
I am particularly interested in understanding the mechanisms behind the reduction in GFLOPS achieved by the adaptive computation framework, and I have some questions I could not resolve on my own:
My understanding is that GFLOPS are primarily tied to the number of computations performed, including matrix multiplications. However, I am unclear about how reducing the number of tokens affects the GFLOPS calculation when the dimensions of the input tensors remain the same and the halted tokens are simply set to 0, as in the act_vision_transformer.py file:
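For context on why token count matters even when the dense tensors are padded: a common convention for adaptive-compute papers is to report FLOPs analytically, counting only the tokens that are still active at each layer, rather than profiling the padded matmuls (where zeroed tokens still cost multiply-adds). A rough per-block count might look like this — a sketch under assumed conventions, not the script the A-ViT authors necessarily used:

```python
def vit_block_flops(num_tokens, dim, mlp_ratio=4):
    """Approximate multiply-accumulate count for one transformer block,
    counting only the active tokens (illustrative analytic convention)."""
    n, d = num_tokens, dim
    attn_proj = 4 * n * d * d          # Q, K, V and output projections
    attn_matmul = 2 * n * n * d        # QK^T and attention-weighted sum of V
    mlp = 2 * n * d * (mlp_ratio * d)  # two MLP linear layers
    return attn_proj + attn_matmul + mlp

full = vit_block_flops(197, 384)     # all 197 tokens active (DeiT-S-like sizes)
reduced = vit_block_flops(120, 384)  # only 120 tokens still computing
print(reduced / full)                # below 1: fewer active tokens, fewer FLOPs
```

Under this convention the reduction shows up immediately, because every term scales with the active-token count `n` (and attention quadratically so) — even though a masked dense forward pass would still execute the full-shape matmuls.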
Could you kindly provide insights into how the adaptive computation framework in A-ViT leads to a GFLOPS reduction, even when the token dimensions are unchanged?
Could you elaborate on how you measure your GFLOPS?
Thanks in advance.