A question about inference

Hi, Thanks for your excellent work. I have a question about the inference stage code. According to the paper, you simply remove the halted tokens from computation at the inference time. However, I can't find the corresponding code. It seems that the network operates in the same way during training and testing (I paste the related code below). https://github.com/NVlabs/A-ViT/blob/6dee7b65f373e484328c32a7384fbcc2b6aeb1d7/timm/models/act_vision_transformer.py#L422-L511 Am I missing something?

BTW, I've tested the model on ImageNet1K and got the same inference time whether to load weights or not (The results are listed below). To my understanding, the model should predict faster due to adaptive token halt when loading the weights. Could you please tell me why would this happen?

the result (no weight)

42011db10dfcbbdf34bbe6d062d90fd

the result (load official weight)

bb3ccb39cb6861666a2249ce2f69091

NVlabs / A-ViT

A question about inference #1