Hi, thanks for your excellent work. I have a question about the inference-stage code. According to the paper, halted tokens are simply removed from the computation at inference time, but I can't find the corresponding code. The network seems to operate the same way during training and testing (the relevant code is here):

https://github.com/NVlabs/A-ViT/blob/6dee7b65f373e484328c32a7384fbcc2b6aeb1d7/timm/models/act_vision_transformer.py#L422-L511

Am I missing something?
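To make my expectation concrete, here is a minimal toy sketch of what I thought "removing halted tokens" would look like at inference time. This is my own illustration, not code from the repo; `drop_halted_tokens`, the `eps` threshold, and the shapes are all assumptions:

```python
import torch

# Toy sketch (my assumption, not the authors' code): tokens whose
# cumulative halting score exceeds 1 - eps are physically gathered out,
# so later blocks only attend over the surviving tokens.
def drop_halted_tokens(x, cumulative_halt, eps=0.01):
    """x: (N, D) tokens for one image; cumulative_halt: (N,) halting scores."""
    active = cumulative_halt < 1.0 - eps      # mask of still-active tokens
    return x[active], cumulative_halt[active], active

# toy usage
x = torch.randn(197, 768)                     # ViT-B tokens (CLS + 196 patches)
h = torch.rand(197)                           # fake cumulative halting scores
x_kept, h_kept, mask = drop_halted_tokens(x, h)
print(x_kept.shape)                           # fewer rows => fewer attention FLOPs
```

(I do realize that different images in a batch halt different numbers of tokens, so physical removal like this may only pay off at batch size 1; maybe that is related?)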
BTW, I've tested the model on ImageNet-1K and got the same inference time whether or not I load the weights (the results are listed below). To my understanding, loading the weights should make prediction faster because of adaptive token halting. Could you please tell me why this happens?
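For reference, this is roughly how I measured the time. A minimal sketch with placeholder `model`/`loader` names, not the repo's eval script:

```python
import time
import torch

# Sketch of my timing setup (placeholder names). GPU kernels launch
# asynchronously, so synchronize before reading the clock to capture
# the actual compute time.
@torch.no_grad()
def time_inference(model, loader, device="cuda"):
    model.eval().to(device)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for images, _ in loader:
        _ = model(images.to(device, non_blocking=True))
    if device == "cuda":
        torch.cuda.synchronize()
    return time.perf_counter() - start
```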