NVlabs / A-ViT

Official PyTorch implementation of A-ViT: Adaptive Tokens for Efficient Vision Transformer (CVPR 2022)
Apache License 2.0

A question about inference #1

Closed DYZhang09 closed 1 year ago

DYZhang09 commented 1 year ago

Hi, thanks for your excellent work. I have a question about the inference-stage code. According to the paper, you simply remove the halted tokens from computation at inference time. However, I can't find the corresponding code; the network seems to operate in the same way during training and testing (related code pasted below). https://github.com/NVlabs/A-ViT/blob/6dee7b65f373e484328c32a7384fbcc2b6aeb1d7/timm/models/act_vision_transformer.py#L422-L511 Am I missing something?
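For reference, here is a minimal sketch of what dropping halted tokens at inference could look like. This is my own assumption of the mechanism, not code from this repo; `prune_halted_tokens`, `cum_halt`, and `eps` are hypothetical names. Subsequent attention blocks would then run only on the surviving tokens, which is where the wall-clock savings would come from:

```python
# Sketch only: assumed names, not from the A-ViT repo.
import torch

@torch.no_grad()
def prune_halted_tokens(tokens: torch.Tensor,
                        cum_halt: torch.Tensor,
                        eps: float = 0.01) -> torch.Tensor:
    """Keep only tokens whose cumulative halting score is below 1 - eps.

    tokens:   (1, N, D) token embeddings at the current depth
    cum_halt: (1, N)    cumulative halting score per token

    Assumes batch size 1, since each image may keep a different number
    of tokens; larger batches would need padding or re-batching.
    """
    keep = cum_halt[0] < (1.0 - eps)   # (N,) boolean keep-mask
    return tokens[:, keep, :]          # (1, N_kept, D)
```

If the released code instead multiplies halted tokens by a mask, the tensor shapes (and hence the compute) stay the same at every layer, which would explain unchanged timing.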

BTW, I've tested the model on ImageNet-1K and got the same inference time whether or not the pretrained weights are loaded (results in the screenshots below, followed by the timing sketch I used). To my understanding, the model should run faster with the weights loaded, due to adaptive token halting. Could you please tell me why this happens?

[Screenshots: measured inference results with and without loaded weights]
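For completeness, a minimal sketch of how the forward pass can be timed fairly (assuming a CUDA device; `time_model` is a hypothetical helper, not part of the repo):

```python
import time
import torch

@torch.no_grad()
def time_model(model, images, warmup=10, iters=50):
    """Average wall-clock seconds per forward pass on a CUDA device."""
    model.eval()
    for _ in range(warmup):      # warm up kernels and caches
        model(images)
    torch.cuda.synchronize()     # flush queued GPU work before timing
    start = time.time()
    for _ in range(iters):
        model(images)
    torch.cuda.synchronize()
    return (time.time() - start) / iters
```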

hongxuyin commented 1 year ago

Hi, yes, that snippet has not been released yet. We will enable dynamic zipping, distillation, and the base model in coming versions.