facebookresearch / ToMe

A method to increase the speed and lower the memory footprint of existing vision transformers.

Some doubts about Accuracy #11

Closed HelloWXJ1024 closed 1 year ago

HelloWXJ1024 commented 1 year ago

Thank you for your excellent work!

But when I apply ToMe to DeiT-S without training, I get an accuracy of 78.826%, which is lower than the 79.4% reported in Table 4. Do you know what causes the gap?

anirudth commented 1 year ago

I also observe something similar. When applying this to a fine-tuned ViT-L/16 MAE model, I observe:

Acc@1 85.948 Acc@5 97.560 loss 0.646 (for r=0, 78 seconds runtime)
Acc@1 81.984 Acc@5 96.350 loss 1.037 (for r=8, 57 seconds runtime), without prop attn
Speedup is 1.36x.

However, in the paper, under Table 10(a), the reported numbers are Acc@1 85.66 (for r=0), Acc@1 83.92 (for r=8), and a 1.97x speedup.

Please clarify.

dbolya commented 1 year ago

Hi, thanks for your interest!

But when I apply ToMe to DeiT-S without training, I get an accuracy of 78.826%, which is lower than the 79.4% reported in Table 4. Do you know what causes the gap?

The DeiT number in that table is with training (as marked by the gray color). The off-the-shelf number below it (marked in blue) is an AugReg model (i.e., from timm). If you want to reproduce that number, use the off-the-shelf ViT-S model from timm.
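For reference, here's a minimal sketch of that off-the-shelf setup, following the patching API shown in this repo's README (the `vit_small_patch16_224` AugReg weights and the value of `r` are assumptions on my part; use the setting from the table you're comparing against):

```python
import timm
import tome

# Off-the-shelf AugReg ViT-S from timm (no ToMe training).
model = timm.create_model("vit_small_patch16_224", pretrained=True)

# Patch the model in place so every block merges tokens.
tome.patch.timm(model)

# Merge r tokens per layer; match the r used in the table you're comparing to.
model.r = 16

model.eval()
# ...then evaluate on the ImageNet-1k validation set as usual.
```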

I also observe something similar. When applying this to a fine-tuned ViT-L/16 MAE model, I observe:

It's possible that the MAE patch in the released code isn't correct. Thanks for testing this; I will look into it. The timm implementation is correct, however.

Speedup is 1.36x

As for timing, make sure you're only timing the model itself. If you're timing the entire dataset evaluation, then you're also measuring things like data loading and moving tensors from CPU to GPU, which have nothing to do with the model.
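As a rough sketch (plain PyTorch, not the repo's own benchmark helper), timing only the forward pass on a tensor that already lives on the GPU avoids counting data loading and host-to-device copies:

```python
import time
import torch

@torch.no_grad()
def images_per_second(model, batch_size=64, input_size=(3, 224, 224), runs=50):
    """Measure forward-pass throughput only; no data loading or CPU->GPU copies."""
    model = model.cuda().eval()
    x = torch.randn(batch_size, *input_size, device="cuda")  # input stays on the GPU

    for _ in range(10):            # warm-up iterations
        model(x)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    torch.cuda.synchronize()       # wait for all kernels before stopping the clock
    elapsed = time.perf_counter() - start

    return runs * batch_size / elapsed  # images per second
```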

dbolya commented 1 year ago

There was indeed a feature missing from the MAE code. I have created a separate patch for MAE to address this.

Now when running evaluation for an off-the-shelf MAE model, I get

Acc@1 85.96 (for r=0)
Acc@1 84.22 (for r=8)

This matches Table 1 of the paper (which reports 84.25).

For timing (using utils.benchmark), I get

[r=0] Throughput: 242.96 im/s
[r=8] Throughput: 474.59 im/s
1.95x speed-up
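For anyone trying to reproduce this kind of r=0 vs r=8 throughput comparison, here is a rough sketch using a timm ViT-L as a stand-in for the fine-tuned MAE checkpoint (the `tome.utils.benchmark` call and its defaults are assumptions; check the repo for the exact interface and for the MAE-specific patch mentioned above):

```python
import timm
import tome
from tome.utils import benchmark  # the helper mentioned above; exact signature assumed

# timm ViT-L/16 as a stand-in for the fine-tuned MAE ViT-L/16 checkpoint.
model = timm.create_model("vit_large_patch16_224", pretrained=True).cuda().eval()
tome.patch.timm(model)  # for an actual MAE checkpoint, use the MAE-specific patch instead

for r in (0, 8):
    model.r = r
    throughput = benchmark(model)  # forward-pass throughput in images/second
    print(f"[r={r}] Throughput: {throughput:.2f} im/s")
```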