facebookresearch / ToMe

A method to increase the speed and lower the memory footprint of existing vision transformers.

End-to-end inference isn't accelerated. #39

Open rwshihhh opened 2 months ago

rwshihhh commented 2 months ago

Hi, thanks for your excellent work! I'm quite interested in your approach to speeding up ViT throughput. However, when I run ViT-B end-to-end inference (including data input, preprocessing, and model inference), the processing time is the same with or without ToMe. I even tried different batch_size values to fill the GPU memory, but the results are still the same. Here's the result:

For every test case, I change only the model or batch_size. The other components (data input, preprocessing, and so on) are the same, as are the device and the code.

My question is: why is the "Total Inference Time" of the models with ToMe similar to the baseline (no ToMe)? Doesn't throughput measure the efficiency of model inference? Even though I didn't optimize the code for data input and preprocessing, the "Total Inference Time" should still be smaller than the baseline, because ToMe reduces the time spent in model inference. Did I misunderstand something?

dbolya commented 2 months ago

Unfortunately, ToMe is not magic. It can only speed up the total inference time if that time is bottlenecked by the model. So if you aren't performing enough computation to actually saturate your graphics card, or if the eval has to wait on something else in your pipeline (e.g., dataloading), then no model-based method can speed up your pipeline.

That being said, have you tried checking if ToMe improves inference speed if you just time inference, not the whole pipeline? As a sanity check.

If ToMe properly reduces that time, then your pipeline is just constantly waiting on the dataloader. It doesn't matter how fast the model is (ViT-Ti, ResNet-50, or whatever): you'd get the same overall time because the dataloader can't load images fast enough.
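The bottleneck argument can be made concrete with a toy model. The sketch below assumes a pipeline in which data loading and model inference overlap perfectly (for example, background loader workers preparing the next batch), so each batch costs as much as the slower of the two stages. The function name `epoch_time` and all timings are made up for illustration; they are not measurements from this thread.

```python
def epoch_time(num_batches, load_s, model_s):
    """Estimated wall time when loading and inference overlap perfectly.

    Assumption: the loader prepares the next batch while the GPU works on
    the current one, so each step costs max(load_s, model_s) seconds.
    """
    return num_batches * max(load_s, model_s)


# Hypothetical per-batch timings in seconds (not measured values).
# Loader-bound: halving the model time changes nothing.
print(epoch_time(100, load_s=0.5, model_s=0.25))      # 50.0
print(epoch_time(100, load_s=0.5, model_s=0.125))     # 50.0

# Model-bound: the same model speedup now halves the total.
print(epoch_time(100, load_s=0.0625, model_s=0.25))   # 25.0
print(epoch_time(100, load_s=0.0625, model_s=0.125))  # 12.5
```

Real pipelines overlap less cleanly than this, but the conclusion is the same: a faster model only helps once the model is the slower stage.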

rwshihhh commented 2 months ago

Thank you for your suggestions. I first validated that ToMe was installed correctly by using the examples/1_benchmark_timm.ipynb you provided, and I was able to measure the improvement in throughput.

Back to the previous issue: I have broken down the complete E2E inference into three parts:

  Part 1. Load data from disk to DRAM to GPU (including data preprocessing) -> Variable in the code: count_load_whole

  Part 2. Model inference, e.g., the code's model(input) -> Variable in the code: count_model

  Part 3. Remaining parts, e.g., calculating the labels (inference accuracy) -> Variable in the code: count_label_cal

I have the following questions:

  1. When you mention that ToMe improves inference speed, are you referring to throughput (images/sec)? If so, does it relate to Part 2 of my E2E inference code? If that's the case, why does using ToMe make Part 2 take longer? For instance, when r=0, Part 2 takes 7.4 seconds, but when r=13, it takes 12.4 seconds.

  2. Since ToMe works by inserting a bipartite matching step between ViT's attention and MLP, shouldn't it affect only the model's architecture and its processing time? If so, why does using ToMe also change the processing time of Part 1 and Part 3? For example, a higher r value reduces the time consumed by Part 3 and increases Part 1's.

dbolya commented 2 months ago

I think you're misunderstanding how CUDA calls work. Most CUDA calls are asynchronous and thus return immediately, meaning that timing the call itself is not the right thing to do. To force the code to wait for CUDA operations to complete, you should call torch.cuda.synchronize() every time before you sample the current time.
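A minimal sketch of that timing pattern follows. The helpers `cuda_sync` and `timed` are hypothetical names introduced here and are not part of ToMe; the only PyTorch calls used are `torch.cuda.is_available()` and `torch.cuda.synchronize()`. The helper falls back to a no-op when PyTorch or a GPU is missing, which is an assumption made so the snippet runs anywhere.

```python
import time
from contextlib import contextmanager


def cuda_sync():
    """Wait for all queued CUDA work to finish (no-op without PyTorch or a GPU)."""
    try:
        import torch
    except ImportError:
        return
    if torch.cuda.is_available():
        torch.cuda.synchronize()


@contextmanager
def timed(totals, key):
    """Add the wall time of the enclosed block to totals[key]."""
    cuda_sync()  # drain earlier async kernels so they are not billed to this block
    start = time.perf_counter()
    try:
        yield
    finally:
        cuda_sync()  # make sure this block's own kernels have actually finished
        totals[key] = totals.get(key, 0.0) + (time.perf_counter() - start)


# Usage (model and images are placeholders for your own objects):
#
# totals = {}
# with timed(totals, "count_model"):
#     logits = model(images)
# print(totals["count_model"])
```

Without the synchronization, the cost of a kernel launched in one section tends to show up in whichever later section first blocks on the GPU (for example, a `.cpu()` or `.item()` call), which can make the per-part numbers shift between sections.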