Why is the compressed model using TT slower than the non-compressed model?

miladdona commented 2 years ago

I am trying to factorize the LeNet300 model, which consists of only 3 FC layers: (784x300), (300x100), and (100x10). I have factorized only the first layer, with shape 784x300, using t3f. After fine-tuning I get good results in terms of accuracy. This also compressed the model from 266610 params to 49170 params (about 81% compression). But the results are not good when I measure execution time. The execution time for 10 prediction runs over the test data (10000 images) is as follows:

baseline model (without factorization) = 5.51 s

factorized model = 5.57 s

The factorization configuration is: 784x300 ----> [[2, 392], [20, 15]] with max_tt_rank = 3

Meanwhile, the baseline model needs 532810 FLOPs and the factorized model needs 116486 FLOPs (about a 78% decrease). I should mention that I calculate the FLOPs for the factorized layer using this link from you: https://colab.research.google.com/drive/16S_SUbIjhnQBFj_r7sCpbwZNHADIzEwX?usp=sharing

To calculate the FLOPs for a non-factorized layer I use this formula: 2 * (input_dim * output_dim) + output_dim
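As a sanity check, the baseline figures quoted above can be reproduced from the layer shapes alone. The sketch below is plain Python with illustrative helper names (`dense_params`, `dense_flops` are not part of t3f); it assumes each FC layer has a bias vector, which is what the formula's `+ output_dim` term implies.

```python
# Parameter and FLOP counts for the dense LeNet300 baseline.
# Layer shapes come from the post; helper names are illustrative only.
LAYERS = [(784, 300), (300, 100), (100, 10)]

def dense_params(input_dim, output_dim):
    # Weight matrix plus bias vector (bias is an assumption consistent
    # with the parameter total quoted in the post).
    return input_dim * output_dim + output_dim

def dense_flops(input_dim, output_dim):
    # One multiply and one add per weight, plus one add per bias entry.
    return 2 * (input_dim * output_dim) + output_dim

total_params = sum(dense_params(i, o) for i, o in LAYERS)
total_flops = sum(dense_flops(i, o) for i, o in LAYERS)
print(total_params, total_flops)  # 266610 532810
```

Both totals match the numbers reported for the baseline model.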

Why do we get worse results than the baseline even though the number of FLOPs decreases?

Bihaqo commented 2 years ago

Hi, by "worse results" you mean good accuracy but bad inference speed, right? I would suggest trying a more balanced factorization, e.g. [[28, 28], [20, 15]]. Also, the bigger the initial layer, the bigger the gains (both in terms of compression and speed), so you might want to start with a bigger network to get better improvements.
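To illustrate how the choice of factorization changes the size of the TT layer, here is a rough parameter count for the two configurations mentioned in this thread. It is a sketch, not t3f code: `tt_matrix_params` is a hypothetical helper, and it assumes every interior TT-rank equals `max_tt_rank` (reasonable for these two-core shapes, where rank 3 is far below the largest possible rank) and that the layer keeps a bias of size 300.

```python
def tt_matrix_params(row_dims, col_dims, max_tt_rank):
    """Number of entries in the TT-cores of a TT-matrix.

    Core k has shape (r_k, row_dims[k], col_dims[k], r_{k+1}).
    Assumption: all interior ranks equal max_tt_rank; boundary ranks are 1.
    """
    d = len(row_dims)
    ranks = [1] + [max_tt_rank] * (d - 1) + [1]
    return sum(
        ranks[k] * row_dims[k] * col_dims[k] * ranks[k + 1] for k in range(d)
    )

# Configuration from the original post: 120 + 17640 core entries.
unbalanced = tt_matrix_params([2, 392], [20, 15], max_tt_rank=3)
# Balanced configuration suggested above: 1680 + 1260 core entries.
balanced = tt_matrix_params([28, 28], [20, 15], max_tt_rank=3)

# Remaining parameters: first-layer bias plus the two dense layers.
rest = 300 + (300 * 100 + 100) + (100 * 10 + 10)

print(unbalanced, unbalanced + rest)  # 17760 49170
print(balanced, balanced + rest)      # 2940 34350
```

Under these assumptions the unbalanced configuration reproduces the 49170 parameters reported in the question, while the balanced one stores about six times fewer core entries at the same rank.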

And finally, I wouldn't expect TT to be amazing in terms of reducing running time (unless applied to gigantic layers). GPUs (and TPUs) are so good at multiplying big matrices, that when you do something smart to speed it up, you usually reduce FLOPs by a lot, but don't reduce running time that much. Saving memory is usually much easier.

miladdona commented 2 years ago

Yes, you are right. In general, the accuracy is acceptable, and we can improve it further with fine-tuning. But the inference time is not good: execution time is sometimes 2x the baseline! I have tested balanced configurations and did not get good results. I have also tested on a big layer with shape 4096x4096; the result is better but still not acceptable. You are right about GPUs and TPUs, but in the first place I am trying to run on CPUs. Thanks anyway!
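For CPU comparisons like the one described here, a small timing harness with warm-up runs and a median over repeats may help separate the cost of the layer itself from one-off overheads. This is a minimal sketch using only the standard library; `time_predict` and `dummy_predict` are hypothetical names, and the dummy workload is a stand-in to be replaced by the real baseline and factorized prediction calls.

```python
import statistics
import time

def time_predict(predict_fn, batch, warmup=2, repeats=10):
    """Return the median wall-clock time of predict_fn(batch), in seconds."""
    # Warm-up calls are excluded so one-time setup costs do not skew timings.
    for _ in range(warmup):
        predict_fn(batch)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict_fn(batch)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# Stand-in workload so the sketch runs on its own (an assumption, not the
# models from this thread); swap in the real prediction functions.
def dummy_predict(batch):
    return [sum(row) for row in batch]

batch = [[0.5] * 784 for _ in range(100)]
median_s = time_predict(dummy_predict, batch)
print(f"median time per call: {median_s * 1e3:.3f} ms")
```

Timing both models with the same batch, the same number of repeats, and the same thread settings keeps the comparison like for like.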

Bihaqo / t3f

Why is the compressed model using TT slower than the non-compressed model? #222