Dan8991 / SMCEPR_pytorch

Implementation of the paper "Scalable Model Compression by Entropy Penalized Reparameterization" in PyTorch

Questions about AffineDecoder #1

Open James89045 opened 5 months ago

James89045 commented 5 months ago

Thanks for your implementation! In my experiment I tried to replace the value of l with the output channel number, but I got a shape error, and I'm wondering which part I did wrong. The screenshot below shows the modified part. Thank you!

[screenshot of the modified code]

Dan8991 commented 5 months ago

First of all, thank you for using this code, and sorry: unfortunately I haven't had time to keep working on it, so the code is a bit unorganized and not well commented. From what I recall, the variable l is supposed to be used only with convolutional layers, and I am not sure if I fully implemented support for those, since I was using this code only with linear layers. If you want to make it work on the output dimension for linear layers, you need to modify the EntropyLinear class and probably also the EntropyLayer class. I will look into this a bit more as soon as I have time. By the way, if you improve the code in the repo or implement something useful, feel free to open a pull request.

Dan8991 commented 5 months ago

I had a look at the paper and I don't think it makes much sense to have l = out_channels for linear layers, since the decoder would have parameters of size l x l for the affine transform weight and l x 1 for the bias. Since you need to transmit these values and they are not optimized by the entropy model, you will likely end up with a higher rate than the one required to simply transmit the original network, especially if out_channels > in_channels. As a general rule of thumb, make sure that the number of affine decoder parameters is much smaller than the number of parameters in the original layer.
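To make the counting argument concrete, here is a rough back-of-the-envelope check (not code from the repo; the layer sizes are hypothetical LeNet-like values):

```python
# Rough parameter-count check for the rule of thumb above: the affine
# decoder adds l*l weights plus l biases, which are transmitted without
# being entropy coded, so they should stay much smaller than the layer itself.
def decoder_overhead_ratio(in_features, out_features, l):
    layer_params = in_features * out_features  # original weight matrix
    decoder_params = l * l + l                 # affine weight + bias
    return decoder_params / layer_params

# Hypothetical LeNet-like last layer (84 -> 10):
print(decoder_overhead_ratio(84, 10, l=10))  # ~0.13: l = out_channels is costly
print(decoder_overhead_ratio(84, 10, l=1))   # ~0.002: negligible overhead
```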

James89045 commented 5 months ago

Thank you for your response. I'm truly grateful for your contribution to the implementation of this paper, as I have a similar component to implement in my research, and having your code as a reference is very helpful. Additionally, using l as the output channel number might be a misunderstanding on my part of the section in your README (shown in the picture below). So, basically, should I just set l to 1 in all cases?

[screenshot of the README section]

Dan8991 commented 5 months ago

Thanks for pointing this out, I will update the README accordingly. From my understanding, you should use l=1 for linear layers and l=HxW for convolutional layers. As I mentioned before, take care when using my implementation of the convolutional layers, because I didn't test it much and, if I remember correctly, it wasn't working as expected.
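A minimal sketch of that rule (not the repo's actual API, and assuming H and W here mean the kernel height and width of the convolution):

```python
import torch.nn as nn

def choose_l(layer):
    """Pick l per layer type: 1 for linear layers, kernel H*W for conv layers."""
    if isinstance(layer, nn.Linear):
        return 1
    if isinstance(layer, nn.Conv2d):
        kh, kw = layer.kernel_size
        return kh * kw
    raise ValueError(f"unsupported layer type: {type(layer).__name__}")

print(choose_l(nn.Linear(84, 10)))    # 1
print(choose_l(nn.Conv2d(1, 6, 5)))   # 25
```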

James89045 commented 5 months ago

Thank you for the reminder! By the way, I've tried EntropyLeNet and a normal LeNet, and the results showed that the normal LeNet seems to converge faster than EntropyLeNet. Is that normal? Thank you very much!

Dan8991 commented 5 months ago

I think it is safe to assume that EntropyLeNet takes longer to converge since, on top of the classification performance, it also needs to optimize the rate of the parameters.
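For intuition, here is a hedged sketch of the kind of objective such a model minimizes; the names (likelihoods, lmbda) are illustrative and not necessarily the ones used in this repo:

```python
import torch
import torch.nn.functional as F

def entropy_penalized_loss(logits, targets, likelihoods, lmbda=1e-4):
    # Standard classification loss ...
    task_loss = F.cross_entropy(logits, targets)
    # ... plus the rate (in bits) of the reparametrized weights under the
    # learned prior, which is the extra term that slows convergence
    # relative to a plain LeNet trained only on cross-entropy.
    rate_bits = -torch.log2(likelihoods).sum()
    return task_loss + lmbda * rate_bits
```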

James89045 commented 5 months ago

I also discovered that a regular LeNet can easily reach a test accuracy of over 0.97, while the EntropyLeNet only achieves about 0.966. I just want to confirm whether this is normal, since optimizing the parameter rate during training might lead to a decrease in accuracy. Thank you very much!

Dan8991 commented 5 months ago

Yes, it is normal: you are regularizing the network to make it compressible, so the classification performance drops a little. The same effect can be seen in the original paper.

James89045 commented 4 months ago

Hello! Your code has really helped my research a lot, and recently I have one more question, about the use of the entropy bottleneck. May I ask why you initialize CustomEntropyBottleneck(channel=1) in EntropyLinear but CustomEntropyBottleneck(channel=H*W) in EntropyConv2d? Thanks a lot!

Dan8991 commented 4 months ago

The entropy bottleneck implements a factorized prior model. This means that the channel parameter defines how many probability distributions are learned by the entropy bottleneck to entropy code the data. Of course, if you learn more probability distributions you can model your data better; however, the parameters of these distributions need to be transmitted to the decoder, so there is a tradeoff. In the original paper the authors use only one probability model for the linear layers (this is why I use channel=1). On the other hand, for convolutional networks the authors suggested learning in_channels*out_channels probability models; however, that leads to a very large number of probability models, because in_channels and out_channels are generally pretty big. So I assumed they had a typo and were actually referring to the height and width, which gives a more manageable number of distributions. However, as I mentioned before, I never finished fixing the convolutional entropy layer, so the implementation is probably not correct. I did not have the time to properly test all the possibilities and find the best one.
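As a concrete illustration, here is a small sketch based on compressai's EntropyBottleneck (which I assume is what CustomEntropyBottleneck wraps); the tensor shapes are made up for the example:

```python
import torch
from compressai.entropy_models import EntropyBottleneck

# channels = number of learned univariate priors (one per channel).
eb_linear = EntropyBottleneck(1)   # single prior shared by a linear layer's weights
eb_conv = EntropyBottleneck(25)    # one prior per position of a 5x5 kernel (H*W = 25)

# The bottleneck expects (N, channels, ...) input and returns the quantized
# (or noise-perturbed, during training) values plus their likelihoods.
phi = torch.randn(1, 25, 6, 16)    # hypothetical: 6 in-channels, 16 out-channels
phi_hat, likelihoods = eb_conv(phi)
rate_bits = -torch.log2(likelihoods).sum()
```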

James89045 commented 4 months ago

OK, I got it! Thank you very much. I'll try in_channels * out_channels, and if the result is better, I'll share it here.