desternylin / perfed


I have some questions about the generation of matrix P. #2

Open · 1853582 opened 11 months ago

1853582 commented 11 months ago

Hello, while testing your code I found something about the projection matrix P used for subspace training: for example, when I tried to compress a 10000-dimensional model down to 100 dimensions, the shape of P is (100, 10000), P P^T is approximately equal to E, but P^T P is not equal to E, which seems to make it difficult to use for model compression. Or is there a problem with the way I reproduced it?
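For reference, here is a minimal reproduction sketch of what I observed. It assumes P is a scaled Gaussian random projection, which is my guess at the construction; the repo may generate P differently:

```python
import numpy as np

d, k = 10000, 100  # full model dimension, subspace dimension
rng = np.random.default_rng(0)

# Assumed construction: scaled Gaussian random projection (the repo may differ).
P = rng.standard_normal((k, d)) / np.sqrt(d)   # shape (100, 10000)

# P P^T is small (100 x 100) and close to the identity E.
print(np.allclose(P @ P.T, np.eye(k), atol=0.1))   # True

# P^T P is 10000 x 10000; its diagonal alone shows it is far from E:
# the entries concentrate around k/d = 0.01, not around 1.
print(np.sum(P * P, axis=0)[:5])
```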

desternylin commented 11 months ago

I think this phenomenon is normal. By Proposition 7 in the appendix of our paper, all singular values of the projection matrix P are around 1, which makes it natural for P P^T to be approximately equal to E. On the other hand, since P^T P has the same rank as P P^T, the former matrix has dimension 10000 × 10000 while its rank is at most 100 (and probably exactly 100), so it cannot be equal to E. I don't quite understand your point about the difficulty for model compression; maybe you can elaborate on it?
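Both claims are easy to check numerically; here is a quick sketch, again using a Gaussian P as a stand-in for the actual construction:

```python
import numpy as np

d, k = 10000, 100
rng = np.random.default_rng(0)
P = rng.standard_normal((k, d)) / np.sqrt(d)  # stand-in for the actual P

s = np.linalg.svd(P, compute_uv=False)  # the k singular values of P
print(s.min(), s.max())  # both close to 1 (roughly 1 -/+ sqrt(k/d))

# rank(P^T P) = rank(P) = number of nonzero singular values <= k = 100,
# while P^T P is d x d, so it cannot equal the d x d identity E.
print(np.sum(s > 1e-10))  # 100
```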

1853582 commented 11 months ago

Thank you for your answer. Yes, and precisely because of this, in my opinion this method of model compression is irreversible: since W = PX, X is difficult to recover from W. So a model trained with this subspace training method cannot be restored to the original model size? Also, can you analyze why this compression method achieves a higher compression ratio than compression methods such as sketching?
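To make the irreversibility concrete, here is a small sketch (again with a Gaussian P as a stand-in): the minimum-norm solution of W = PX only recovers the component of X lying in the 100-dimensional row space of P, so almost all of a generic 10000-dimensional X is lost:

```python
import numpy as np

d, k = 10000, 100
rng = np.random.default_rng(0)
P = rng.standard_normal((k, d)) / np.sqrt(d)  # stand-in projection

x = rng.standard_normal(d)  # original full-dimensional model
w = P @ x                   # compressed 100-dimensional representation

# Best recovery from w alone: the minimum-norm least-squares solution,
# i.e., the projection of x onto the 100-dim row space of P.
x_hat = np.linalg.lstsq(P, w, rcond=None)[0]

print(np.linalg.norm(x - x_hat) / np.linalg.norm(x))  # close to 1: mostly lost
```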

desternylin commented 11 months ago

In our paper, we do not expect to recover the full model from the compressed model. In our algorithm, the subspace can be viewed as an auxiliary tool for training: the model kept by the central server serves as a reference point for the clients, and we use it for regularization in the local training phase, so the regularization term in the local objective function has the form f(w - Px).

We currently have no theoretical analysis comparing the compression ratio of our method with that of sketching-based algorithms. Intuitively, our proposed algorithm is a personalized FL algorithm: every client has its own model, different from the others', and the similarities among the personalized models are captured by the global model in the projected subspace. However, even in the subspace, the projected personalized model of each client, i.e., Px, need not be exactly the same. Sketching-based algorithms in the literature (as far as we know), on the other hand, are not personalized and aim to output a single global full model, which requires more information to adapt to the heterogeneity across clients.
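For intuition, here is one possible reading of that local objective as a minimal sketch (my own illustrative code, not taken from the repo): w is the server's reference point in the 100-dimensional subspace, x is the client's full personalized model, and f is a squared-norm penalty, so each client minimizes local_loss(x) + (lam/2) * ||w - Px||^2:

```python
import numpy as np

d, k, lam = 10000, 100, 0.1
rng = np.random.default_rng(0)
P = rng.standard_normal((k, d)) / np.sqrt(d)  # stand-in projection

def local_objective(x, w, local_loss):
    """Hypothetical local objective: local loss plus f(w - Px), f = squared norm."""
    r = w - P @ x
    return local_loss(x) + 0.5 * lam * (r @ r)

def local_gradient(x, w, local_grad):
    """Gradient of the objective above with respect to the personalized model x."""
    return local_grad(x) - lam * (P.T @ (w - P @ x))

# Toy local task: a quadratic loss centered at some data-dependent point x_star.
x_star = rng.standard_normal(d)
loss = lambda x: 0.5 * np.sum((x - x_star) ** 2)
grad = lambda x: x - x_star

x = np.zeros(d)             # client's personalized model (full dimension)
w = rng.standard_normal(k)  # server's reference point (subspace dimension)
for _ in range(100):        # plain gradient descent on the local objective
    x -= 0.5 * local_gradient(x, w, grad)
print(local_objective(x, w, loss))
```

Under this reading, the regularizer only pulls the projection Px of the personalized model toward w, leaving the (d - k)-dimensional complement free to fit the local data, which is where the personalization comes from.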