AIoT-MLSys-Lab / FedRolex

[NeurIPS 2022] "FedRolex: Model-Heterogeneous Federated Learning with Rolling Sub-Model Extraction" by Samiul Alam, Luyang Liu, Ming Yan, and Mi Zhang
Apache License 2.0

Some questions about the 'overlap' and aggregation code #3

Closed · Sherrylife closed this issue 1 year ago

Sherrylife commented 1 year ago

Hi @samiul272, I am reading your code but unfortunately I'm confused about the following code. [screenshot of the sub-model extraction code] Suppose we have 2 clients with $\beta_1=\beta_2=1$, and suppose there is a layer with only 10 neurons, i.e. $K_i=10$. Following Appendix A.4 of your paper and the code in the screenshot above, I drew the situations for overlap=0.2 and overlap=1.0 in communication round $j$ in the picture below. Can you tell me if my understanding is right? If so, what does overlap mean? [diagram of the two overlap settings]

Why do we need this 'overlap' experiment? Besides, why do we shuffle the original order of the neurons (as shown below, lines 51-52)? [screenshot of the shuffling code]

samiul272 commented 1 year ago

Hi, the overlap was something we tried to see if it gives better results. It controls how far the models roll each round: as far as I remember, if overlap is 1.0 the model rolls by 1 neuron, and if it is 0 the model rolls by the full model width. It has no effect on full-sized models; it only influences the half-sized or quarter-sized models, etc. However, we did not find any significant change in accuracy when we varied the overlap. As for why we reshuffle the parameters: mathematically it should lead to better convergence, but again we saw no marked improvement in convergence speed after reshuffling, and in fact accuracy degrades. You should get better results without reshuffling.
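Roughly, the rolling rule with overlap looks like this (a simplified sketch, not the exact extraction code in the repo; the function name and the step interpolation between the two endpoints are illustrative):

```python
import numpy as np

def rolling_indices(round_idx, model_rate, layer_width, overlap):
    """Indices of the neurons a client receives in a given round.

    Illustrative sketch: `model_rate` is the client capacity beta_m,
    `layer_width` is K_i, and `overlap` in [0, 1] shrinks the per-round
    roll step. Only the endpoints are fixed by the description above:
    overlap = 1.0 rolls by one neuron, overlap = 0 by the full width.
    """
    sub_width = int(np.ceil(model_rate * layer_width))
    step = max(1, int(round(sub_width * (1 - overlap))))
    start = (round_idx * step) % layer_width
    # A contiguous window that wraps around the layer boundary.
    return np.arange(start, start + sub_width) % layer_width

# e.g. a half-width client on a 10-neuron layer with overlap = 1.0:
# round 0 -> [0..4], round 1 -> [1..5], round 2 -> [2..6], ...
print(rolling_indices(round_idx=2, model_rate=0.5, layer_width=10, overlap=1.0))
```

For a full-sized client (`model_rate = 1.0`) the window always covers every neuron, which is why the overlap has no effect there.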

Sherrylife commented 1 year ago

Thank you for your reply. After working through the calculation, my understanding is that 'overlap' controls the step size with which the rolling window slides.

Sherrylife commented 1 year ago

Besides, in the following code (resnet_server.py, lines 200-203), can you tell me why, when aggregating the BN layers, we multiply the coefficient in front of the local model local_parameters[m][k] by self.tmp_counts[k][param_idx[m][k]] (whose value is not always 1)? Why can't we just multiply by 1? [screenshot of the aggregation code]

Sherrylife commented 1 year ago

Another question that bothers me: why, in the final linear-layer aggregation, do we only aggregate clients with the same label (resnet_server.py, lines 210-211)? As shown in the red circle in the figure below, why do we need to add [label_split]? I couldn't find a description of it in the original paper, and the aggregation operation in your code seems to be a little different from Eq. (10). Can you explain it? [screenshot of the linear-layer aggregation code]

samiul272 commented 1 year ago

> Besides, in the following code (resnet_server.py, lines 200-203), can you tell me why, when aggregating the BN layers, we multiply the coefficient in front of the local model local_parameters[m][k] by self.tmp_counts[k][param_idx[m][k]] (whose value is not always 1)? Why can't we just multiply by 1?

I think this is to account for the number of updates. When aggregating, we take a weighted average based on the number of updates each parameter receives in a round; that is the $p_m$ you see in the paper. Not all parameters are updated equally, since different clients train different sub-models.
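In pseudocode, the count-weighted average looks roughly like this (a simplified sketch, not the exact resnet_server.py code; names and shapes are illustrative):

```python
import torch

def weighted_aggregate(global_param, client_updates):
    """Average sub-model updates, weighting each entry by its update
    count (the role played by p_m / tmp_counts in the paper and code).

    client_updates: list of (indices, values, weight) per client, where
    `indices` are the global positions the client trained.
    """
    accum = torch.zeros_like(global_param)
    total = torch.zeros_like(global_param)
    for idx, val, w in client_updates:
        accum[idx] += w * val   # contribution weighted by update count
        total[idx] += w
    touched = total > 0
    # Entries no client trained this round keep their previous value.
    global_param[touched] = accum[touched] / total[touched]
    return global_param
```

A parameter trained by many clients (or carrying a large accumulated count) therefore pulls the average harder than one trained rarely.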

> Another question that bothers me: why, in the final linear-layer aggregation, do we only aggregate clients with the same label (resnet_server.py, lines 210-211)? As shown in the red circle in the figure below, why do we need to add [label_split]? I couldn't find a description of it in the original paper, and the aggregation operation in your code seems to be a little different from Eq. (10). Can you explain it?

Here label_split is the set of classes present in the client's dataset, so the linear layer only carries the parameters for those classes for that client.
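Roughly, the indexing amounts to the following (a simplified sketch with illustrative names, not the exact code):

```python
import torch

def aggregate_output_layer(global_w, counts, client_w, col_idx, label_split):
    """Accumulate a client's classifier rows, but only for the classes
    in label_split, i.e. those present in its local data.

    Illustrative shapes: global_w is (num_classes, hidden), client_w has
    the same class dimension, and col_idx are the hidden units it trained.
    """
    for cls in label_split:
        # Rows of absent classes are skipped, so a client never touches
        # the output weights for classes it has never seen.
        global_w[cls, col_idx] += client_w[cls]
        counts[cls, col_idx] += 1
    return global_w, counts
```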

Sherrylife commented 1 year ago

> Here label_split is the set of classes present in the client's dataset, so the linear layer only carries the parameters for those classes for that client.

Ok, I compared your code to HeteroFL, and I would interpret this as a special trick (named Masking CrossEntropy in HeteroFL) to improve the stability of model training, but I honestly wouldn't recommend it.
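For readers who haven't seen it: the trick amounts to masking the logits of locally absent classes before computing the loss, roughly like this (a sketch of the idea, not HeteroFL's exact implementation):

```python
import torch
import torch.nn.functional as F

def masked_cross_entropy(logits, targets, label_split):
    """Cross-entropy restricted to the classes a client actually holds.

    Logits of absent classes are pushed to a large negative value so
    they receive ~zero probability and ~zero gradient.
    """
    mask = torch.zeros(logits.size(-1), dtype=torch.bool, device=logits.device)
    mask[list(label_split)] = True
    masked_logits = logits.masked_fill(~mask, -1e9)
    return F.cross_entropy(masked_logits, targets)
```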

> I think this is to account for the number of updates. When aggregating, we take a weighted average based on the number of updates each parameter receives in a round; that is the $p_m$ you see in the paper. Not all parameters are updated equally, since different clients train different sub-models.

In your code, self.tmp_counts records the cumulative number of times each neuron of the largest model has been used across all communication rounds, while count records only the number of times each neuron is used in the current round. What puzzles me is the mismatch: self.tmp_counts is used as the weighting coefficient for the BN layers, while K is used as the weighting coefficient for the convolutional layers. HeteroFL, on the other hand, uses the same weighting coefficients everywhere (all 1). I don't know whether this particular trick contributes to your algorithm's advantage, but I still think the aggregation part of your code is not consistent with the description in the original paper (Eq. (10)).
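To make the distinction concrete, here is a toy illustration of the two counters as I understand them (my reading of the issue, not the repository's actual bookkeeping; the client index sets are fixed across rounds here only for simplicity):

```python
import torch

global_width = 10
# Two clients that both happen to train neuron 2 every round.
client_indices = [torch.tensor([0, 1, 2]), torch.tensor([2, 3, 4])]

tmp_counts = torch.zeros(global_width)    # cumulative over all rounds
for rnd in range(3):
    count = torch.zeros(global_width)     # per-round usage, reset each time
    for idx in client_indices:
        count[idx] += 1
    tmp_counts += count
# After 3 rounds: count[2] == 2 in the last round, but tmp_counts[2] == 6.
```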

All in all, I have learned a lot from your work, thank you!