SMILELab-FL / FedLab

A flexible Federated Learning Framework based on PyTorch, simplifying your Federated Learning research.
https://fedlab.readthedocs.io
Apache License 2.0

[Feature Proposal] Distributed Training for FL #330

Open slyviacassell opened 1 year ago

slyviacassell commented 1 year ago

As the title describes: does the standalone mode support multiple GPUs to speed up training?

dunzeng commented 1 year ago

We don't provide multi-GPU support in the standalone mode. However, you can use PyTorch's DataParallel (DP) module inside the train function of SGDSerialClientTrainer.
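For reference, here is a minimal sketch of that idea, assuming the trainer exposes set_model, model_parameters, optimizer, criterion, and epochs the way recent FedLab versions do; the import path and attribute names may differ in your installed version, so treat this as an outline rather than a drop-in implementation:

```python
import torch.nn as nn

from fedlab.contrib.algorithm.basic_client import SGDSerialClientTrainer


class DPSerialClientTrainer(SGDSerialClientTrainer):
    """Serial client trainer whose local training step uses nn.DataParallel."""

    def train(self, model_parameters, train_loader):
        # load the latest global parameters into the local model
        self.set_model(model_parameters)

        # replicate the local model across all visible GPUs;
        # self._model is assumed to already live on cuda:0
        dp_model = nn.DataParallel(self._model)
        dp_model.train()

        for _ in range(self.epochs):
            for data, target in train_loader:
                data, target = data.cuda(), target.cuda()

                output = dp_model(data)
                loss = self.criterion(output, target)

                self.optimizer.zero_grad()
                loss.backward()
                self.optimizer.step()

        # gradients are reduced onto the original module, so the optimizer
        # (built over self._model.parameters()) updates the local model in place
        return [self.model_parameters]
```

Since the optimizer is constructed over the original module's parameters, wrapping with DataParallel only changes how the forward/backward passes are split across GPUs; the serial client loop itself is unchanged.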

slyviacassell commented 1 year ago

We define the following variables to further illustrate the idea:

  • K: the number of clients participating in training each round
  • N: the number of available GPUs

When K == N, each selected client is allocated its own GPU for training.

When K > N, multiple clients are allocated to the same GPU and execute their training sequentially on it.

When K < N, you can simply use fewer GPUs for training.

We need to set the number of GPUs in gpu and the specific distributed settings in the distributed configs.

The implementation is still in progress. Would anybody like to help?
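Since the mapping above is mostly bookkeeping, a rough, framework-agnostic sketch of one round could look as follows; each GPU worker process trains its assigned clients sequentially through a caller-supplied train_fn, and none of the names here (assign_clients_to_gpus, parallel_round, train_fn) are part of the current FedLab API:

```python
import torch
import torch.multiprocessing as mp


def assign_clients_to_gpus(selected_clients, num_gpus):
    """Round-robin mapping: client i -> GPU (i % num_gpus).

    Covers all three cases above: K == N gives one client per GPU,
    K > N makes several clients share a GPU (they run sequentially),
    K < N simply leaves some GPUs idle.
    """
    buckets = [[] for _ in range(num_gpus)]
    for i, cid in enumerate(selected_clients):
        buckets[i % num_gpus].append(cid)
    return buckets


def _gpu_worker(gpu_id, client_ids, global_parameters, train_fn, result_queue):
    """Train every client assigned to this GPU, one after another."""
    device = torch.device(f"cuda:{gpu_id}")
    for cid in client_ids:
        # train_fn(cid, global_parameters, device) is supplied by the caller:
        # it should build the local model/dataset for client `cid`, load the
        # global parameters, run local training on `device`, and return the
        # updated parameters (moved back to CPU).
        result_queue.put((cid, train_fn(cid, global_parameters, device)))


def parallel_round(selected_clients, global_parameters, num_gpus, train_fn):
    """Run one FL round with the selected clients spread over num_gpus GPUs."""
    buckets = assign_clients_to_gpus(selected_clients, num_gpus)
    ctx = mp.get_context("spawn")  # spawn is required for CUDA in subprocesses
    result_queue = ctx.Queue()

    workers = []
    for gpu_id, client_ids in enumerate(buckets):
        if not client_ids:  # K < N: this GPU has nothing to do
            continue
        p = ctx.Process(target=_gpu_worker,
                        args=(gpu_id, client_ids, global_parameters,
                              train_fn, result_queue))
        p.start()
        workers.append(p)

    # drain results before joining so workers never block on a full queue
    results = [result_queue.get() for _ in range(len(selected_clients))]
    for p in workers:
        p.join()
    return results
```

The returned (client_id, parameters) pairs could then be fed to the usual server-side aggregation step; how the gpu count and distributed settings are exposed in the configs is exactly the open design question of this proposal.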

QiTianyu-0403 commented 1 month ago

We define the following variables to further illustrate the idea:

  • K: the number of clients participating in training each round
  • N: the number of available GPUs

When K == N, each selected client is allocated its own GPU for training.

When K > N, multiple clients are allocated to the same GPU and execute their training sequentially on it.

When K < N, you can simply use fewer GPUs for training.

We need to set the number of GPUs in gpu and the specific distributed settings in the distributed configs.

The implementation is still in progress. Would anybody like to help?

I'm very interested in the functionality you mentioned. Is there any available code that implements it?