FedML-AI / FedML

FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, TensorOpera AI (https://TensorOpera.ai) is your generative AI platform at scale.
Apache License 2.0

NCCL Simulation #519

Closed Mirian-Hipolito closed 2 years ago

Mirian-Hipolito commented 2 years ago

Hey there! I was trying to run a simulation using the NCCL FedAvg example available in version 0.7.300. However, I cloned the repo again to get the latest changes, and now I don't see any NCCL example anymore. Am I looking in the wrong place? If so, could you please point me to these scripts?

Thank you! :)

chaoyanghe commented 2 years ago

@Mirian-Hipolito We've removed this backend in our latest iteration. Please refer to the other backend examples at https://github.com/FedML-AI/FedML/tree/master/python/examples

Mirian-Hipolito commented 2 years ago

Hi @chaoyanghe,

I decided to run the `fedemnist_cnn` simulation example (https://github.com/FedML-AI/FedML/blob/master/python/examples/simulation/mpi_fedavg_datasets_and_models_example/config/fedemnist_cnn/fedml_config.yaml) using the configuration below on 4 GPUs with MPI (which, according to the documentation, should be the fastest). However, it took around 10 hours to finish the experiment. Is this amount of time expected?

If not, please let me know whether I can further optimize the configuration. I'm currently running version 0.7.300.

[screenshot: fedml_config.yaml settings]

chaoyanghe commented 2 years ago

@Mirian-Hipolito If you set `client_num_per_round: 10`, please also change `worker_num: 10`. For your `mapping_config1_5`, you need an assignment like [3, 3, 3, 2].
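A minimal sketch of the suggested change, assuming the `fedml_config.yaml` layout used by the MPI simulation examples (the exact key names and the `client_num_in_total` value are taken from those examples and may differ in your copy). The GPU mapping [3, 3, 3, 2] sums to 11 processes, which would cover the 10 workers plus one aggregation server spread across the 4 GPUs:

```yaml
# Sketch of the relevant fedml_config.yaml sections (assumed layout,
# based on the mpi_fedavg_datasets_and_models_example configs).
train_args:
  client_num_in_total: 3400     # example value for FedEMNIST
  client_num_per_round: 10      # clients sampled each round

device_args:
  worker_num: 10                # should match client_num_per_round
  using_gpu: true
  gpu_mapping_file: config/gpu_mapping.yaml
  gpu_mapping_key: mapping_config1_5

# And in config/gpu_mapping.yaml, distribute the 11 processes
# (10 workers + 1 server) across the 4 GPUs on one host:
# mapping_config1_5:
#   host1: [3, 3, 3, 2]
```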

Mirian-Hipolito commented 2 years ago

@chaoyanghe So I switched to `mapping_default`, which has the same setup you suggested above, but the speed doesn't seem to change much. It's been running for two hours now and is on round ~275.

[screenshot: training progress]

chaoyanghe commented 2 years ago

I'm not sure about the baseline using a single GPU. Given that the number of rounds is more than 1000, the speed seems reasonable to me.

Mirian-Hipolito commented 2 years ago

Alright, just to clarify: I'm using 4 GPUs, and I believe the screenshot above shows the right setup according to your comments. I just wanted to get an idea of whether this was the fastest configuration. Thank you!