Open zzsi opened 1 month ago
Thanks @zzsi. I am not part of the Flower team, but I have been looking for something similar, namely a serverless implementation. I would be interested in understanding whether it makes sense to create an architecture proposal around this for the Flower team, and a PR.
Additionally, instead of having a central storage capability, I was hoping to have a peer-to-peer system, and I believe you have comments in your code about it. For that, I imagine clients could gossip their model weight updates to each other.
I will comment more on your repo; I am commenting here since your repo has not had any activity in the past 8 months.
Describe the type of feature and its functionality.
We have used `flwr` for a large-scale (100s of TB) medical imaging use case. Thank you for this great library, which made our life much easier.

When operating thousands of long-running experiments, we faced a few pain points, mainly:
To scratch our own itch, we implemented flwr_serverless as a wrapper of `flwr` for both Sync and Async strategies. It allows federated training to run without a central server that aggregates models. The core federation functionality is passed through to `flwr` strategies. We summarized our learnings on public data in this tech report. With the added robustness due to serverless+async, this implementation addressed our pain points and allowed us to do large-scale experimentation using `flwr` FL for the past year. We think other teams may also find this feature useful. Feedback and critique are welcome.

PS: I should probably have submitted this feature request a year ago, but better late than never to contribute upstream. I also noticed related work on Async, for which there seems to be an increasing need in practical deployments.
best, ZZ AT kungfu.ai
Describe step by step what files and adjustments are you planning to include.
We implemented `SyncFederatedNode` and `AsyncFederatedNode` to handle communication of model weights to a shared "Folder" (e.g. S3). For tensorflow/keras, we implemented a `FlwrFederatedCallback` that is easy to plug into the user's training code. This callback holds the federated node, which in turn manages model federation. We haven't implemented torch integration, but it could be similar.

An example usage:
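
To illustrate the shared-folder idea only, here is a minimal sketch in plain Python. It is not the flwr_serverless API: `InMemoryFolder`, `AsyncNodeSketch`, and `weighted_average` are hypothetical names, the "Folder" is an in-memory dict standing in for something like S3, and the aggregation is a simple FedAvg-style weighted mean rather than an actual `flwr` strategy.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Weights = List[float]  # flattened model parameters, for illustration only


@dataclass
class InMemoryFolder:
    """Hypothetical stand-in for a shared folder such as an S3 prefix."""

    _store: Dict[str, Tuple[Weights, int]] = field(default_factory=dict)

    def put(self, node_id: str, weights: Weights, num_examples: int) -> None:
        # Each node overwrites only its own entry, so no central server is needed.
        self._store[node_id] = (list(weights), num_examples)

    def items(self) -> List[Tuple[Weights, int]]:
        return list(self._store.values())


def weighted_average(entries: List[Tuple[Weights, int]]) -> Weights:
    """Average the weights, weighting each entry by its number of examples."""
    total = sum(n for _, n in entries)
    dim = len(entries[0][0])
    return [sum(w[i] * n for w, n in entries) / total for i in range(dim)]


@dataclass
class AsyncNodeSketch:
    """Hypothetical async node: publish local weights, then aggregate
    with whatever peers have published so far, without waiting for them."""

    node_id: str
    folder: InMemoryFolder

    def update_parameters(self, local_weights: Weights, num_examples: int) -> Weights:
        self.folder.put(self.node_id, local_weights, num_examples)
        return weighted_average(self.folder.items())


folder = InMemoryFolder()
node_a = AsyncNodeSketch("a", folder)
node_b = AsyncNodeSketch("b", folder)

first = node_a.update_parameters([1.0, 3.0], num_examples=10)   # only node a has published
second = node_b.update_parameters([3.0, 5.0], num_examples=30)  # averages a and b
```

In a Keras setting, a callback could call something like `update_parameters` at the end of each epoch and load the returned weights back into the model; a sync variant would instead wait until all expected peers have published before aggregating.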
Is there something else you want to add?
No response