adap / flower

Flower: A Friendly Federated AI Framework
https://flower.ai
Apache License 2.0

Support Serverless orchestration and Async strategies #4273

Open zzsi opened 1 month ago

zzsi commented 1 month ago

Describe the type of feature and its functionality.

We have used flwr for a large-scale (hundreds of TB) medical imaging use case. Thank you for this great library, which made our lives much easier.

When operating thousands of long-running experiments, we faced a few pain points, mainly:

To scratch our own itch, we implemented flwr_serverless as a wrapper around flwr for both Sync and Async strategies. It allows federated training to run without a central server that aggregates models. The core federation functionality is passed through to flwr strategies. We summarized our learnings on public data in this tech report. With the added robustness of serverless + async, this implementation addressed our pain points and has allowed us to do large-scale experimentation using flwr FL for the past year. We think other teams may also find this feature useful. Feedback and critique are welcome.

PS: I should probably have submitted this feature request a year ago, but better late than never to contribute upstream. I also noticed related work on Async, for which there seems to be an increasing need in practical deployments.

best, ZZ AT kungfu.ai

Describe step by step what files and adjustments are you planning to include.

We implemented SyncFederatedNode and AsyncFederatedNode to handle communication of model weights through a shared "Folder" (e.g. S3). For tensorflow/keras, we implemented a FlwrFederatedCallback that is easy to plug into the user's training code. This callback holds the federated node, which in turn manages model federation. We haven't implemented a torch integration, but it could be similar.
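
To illustrate the shared-folder idea described above, here is a minimal sketch of asynchronous aggregation without a central server. It is not the flwr_serverless implementation: LocalFolder, weighted_average and async_federate are hypothetical names, a local directory stands in for S3, and weights are plain lists of floats. Each node publishes its latest weights and then averages over whatever snapshots are already in the folder, weighted by example count, without waiting for peers.

```python
import json
import tempfile
from pathlib import Path


class LocalFolder:
    """Hypothetical stand-in for a shared store such as S3: one file per node."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def put(self, node_id, weights, num_examples):
        # Overwrite this node's latest snapshot.
        payload = {"weights": weights, "num_examples": num_examples}
        (self.directory / f"{node_id}.json").write_text(json.dumps(payload))

    def get_all(self):
        # Read whatever snapshots exist right now; never wait for peers.
        return [
            json.loads(path.read_text())
            for path in sorted(self.directory.glob("*.json"))
        ]


def weighted_average(snapshots):
    """FedAvg-style average of per-layer weights, weighted by example count."""
    total = sum(s["num_examples"] for s in snapshots)
    first = snapshots[0]["weights"]
    return [
        [
            sum(s["weights"][i][j] * s["num_examples"] for s in snapshots) / total
            for j in range(len(layer))
        ]
        for i, layer in enumerate(first)
    ]


def async_federate(folder, node_id, local_weights, num_examples):
    """Publish local weights, then aggregate with whatever is available."""
    folder.put(node_id, local_weights, num_examples)
    return weighted_average(folder.get_all())


with tempfile.TemporaryDirectory() as tmp:
    folder = LocalFolder(tmp)
    # Node "a" finishes an epoch first: only its own snapshot exists.
    a = async_federate(folder, "a", [[1.0, 1.0]], num_examples=100)
    # Node "b" finishes later and averages with the snapshot left by "a".
    b = async_federate(folder, "b", [[3.0, 3.0]], num_examples=300)
```

In this sketch node "a" proceeds with its own weights because no peer has published yet, while node "b" receives an average dominated by its own, larger example count. A synchronous variant would instead block until every expected peer has written a snapshot for the current round.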

An example usage:

# Create an FL node that has a strategy and a shared folder.
from tensorflow import keras  # or "import keras", depending on your setup
from flwr.server.strategy import FedAvg  # This is a flwr federated strategy.
from flwr_serverless import AsyncFederatedNode, S3Folder
from flwr_serverless.keras import FlwrFederatedCallback

strategy = FedAvg()
shared_folder = S3Folder(directory="mybucket/experiment1")
node = AsyncFederatedNode(strategy=strategy, shared_folder=shared_folder)

# Create a keras Callback with the FL node.
# steps_per_epoch and batch_size come from your own training setup.
num_examples_per_epoch = steps_per_epoch * batch_size  # number of examples used in each epoch
callback = FlwrFederatedCallback(
    node,
    num_examples_per_epoch=num_examples_per_epoch,
    save_model_before_aggregation=False,
    save_model_after_aggregation=False,
)

# Join the federated learning by fitting the model with the federated callback.
model = keras.Model(...)
model.compile(...)
model.fit(dataset, callbacks=[callback])

Is there something else you want to add?

No response

leeloodub commented 18 hours ago

Thanks @zzsi - I am not part of the Flower team, but I have been looking for something similar in terms of a serverless implementation. I would be interested in understanding whether it makes sense to create an architecture proposal around this for the Flower team, followed by a PR.

Additionally, instead of having a central storage capability, I was hoping to have a peer-to-peer system, and I believe you have comments in your code about it. For that, I would imagine clients could gossip updates on their model weights to each other.
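
As a rough illustration of the gossip idea, here is a minimal sketch of pairwise gossip averaging. It assumes every peer can reach every other peer and that all peers count equally (no weighting by example count); gossip_round and the toy two-parameter "models" are hypothetical and not taken from flwr or flwr_serverless. In each round two random peers exchange weights and both keep the average, so over many rounds all peers drift toward the global mean without any shared storage.

```python
import random


def gossip_round(weights, rng):
    """Pick two distinct peers at random and replace both with their average."""
    i, j = rng.sample(range(len(weights)), 2)
    avg = [(x + y) / 2 for x, y in zip(weights[i], weights[j])]
    weights[i] = list(avg)
    weights[j] = list(avg)


rng = random.Random(0)
# Four hypothetical peers, each holding a tiny two-parameter "model".
# The global mean of these models is [3.0, 4.0].
peers = [[0.0, 4.0], [2.0, 0.0], [4.0, 8.0], [6.0, 4.0]]
for _ in range(200):
    gossip_round(peers, rng)
```

Each pairwise average preserves the sum of the weights across peers, which is why the values settle on the global mean. A real deployment would also need peer discovery, handling of stale or dropped updates, and interleaving with local training, none of which is shown here.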

I will comment more on your repo; I am commenting here because your repo has not had any activity in the past 8 months.