Support Serverless orchestration and Async strategies

Describe the type of feature and its functionality.

We have used flwr for a large-scale (100s of TB) medical imaging use case. Thank you for this great library, which made our lives much easier.

When operating thousands of long running experiments, we faced a few pain points, mainly:

Managing multiple servers for individual experiments became tedious, fragile and unsustainable.
Different institutions (clients) have very different data and compute properties, which makes training speed and time very uneven and unstable, to the point that synchronization became a bottleneck.

To scratch our own itch, we implemented flwr_serverless, a wrapper around flwr that supports both Sync and Async strategies. It allows federated training to run without a central server that aggregates models. The core federation functionality is passed through to flwr strategies. We summarized our learnings on public data in this tech report. With the added robustness of serverless + async, this implementation addressed our pain points and has allowed us to do large-scale experimentation with flwr FL for the past year. We think other teams may also find this feature useful. Feedback and critique are welcome.
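To make the "no central server" idea concrete, here is a minimal sketch in plain Python. It is not the flwr_serverless implementation: `Snapshot`, `InMemoryFolder`, `weighted_average`, and `federate` are hypothetical names invented for illustration, weights are flattened to a list of floats, and the example-weighted mean only mimics what a FedAvg-style strategy computes. Each node publishes its own weights to a shared store and then aggregates locally over whatever its peers have published so far.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Snapshot:
    """One node's published model: flat weights plus the example count behind them."""
    weights: List[float]
    num_examples: int


def weighted_average(snapshots: List[Snapshot]) -> List[float]:
    """Example-weighted mean of the weights, in the spirit of FedAvg."""
    total = sum(s.num_examples for s in snapshots)
    if total == 0:
        raise ValueError("at least one snapshot must have num_examples > 0")
    dim = len(snapshots[0].weights)
    return [
        sum(s.weights[i] * s.num_examples for s in snapshots) / total
        for i in range(dim)
    ]


class InMemoryFolder:
    """Hypothetical stand-in for a shared store such as an S3 prefix."""

    def __init__(self) -> None:
        self._latest: Dict[str, Snapshot] = {}

    def put(self, node_id: str, snapshot: Snapshot) -> None:
        # Keep only the newest snapshot per node.
        self._latest[node_id] = snapshot

    def list_latest(self) -> List[Snapshot]:
        return list(self._latest.values())


def federate(folder: InMemoryFolder, node_id: str, local: Snapshot) -> List[float]:
    """Publish local weights, then aggregate on the node itself (no server)."""
    folder.put(node_id, local)
    return weighted_average(folder.list_latest())


folder = InMemoryFolder()
# The first node sees only itself, so it gets its own weights back.
result_a = federate(folder, "hospital_a", Snapshot([1.0, 2.0], 100))
# The second node sees both snapshots and averages them by example count.
result_b = federate(folder, "hospital_b", Snapshot([3.0, 4.0], 300))
```

In this sketch a node never waits for anyone: if a peer is slow or offline, aggregation simply proceeds over the snapshots that exist, which is the robustness property described above.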

PS: I should probably have submitted this feature request a year ago, but better late than never to contribute upstream. I also noticed related work on Async, for which there seems to be an increasing need in practical deployments.

best, ZZ AT kungfu.ai

Describe step by step what files and adjustments are you planning to include.

We implemented SyncFederatedNode and AsyncFederatedNode to handle communication of model weights through a shared "Folder" (e.g. S3). For TensorFlow/Keras, we implemented a FlwrFederatedCallback that is easy to plug into the user's training code. This callback holds the federated node, which in turn manages model federation. We haven't implemented a PyTorch integration, but it could be similar.
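The difference between the two node types can be sketched as follows. This is an illustrative toy, not the actual SyncFederatedNode or AsyncFederatedNode code: the class names `SyncNodeSketch` and `AsyncNodeSketch`, the dict-based store, and the `timeout_s` / `poll_s` parameters are all assumptions made for this example. The sync variant blocks until every expected node has published for the current epoch, while the async variant aggregates over each node's most recent entry without waiting.

```python
import time
from typing import Dict, List, Tuple

Entry = Tuple[List[float], int]          # (flat weights, num_examples)
Store = Dict[Tuple[str, int], Entry]     # keyed by (node_id, epoch)


def average(entries: List[Entry]) -> List[float]:
    """Example-weighted mean over a non-empty list of entries."""
    total = sum(n for _, n in entries)
    dim = len(entries[0][0])
    return [sum(w[i] * n for w, n in entries) / total for i in range(dim)]


class AsyncNodeSketch:
    """Publishes, then aggregates over the latest entry of every known node."""

    def __init__(self, node_id: str, store: Store) -> None:
        self.node_id, self.store = node_id, store

    def update(self, weights: List[float], num_examples: int, epoch: int) -> List[float]:
        self.store[(self.node_id, epoch)] = (weights, num_examples)
        latest: Dict[str, Tuple[int, Entry]] = {}
        for (nid, ep), entry in self.store.items():
            if nid not in latest or ep > latest[nid][0]:
                latest[nid] = (ep, entry)
        return average([entry for _, entry in latest.values()])


class SyncNodeSketch:
    """Publishes, then polls until all num_nodes have published this epoch."""

    def __init__(self, node_id: str, store: Store, num_nodes: int,
                 timeout_s: float = 1.0, poll_s: float = 0.01) -> None:
        self.node_id, self.store, self.num_nodes = node_id, store, num_nodes
        self.timeout_s, self.poll_s = timeout_s, poll_s

    def update(self, weights: List[float], num_examples: int, epoch: int) -> List[float]:
        self.store[(self.node_id, epoch)] = (weights, num_examples)
        deadline = time.monotonic() + self.timeout_s
        while True:
            entries = [e for (_, ep), e in self.store.items() if ep == epoch]
            if len(entries) >= self.num_nodes:
                return average(entries)
            if time.monotonic() >= deadline:
                raise TimeoutError(f"epoch {epoch}: peers did not publish in time")
            time.sleep(self.poll_s)


store: Store = {}
node_a = AsyncNodeSketch("a", store)
node_b = SyncNodeSketch("b", store, num_nodes=2, timeout_s=0.05)

r1 = node_a.update([2.0], 10, epoch=0)   # only "a" exists so far
r2 = node_b.update([4.0], 30, epoch=0)   # both nodes published epoch 0
try:
    node_b.update([5.0], 30, epoch=1)    # "a" has not reached epoch 1 yet
    timed_out = False
except TimeoutError:
    timed_out = True
r3 = node_a.update([3.0], 10, epoch=1)   # async: uses latest entries, no wait
```

The timeout in the sync sketch stands in for the bottleneck described earlier: one slow institution stalls everyone at the epoch boundary, whereas the async node keeps training with slightly stale peer weights.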

An example usage:

# Create a FL Node that has a strategy and a shared folder.
from tensorflow import keras
from flwr.server.strategy import FedAvg  # This is a flwr federated strategy.
from flwr_serverless import AsyncFederatedNode, S3Folder
from flwr_serverless.keras import FlwrFederatedCallback

strategy = FedAvg()
shared_folder = S3Folder(directory="mybucket/experiment1")
node = AsyncFederatedNode(strategy=strategy, shared_folder=shared_folder)

# Create a keras Callback with the FL node.
# steps_per_epoch and batch_size come from your own training configuration.
num_examples_per_epoch = steps_per_epoch * batch_size  # number of examples used in each epoch
callback = FlwrFederatedCallback(
    node,
    num_examples_per_epoch=num_examples_per_epoch,
    save_model_before_aggregation=False,
    save_model_after_aggregation=False,
)

# Join the federated learning, by fitting the model with the federated callback.
model = keras.Model(...)
model.compile(...)
model.fit(dataset, callbacks=[callback])

Is there something else you want to add?

No response
