Sorna-Meena opened this issue 2 years ago
Hi,
I want to use flower on android as well. Are you using the app provided by flower? If so, could you please tell me how you install the app? And maybe the type of your mobile as well. I tried installing it on 3 different Android mobiles, and none of them worked.... :(
Hi @Victoria-Wei ,
I followed the steps mentioned in the repo: https://github.com/adap/flower/tree/main/examples/android. You can install the app from the link https://www.dropbox.com/s/e14t3e9py3mr73v/flwr_android_client.apk?dl=1. This link is mentioned in the download_apk.sh file too.
Try installing the app on a mobile that has Android version >= 11.
Hope that works!!
Dear @Sorna-Meena,
Thank you for your reply!!! It might be a problem with the Android version, since I did exactly what you described, except that my mobiles have Android version < 10. I'll look for an emulator to install the app then.
Thank you again!!!!!!!
Indeed, the model is loaded on the Android client side, as shown in the following portion of examples/android/client/app/src/main/java/flwr/android_client/TransferLearningModelWrapper.java:
TransferLearningModelWrapper(Context context) {
    model =
        new TransferLearningModel(
            new AssetModelLoader(context, "model"),
            Arrays.asList("cat", "dog", "truck", "bird",
                "airplane", "ship", "frog", "horse", "deer",
                "automobile"));
}
It would be good to create a new server-client application that could accept new models from the server and load them over the network instead. How could we achieve this? I have an idea of how it could be done in an easy way that should not require a huge source code change.
The main idea would be to load a model offered by the server in the form of a URL when the client asks for it. Currently, the model is loaded using the AssetModelLoader, which is defined as follows in examples/android/client/transfer_api/src/main/java/org/tensorflow/lite/examples/transfer/api/AssetModelLoader.java:
public AssetModelLoader(Context context, String directoryName) {
    this.directoryName = directoryName;
    this.assetManager = context.getAssets();
}
This states that "model" is the directory name used for our client call. The model itself is not in the repository; its location is defined in the examples/android/client/app/build.gradle build configuration:
def modelUrl = 'https://www.dropbox.com/s/tubgpepk2q6xiny/models.zip?dl=1'
def modelArchivePath = "${buildDir}/model.zip"
def modelTargetLocation = 'src/main/assets/model'
This model is accompanied by its training data as well.
If we wanted to use another model, no changes would be required to the strategy file src/py/flwr/server/strategy/fedavg_android.py, but the following steps would be required in the example Android project:

- The TransferLearningModelWrapper initialization function (the TransferLearningModel array declaration) and IMAGE_SIZE should be modified; if the input image size is not square, some other modifications might be required.
- modelUrl and dataUrl should be modified to point to the other model and training dataset content.

@sisco0 Sorry for the late reply and thank you very much for your detailed response! I now understand how the model is loaded on the client side.
However, I do have a few questions. First, it is still unclear to me why the server script terminates after the federated learning process is over.
Secondly, where is the global model stored after the FL process is over? Ideally in an FL setup, once a pre-defined criterion is met (here, the number of rounds), the server aggregates the updates and finalizes the global model. The next time a federation round starts, the clients use this updated global model for training. How is this implemented in Flower?
Lastly, I want to use the updated global model, i.e., the model obtained from the previous round of FL, when I run the server script again. How can this be done?
Thank you again!!
On the question of why federated learning stops after a certain number of rounds: the answer lies in the num_rounds configuration parameter for the server. When the configured number of rounds is reached, we call disconnect_all_clients(), which performs a graceful shutdown by calling shutdown() and disconnecting all the clients. disconnect_all_clients() is the last function executed in _fl(), near the end of the start_server() function, which is why your server shuts down after a fixed number of training rounds.
https://github.com/adap/flower/blob/67cb2f37dac076c7ef62ff34145cd2c2545fc310/src/py/flwr/server/app.py#L132 https://github.com/adap/flower/blob/67cb2f37dac076c7ef62ff34145cd2c2545fc310/src/py/flwr/server/server.py#L287-L290
How could we get an infinite number of rounds so our server is always learning? We should look into the server.fit() function, which contains the loop that runs until the configured number of rounds is reached. Currently there is no way to make this loop infinite, as server.fit() uses a range-based for loop over current_round. Maybe we could change this loop to a while True, or add a configuration option for cases where num_rounds == -1, but this is not currently implemented (it would be an easy source code modification). If you want to implement it quickly you could use an endless loop, but I would consider some kind of flag-based approach for gently shutting down the server, so that we are not just pressing Ctrl+C in the middle of a weight-saving process. You could register a signal handler for signal.SIGINT, or catch the KeyboardInterrupt exception, which is raised when SIGINT arrives (provided no other handler is attached).
https://github.com/adap/flower/blob/67cb2f37dac076c7ef62ff34145cd2c2545fc310/src/py/flwr/server/app.py#L108 https://github.com/adap/flower/blob/67cb2f37dac076c7ef62ff34145cd2c2545fc310/src/py/flwr/server/server.py#L136
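The flag-based approach above can be sketched as follows. This is only a hedged sketch: run_forever and train_one_round are hypothetical stand-ins for a modified Server.fit(), not part of the Flower API.

```python
import signal

# Sketch of a flag-based graceful shutdown for an endless round loop.
# `run_forever` and `train_one_round` are hypothetical stand-ins for a
# modified Server.fit(); they are not part of the Flower API.
stop_requested = False

def handle_sigint(signum, frame):
    # Set a flag instead of dying mid-checkpoint; the loop below then
    # exits cleanly after the current round finishes.
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGINT, handle_sigint)

def run_forever(train_one_round, max_rounds=None):
    """Run training rounds until SIGINT flips the flag (or max_rounds hits)."""
    completed = 0
    while not stop_requested:
        train_one_round(completed + 1)  # checkpoints would be saved in here
        completed += 1
        if max_rounds is not None and completed >= max_rounds:
            break
    return completed

# Simulate three rounds, stopping via the max_rounds escape hatch.
print(run_forever(lambda rnd: None, max_rounds=3))  # 3
```

This way Ctrl+C requests a stop, but the server only shuts down at a round boundary, never in the middle of writing a weights file.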
On the question of where the trained model is during and after the training process: at the end of each training round, we store the parameters in self.parameters on the Server class instance; they are taken from the res_fit variable produced by the self.fit_round() call. We can also store models during training, a process normally referred to as saving model checkpoints, and this can be done in a custom server strategy; for example, SaveModelStrategy calls savez for each round that has been run. You could use this savez call in any modified strategy that you want. This also happens at the end of the last training round, so the final model is stored as well. You could, of course, overwrite the same data file on each new round.
https://github.com/adap/flower/blob/67cb2f37dac076c7ef62ff34145cd2c2545fc310/src/py/flwr/server/server.py#L137-L142 https://github.com/adap/flower/blob/67cb2f37dac076c7ef62ff34145cd2c2545fc310/src/py/flwr_example/pytorch_save_weights/server.py#L30-L42
Then, how do I start from the last trained model that I stored as a checkpoint? Just set the initial_parameters argument using weights_to_parameters after loading your weights file (which you stored previously using savez). An example for TensorFlow is attached below.
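As a framework-agnostic, NumPy-only sketch of this checkpoint round-trip (the file name is hypothetical, and the final flwr conversion is shown only as a comment since it needs a Flower install):

```python
import numpy as np

# Save per-layer weights the way a checkpointing strategy would: one array
# per positional argument, so nothing has to be pickled.
weights = [np.ones((2, 3)), np.zeros(5)]  # stand-ins for real model layers
np.savez("round-3-weights.npz", *weights)

# Load them back in layer order; np.load keys them "arr_0", "arr_1", ...
with np.load("round-3-weights.npz") as archive:
    loaded = [archive[key] for key in sorted(archive.files)]

# The loaded List[np.ndarray] is exactly what the (pre-1.0) helper
# fl.common.weights_to_parameters(loaded) expects when building the
# strategy's initial_parameters.
print([a.shape for a in loaded])  # [(2, 3), (5,)]
```

Note the unpacking with * in the savez call: passing the list as a single argument would force NumPy to pickle it as one object, which is the source of the errors discussed further down in this thread.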
@sisco0 Thank you very much for your detailed explanation!! It was very helpful for me to understand.
But I did notice something in the model-saving SaveModelStrategy method you mentioned.
https://github.com/adap/flower/blob/67cb2f37dac076c7ef62ff34145cd2c2545fc310/src/py/flwr_example/pytorch_save_weights/server.py#L30-L42
In line 37, doesn't the super().aggregate_fit() function actually return the aggregated parameters rather than weights? I keep encountering the following error.
Error:
Traceback (most recent call last):
File "flower/examples/android/server.py", line 118, in <module>
main()
File "flower/examples/android/server.py", line 94, in main
initial_parameters=fl.common.weights_to_parameters(weights),
File "C:\Users\xxxxxx\miniconda3\envs\flower\lib\site-packages\flwr\common\parameter.py", line 28, in weights_to_parameters
tensors = [ndarray_to_bytes(ndarray) for ndarray in weights]
TypeError: iteration over a 0-d array
Process finished with exit code 1
After I changed lines 37-42 as follows, I no longer encounter the above error, but the fit round fails for all clients, and the server output either freezes after fit_round (shown in the Output below) or returns the weights as None, which in turn makes it impossible to save the model weights.
Code:
aggregated_parameters_tuple = super().aggregate_fit(rnd, results, failures)
aggregated_parameters, _ = aggregated_parameters_tuple
if aggregated_parameters is not None:
    print(f"Saving round {rnd} aggregated_weights...")
    # Convert `Parameters` to `List[np.ndarray]`
    aggregated_weights: List[np.ndarray] = fl.common.parameters_to_weights(aggregated_parameters)
    np.savez(f"round-{rnd}-weights.npz", aggregated_weights)
return aggregated_weights
Output:
INFO flower 2022-01-07 18:19:05,236 | server.py:106 | Initializing global parameters
INFO flower 2022-01-07 18:19:05,236 | server.py:290 | Using initial parameters provided by strategy
INFO flower 2022-01-07 18:19:05,236 | server.py:109 | Evaluating initial parameters
INFO flower 2022-01-07 18:19:05,237 | server.py:122 | FL starting
DEBUG flower 2022-01-07 18:21:21,477 | server.py:241 | fit_round: strategy sampled 2 clients (out of 2)
DEBUG flower 2022-01-07 18:21:22,048 | server.py:250 | fit_round received 0 results and 2 failures
Saving round 1 aggregated_weights...
DEBUG flower 2022-01-07 18:22:18,744 | server.py:190 | evaluate_round: strategy sampled 2 clients (out of 2)
Even when I try to reconnect the clients to the server, the server output freezes at the evaluate_round line, i.e., the last line in the output above, and there is also no response from the app (the client side).
Please note that I have not set any eval_fn in my Strategy initialization. Could the failing rounds be due to that? If so, what should I initialize the model as in eval_fn for the Android client example of Flower?
Also, I would like to know how the global model parameters are sent to the clients (the app), and how the model can be saved on the client side (Android device/emulator).
Thank you once again!
As you wisely pointed out, we are indeed returning parameters from the FedAvg aggregate_fit() function (FedAvg is the super class of this strategy), as can be seen in src/py/flwr/server/strategy/fedavg.py:257. We should fix this behavior by adding a parameters_to_weights() call in src/py/flwr_example/pytorch_save_weights/server.py:37. That would solve the savez issue you are having at this moment, as we were passing parameters instead of weights.
This error was introduced by commit 79bcf952 (2021-05-09), after the example was created by commit 3f06d544 (2020-12-03), when deprecation messages about using weights started to show up. The real root cause of this error is that the example has no Poetry project under it, so no pyproject.toml file pins the Flower version the example should run on (which should be any version before 2021-05-09); this is normally accomplished by creating a Poetry project, as can be seen in other examples.
You have two alternatives to fix this error:

1. Keep aggregate_fit() as a function that returns weights (and does not return parameters), possibly creating a pyproject.toml file for maintaining this example and contributing it to the Flower repository.
2. Update the savez() call so it works with the latest Flower release, and create the pyproject.toml file along with a README.md containing instructions on how to run this example (this is my recommended option).

By the way, are new examples expected to live under examples or under the src/py/flwr_example folder, @tanertopal @danieljanes?
I attach the related source code portions for you to understand better. https://github.com/adap/flower/blob/1c933184e2fd6ad0476064678b8966a2f3728624/src/py/flwr/server/strategy/fedavg.py#L240-L257 https://github.com/adap/flower/blob/1c933184e2fd6ad0476064678b8966a2f3728624/src/py/flwr_example/pytorch_save_weights/server.py#L30-L42
The following list of questions still needs to be answered:
@sisco0 Thank you for your immediate response!
As you suggested, I have tried out your recommended option to fix the error. But I get the following error:
Error
DEBUG flower 2022-01-10 16:06:46,573 | server.py:241 | fit_round: strategy sampled 2 clients (out of 2)
DEBUG flower 2022-01-10 16:07:08,440 | server.py:250 | fit_round received 2 results and 0 failures
Traceback (most recent call last):
File "C:\Users\xxxxxx\miniconda3\envs\flower\lib\site-packages\numpy\lib\npyio.py", line 444, in load
raise ValueError("Cannot load file containing pickled data "
ValueError: Cannot load file containing pickled data when allow_pickle=False
Process finished with exit code -1
Instead of converting parameters to weights using the fl.common.parameters_to_weights function, I used the code below to avoid the ValueError.
Code
class SaveModelStrategy(fl.server.strategy.FedAvg):
    def aggregate_fit(
        self,
        rnd: int,
        results: List[Tuple[fl.server.client_proxy.ClientProxy, fl.common.FitRes]],
        failures: List[BaseException],
    ) -> Optional[fl.common.Weights]:
        aggregated_parameters_tuple = super().aggregate_fit(rnd, results, failures)
        aggregated_parameters, _ = aggregated_parameters_tuple
        if aggregated_parameters is not None:
            # Save aggregated_weights
            weights_list = [np.frombuffer(tensor) for tensor in aggregated_parameters.tensors]
            print(f"Saving round {rnd} aggregated_weights...")
            np.savez(f"round-{rnd}-weights.npz", weights_list)
        return aggregated_parameters_tuple
Though that error is fixed, training always fails for all clients during the fit_round, and the server output either freezes after fit_round or returns the weights as None, which in turn makes it unable to save the model weights. Even when I try to reconnect the clients to the server, the server output freezes at evaluate_round, and there is no response from the app (the client side). How can this be fixed?
Thanks in advance!
Hi @danieljanes,
I am using Flower on Android, and I noticed that the Flower server doesn't send the model to the Android clients; instead, the model is pre-built on the client device. How can this be resolved?
Also, currently, after the clients finish training n rounds and return the losses and metrics to the server, the terminal running the server script terminates. The next time we run the server script, I am not sure whether the clients get the updated model weights from the previous FL training. Why is this happening, and how can it be solved?