adap / flower

Flower: A Friendly Federated AI Framework
https://flower.ai
Apache License 2.0

Server stops waiting for client evaluation results when both server-side and client-side evaluation are used #4519

Open MikeRiz521 opened 1 week ago

MikeRiz521 commented 1 week ago

Describe the bug

When the model is evaluated both centrally on the server and federated on the clients, the server finishes its own evaluation for the current round and sends the evaluate message to the clients. After a specific amount of time, the server appears to abruptly close the connection with the clients, logs the aggregate_evaluate steps as failures, and proceeds to the next round, leaving the clients with nowhere to send their evaluation results.

Steps/Code to Reproduce

Pass both an evaluate_fn and an evaluate_metrics_aggregation_fn to the federated learning strategy, so that each round runs server-side and client-side evaluation.
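A minimal sketch of such a setup, assuming the FedAvg strategy (the report does not say which strategy was used). The function bodies and the names weighted_average and build_strategy are illustrative placeholders, not the reporter's code; the Flower import is kept inside build_strategy so the two callbacks can be read and run on their own.

```python
from typing import Dict, List, Optional, Tuple, Union

# Local aliases mirroring the shapes Flower uses for metric values.
Scalar = Union[bool, bytes, float, int, str]
Metrics = Dict[str, Scalar]


def evaluate_fn(
    server_round: int, parameters, config: Dict[str, Scalar]
) -> Optional[Tuple[float, Metrics]]:
    """Centralized (server-side) evaluation, called by the strategy each round."""
    # A real implementation would load `parameters` into a model and score it
    # on a held-out dataset. The values below are placeholders, not results.
    loss, accuracy = 0.0, 0.0
    return loss, {"accuracy": accuracy}


def weighted_average(metrics: List[Tuple[int, Metrics]]) -> Metrics:
    """Aggregate client metrics, weighting each client by its example count."""
    total = sum(num_examples for num_examples, _ in metrics)
    if total == 0:
        return {}
    accuracy = sum(n * float(m["accuracy"]) for n, m in metrics) / total
    return {"accuracy": accuracy}


def build_strategy():
    """Wire both evaluation paths into one strategy (assumes flwr is installed)."""
    from flwr.server.strategy import FedAvg

    return FedAvg(
        evaluate_fn=evaluate_fn,  # server-side evaluation
        evaluate_metrics_aggregation_fn=weighted_average,  # client-side aggregation
    )
```

With both arguments set, the strategy runs evaluate_fn on the server and then asks the clients to evaluate, which is the combination described in this report.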

Expected Results

The server should collect the distributed evaluation results from the clients, aggregate them with the configured evaluate_metrics_aggregation_fn function, and then proceed with the next round of federated training.

Actual Results

This error appears after a while in all of the clients:

I0000 00:00:1731929155.490662 16628 chttp2_transport.cc:1182] ipv4:127.0.0.1:8080: Got goaway [11] err=UNAVAILABLE:GOAWAY received; Error code: 11; Debug Text: ping_timeout {grpc_status:14, http2_error:11, created_time:"2024-11-18T13:25:55.490653401+02:00"}
I0000 00:00:1731929155.491912 16628 chttp2_transport.cc:1182] ipv4:127.0.0.1:8080: Got goaway [11] err=UNAVAILABLE:GOAWAY received; Error code: 11; Debug Text: ping_timeout {created_time:"2024-11-18T13:25:55.491907782+02:00", http2_error:11, grpc_status:14}

After some experimentation, this seems to be a timing issue: if the client evaluation step takes a short amount of time, it works as expected. However, when running only client-side evaluation, this time limit does not exist and evaluation can last for an hour or more. Maybe there is a way to configure the gRPC settings of the server to keep the connection alive for longer?
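For reference, gRPC keepalive behaviour is controlled through channel arguments. The sketch below only builds such an options list; the helper name keepalive_options and the default values are illustrative assumptions, and whether Flower exposes a way to pass these options to its server is not established in this thread.

```python
from typing import List, Tuple


def keepalive_options(
    time_ms: int = 60_000, timeout_ms: int = 20_000
) -> List[Tuple[str, int]]:
    """Build gRPC channel arguments that control keepalive pings.

    The defaults are illustrative values, not recommendations.
    """
    return [
        # Interval between keepalive pings.
        ("grpc.keepalive_time_ms", time_ms),
        # How long to wait for a ping acknowledgement before closing the connection.
        ("grpc.keepalive_timeout_ms", timeout_ms),
        # Allow keepalive pings even when no call is in flight.
        ("grpc.keepalive_permit_without_calls", 1),
        # 0 removes the limit on pings sent without data frames.
        ("grpc.http2.max_pings_without_data", 0),
    ]
```

In plain gRPC Python, a list like this is passed as the options argument of grpc.server(...) or grpc.insecure_channel(...). The "ping_timeout" debug text in the log above suggests a keepalive ping went unacknowledged, so grpc.keepalive_timeout_ms is a plausible setting to look at first.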

adam-narozniak commented 3 days ago

Could you provide the full code you used?