Describe the bug
When evaluating the model both centrally on the server and federated on the clients, the server finishes evaluating the model for the current round and sends the evaluate message to the clients; then, after a specific amount of time, it abruptly closes the connection with the clients, logs the aggregate_evaluate steps as failures, and proceeds to the next round, leaving the clients with nowhere to send their evaluation results.
Steps/Code to Reproduce
Implement both an evaluate_fn and an evaluate_metrics_aggregation_fn in the federated learning strategy (see the sketch below).
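For reference, a minimal sketch of the setup I mean, using Flower's FedAvg strategy; the evaluation logic and metric names here are placeholders, not the exact code that triggers the issue:

```python
from typing import Dict, List, Optional, Tuple

import flwr as fl
from flwr.common import Metrics, NDArrays, Scalar


def evaluate_fn(
    server_round: int, parameters: NDArrays, config: Dict[str, Scalar]
) -> Optional[Tuple[float, Dict[str, Scalar]]]:
    # Centralized (server-side) evaluation; the actual model/data loading
    # is omitted and replaced with placeholder values.
    loss, accuracy = 0.0, 0.0
    return loss, {"accuracy": accuracy}


def weighted_average(metrics: List[Tuple[int, Metrics]]) -> Metrics:
    # Aggregate the clients' evaluation metrics, weighted by example count.
    total_examples = sum(num_examples for num_examples, _ in metrics)
    return {
        "accuracy": sum(num_examples * m["accuracy"] for num_examples, m in metrics)
        / total_examples
    }


strategy = fl.server.strategy.FedAvg(
    evaluate_fn=evaluate_fn,                           # centralized evaluation
    evaluate_metrics_aggregation_fn=weighted_average,  # federated evaluation aggregation
)

fl.server.start_server(
    server_address="127.0.0.1:8080",
    config=fl.server.ServerConfig(num_rounds=3),
    strategy=strategy,
)
```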
Expected Results
The server should collect the distributed evaluation results from the clients, aggregate them according to the configured evaluate_metrics_aggregation_fn function, and then proceed with the next round of federated training.
Actual Results
This error appears after a while in all of the clients:
I0000 00:00:1731929155.490662 16628 chttp2_transport.cc:1182] ipv4:127.0.0.1:8080: Got goaway [11] err=UNAVAILABLE:GOAWAY received; Error code: 11; Debug Text: ping_timeout {grpc_status:14, http2_error:11, created_time:"2024-11-18T13:25:55.490653401+02:00"}
I0000 00:00:1731929155.491912 16628 chttp2_transport.cc:1182] ipv4:127.0.0.1:8080: Got goaway [11] err=UNAVAILABLE:GOAWAY received; Error code: 11; Debug Text: ping_timeout {created_time:"2024-11-18T13:25:55.491907782+02:00", http2_error:11, grpc_status:14}
After some experimentation, it seems to be a timing issue: if the client evaluation step takes a short amount of time, everything works as expected. However, when running only client-side evaluation, this time limit does not exist and evaluation can last for an hour or more. Is there perhaps a way to configure the server's gRPC settings to keep the connection alive for longer?
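For completeness, these are the kind of gRPC keepalive/ping options I have in mind. This is only a sketch with plain grpc-python; I am not sure whether or how Flower exposes these options on its server, and the values are just examples:

```python
from concurrent import futures

import grpc

# Standard grpc-python keepalive/ping channel arguments (not Flower-specific);
# the values below are illustrative, not recommendations.
GRPC_OPTIONS = [
    ("grpc.keepalive_time_ms", 60_000),          # send a keepalive ping every 60 s
    ("grpc.keepalive_timeout_ms", 600_000),      # wait up to 10 min for the ping ack
    ("grpc.http2.max_pings_without_data", 0),    # allow pings even when no data flows
    ("grpc.keepalive_permit_without_calls", 1),  # keep pinging while no RPC is active
]

# A bare gRPC server created with these options; in the failing setup the
# server is started by Flower instead, which is why I am asking whether the
# equivalent settings can be passed through.
server = grpc.server(
    futures.ThreadPoolExecutor(max_workers=10),
    options=GRPC_OPTIONS,
)
```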