grpc connection closes immediately from client side when server is on AWS, but not in local docker

nicolascaiitec commented 1 month ago

What is your question?

I have a docker container running locally that contains the server.

The client runs locally on my host machine, and when I connect it to the server everything works normally: fit, aggregation, and all the rounds complete fine.

But when I run the same Docker container on AWS ECS (as an ECS service), the server is listening, and when I try to connect one client, the client immediately closes the connection without any error:

DEBUG:flwr:Opened secure gRPC connection using certificates
DEBUG:flwr:ChannelConnectivity.IDLE
DEBUG:flwr:ChannelConnectivity.CONNECTING
DEBUG:flwr:ChannelConnectivity.READY
DEBUG:flwr:gRPC channel closed
INFO :      Disconnect and shut down
INFO:flwr:Disconnect and shut down

The server does not log anything. It just keeps listening.

When I try to connect the client to the server on AWS/ECS without passing certificates, the connection fails and this log appears on AWS/ECS:

E0000 00:00:1727877451.667359 36 ssl_transport_security.cc:1654] Handshake failed with fatal error SSL_ERROR_SSL: error:100000f7:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER

This suggests that the client is indeed able to reach the AWS server. But somehow, when passing certificates, it immediately closes the connection without any error.

I also tried not requiring certificates on the server and not passing them from the client. The output is the same.

This was not happening before I made some major changes in the code, when I upgraded from flwr 1.3.0 to flwr 1.9.0. I also tried the latest flwr 1.11.1, with the same result.

I do not know how to fix this. It was working just fine with flwr 1.3.0. I then did some refactoring to adapt to the classes of flwr 1.9.0 (fit, fitresponse, evaluate, evaluateResponse, etc.). It works fine in a Docker image/container that runs locally, and the same image is deployed on AWS/ECS.

I cannot understand what is wrong.

Any help is appreciated.

p.s.

Let me add some details:

The server keeps listening even though the one client immediately closes the connection (the minimum number of clients required for federation is 2 in this case).
It works fine with flwr 1.3.0 without touching any configuration or network settings in AWS ECS. The upgrade from flwr 1.3.0 to 1.9.0 changed nothing in the networking logic. I only adapted the fit and evaluate methods to use the new classes FitRes, EvaluateRes, FitIns, EvaluateIns, Parameters, etc. The only other change is:
fl.client.start_numpy_client (flwr 1.3.0) ----> now using fl.client.start_client

Other than this, there are no network or configuration changes.
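For readers comparing the two entry points, the change from fl.client.start_numpy_client to fl.client.start_client might look roughly like the sketch below. It assumes a flwr release in which NumPyClient exposes to_client() and start_client accepts the insecure and root_certificates keyword arguments (check the signature of the installed version). The names build_start_client_kwargs, run_client, and ca_cert_path are hypothetical helpers introduced only for illustration; they are not taken from the original code.

```python
from pathlib import Path


def build_start_client_kwargs(server_address, ca_cert_path=None):
    """Build connection kwargs for fl.client.start_client (hypothetical helper).

    Without a CA certificate path the connection is requested as insecure;
    with one, the PEM bytes are passed as root_certificates.
    """
    kwargs = {"server_address": server_address}
    if ca_cert_path is None:
        kwargs["insecure"] = True
    else:
        kwargs["root_certificates"] = Path(ca_cert_path).read_bytes()
    return kwargs


def run_client(numpy_client, server_address, ca_cert_path=None):
    """Start a client from an existing NumPyClient instance.

    flwr is imported lazily so this module can be loaded without it.
    Assumption: the installed flwr provides NumPyClient.to_client().
    """
    import flwr as fl

    fl.client.start_client(
        client=numpy_client.to_client(),
        **build_start_client_kwargs(server_address, ca_cert_path),
    )
```

Keeping the connection arguments in one place makes it easier to confirm that the local and AWS runs differ only in the server address and certificate path.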

PaulaDelgado-Santos commented 1 month ago

Hi, I am facing the same problem! Thanks

Robert-Steiner commented 1 month ago

Hey @nicolascaiitec, great to see that you want to run the server on AWS ECS.

To resolve the issue, I need some more information:

What Docker images are you using? Are you using the official flwr images?
Are you persisting the state or using the in-memory database?
Did you generate the certificates using the script at https://github.com/adap/flower/blob/main/dev/certificates/generate.sh?

To help troubleshoot the issue, you can try enabling gRPC trace logs by following the instructions in this link: https://github.com/grpc/grpc/blob/master/TROUBLESHOOTING.md#grpc_trace
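As a rough sketch of what enabling those trace logs can look like on the client side: gRPC core reads the GRPC_VERBOSITY and GRPC_TRACE environment variables, so they need to be set before grpc (and therefore flwr) is imported. The helper name enable_grpc_tracing is hypothetical, and "all" is used as the tracer list for simplicity; a narrower list can be taken from the linked troubleshooting page.

```python
import os


def enable_grpc_tracing(tracers="all", verbosity="DEBUG"):
    """Set the gRPC core logging environment variables (hypothetical helper).

    Call this before importing grpc or flwr, because the variables are
    read when the gRPC core library initializes.
    """
    os.environ["GRPC_VERBOSITY"] = verbosity
    os.environ["GRPC_TRACE"] = tracers
    return {key: os.environ[key] for key in ("GRPC_VERBOSITY", "GRPC_TRACE")}


if __name__ == "__main__":
    # Show the values that will be visible to gRPC in this process.
    print(enable_grpc_tracing())
```

The same two variables can also be set in the ECS task definition to get the server-side view of the connection.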

nicolascaiitec commented 1 month ago

@PaulaDelgado-Santos I actually solved it by downgrading flwr to version 1.6. I tried various versions; I needed XGBoost support, and flwr 1.6 was a good compromise.

@Robert-Steiner I am not using the official images. I just have a backend environment with the following specs:

dockerfile:

FROM public.ecr.aws/docker/library/python:3.10.13-bookworm

COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt

This image runs the server. The clients use the same requirements.txt (now with flwr 1.6.0).

The failing setup had requirements.txt pinning flwr==1.11.0. I have now solved it with flwr 1.6.0: the client no longer closes the connection immediately without returning an error.

I think the problem may be related to the version of grpcio.
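One way to check that hypothesis is to print the versions actually installed in each environment (local container, ECS task, client host) and compare them. The sketch below uses only the standard library. The package list is an assumption about which dependencies are relevant, and installed_versions is a hypothetical helper name.

```python
from importlib import metadata


def installed_versions(packages=("flwr", "grpcio", "protobuf")):
    """Return a mapping of package name to installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}=={version}" if version else f"{name}: not installed")
```

Because requirements.txt pins only flwr, the resolved grpcio version can differ between image builds, so comparing this output between the working and failing setups is a quick first check.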

The certificates did not influence the problem. I tried removing the certificate requirement from both the server and the client; the problem was still there when deployed on AWS, while locally it worked just fine.

adap / flower

grpc connection closes immediately from client side when server is on AWS, but not in local docker #4279
