Open nicolascaiitec opened 1 month ago
Hi, I am facing the same problem! Thanks
Hey @nicolascaiitec, great to see that you want to run the server on AWS ECS.
To resolve the issue, I need some more information:
flwr images?
To help troubleshoot the issue, you can try enabling gRPC trace logs by following the instructions in this link: https://github.com/grpc/grpc/blob/master/TROUBLESHOOTING.md#grpc_trace
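As a rough sketch of what the linked troubleshooting page describes, the gRPC trace logs are controlled by the GRPC_VERBOSITY and GRPC_TRACE environment variables. The helper name enable_grpc_tracing is hypothetical, and how much output you get may depend on the installed grpcio version:

```python
import os


def enable_grpc_tracing(tracers: str = "all", verbosity: str = "DEBUG") -> dict:
    """Set the gRPC debug environment variables for the current process.

    `tracers` is a comma-separated list of tracer names (or "all").
    """
    os.environ["GRPC_VERBOSITY"] = verbosity
    os.environ["GRPC_TRACE"] = tracers
    # Return what was set so the caller can log or inspect it.
    return {key: os.environ[key] for key in ("GRPC_VERBOSITY", "GRPC_TRACE")}


# Call this before importing grpc or flwr, because the gRPC core reads
# these variables when it is initialised.
enable_grpc_tracing()
```

In a container, the same two variables can be set in the task definition instead of in code, so that both the server and the client emit traces.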
@PaulaDelgado-Santos I actually solved it by downgrading flwr to version 1.6. I tried various versions; I needed XGBoost, and flwr 1.6 was a good compromise.
@Robert-Steiner I am not using the official images. I just have a backend environment hosted with the following specs:
Dockerfile:
FROM public.ecr.aws/docker/library/python:3.10.13-bookworm
COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt
This image runs the server. The clients use the same requirements.txt (now with flwr 1.6.0).
requirements.txt originally had flwr==1.11.0, but I have now solved the problem with flwr 1.6.0: the client no longer closes the connection immediately without returning an error.
I think the problem may be related to the version of grpcio.
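To compare the grpcio version that actually gets installed alongside each flwr release, a small check like the following can be run in both the server container and the client environment. This is only a sketch: the function name installed_versions is hypothetical, and including protobuf in the default list is an assumption about which dependencies are worth comparing.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("flwr", "grpcio", "protobuf")):
    """Return {package name: installed version, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}=={ver}" if ver else f"{name}: not installed")
```

Running it once with flwr 1.6.0 and once with flwr 1.11.0 would show whether the two setups resolve to different grpcio versions.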
The certificates did not influence the problem. I tried removing the certificate requirement from both the server and the client; the problem was still there when deployed on AWS, while locally it worked just fine.
What is your question?
I have a Docker container running locally that contains the server.
The client runs locally on my host machine, and when I connect the client to the server it works normally. The fit, the aggregation, and all the rounds are fine.
But when I run the Docker container on AWS ECS (as an ECS service), the server starts listening, and then I try to connect one client. The client immediately closes the connection without an error.
The server does not log anything. It just keeps listening.
When I try to connect the client to the server on AWS/ECS without passing certificates, it fails and there is a log on AWS/ECS:
E0000 00:00:1727877451.667359 36 ssl_transport_security.cc:1654] Handshake failed with fatal error SSL_ERROR_SSL: error:100000f7:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER
This suggests that the client is indeed able to communicate with the AWS server. But somehow, when passing certificates, the client immediately closes the connection without any error.
I also tried not requiring certificates on the server and not passing them from the client. The output is the same.
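The WRONG_VERSION_NUMBER handshake error typically shows up when a TLS endpoint receives bytes that are not a TLS handshake, for example from a plaintext client. To separate the network and TLS layer from flwr itself, one option is to attempt a bare TLS handshake against the ECS endpoint using only the Python standard library. This is a minimal sketch: probe_tls, the host, the port, and the ca.crt path are all hypothetical placeholders, and a successful handshake here does not prove that the gRPC layer works.

```python
import socket
import ssl


def probe_tls(host, port, cafile=None, timeout=5.0):
    """Attempt a bare TLS handshake and return (ok, detail).

    On success, detail is the negotiated TLS version; on failure it is
    the exception type and message.
    """
    context = ssl.create_default_context(cafile=cafile)
    # gRPC runs over HTTP/2, which is negotiated through ALPN as "h2".
    context.set_alpn_protocols(["h2"])
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host) as tls:
                return True, tls.version()
    except (OSError, ssl.SSLError) as exc:
        return False, f"{type(exc).__name__}: {exc}"


# Hypothetical usage (replace with the real endpoint and CA file):
# ok, detail = probe_tls("my-ecs-endpoint.example.com", 8080, cafile="ca.crt")
# print(ok, detail)
```

If the probe succeeds from the client machine but the flwr client still disconnects silently, the problem is more likely in the gRPC or flwr layer than in the AWS networking setup.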
This was not happening before I made some major changes in the code: I upgraded from flwr 1.3.0 to flwr 1.9.0. I also tried the latest flwr 1.11.1, with the same output.
I do not know how to fix this. It was working just fine with flwr 1.3.0. I then did some refactoring to adapt to the classes of flwr 1.9.0 (fit, fitresponse, evaluate, evaluateResponse, etc.). It works fine in a Docker container that runs locally, and the same image is the one deployed on AWS/ECS.
I cannot understand what is wrong.
Any help is appreciated.
p.s.
Let me add some details:
The server keeps on listening, even though 1 client immediately closes the connection (minimum clients required for federation is 2 in this case).
It works fine with flwr 1.3.0, without touching any configuration or network settings in AWS ECS. The code upgrade from flwr 1.3.0 to 1.9.0 changed nothing in the networking logic. I just adapted the fit and evaluate classes to use the new types FitRes, EvaluateRes, FitIns, EvaluateIns, Parameters, etc. The only other change is:
fl.client.start_numpy_client (flwr 1.3.0) ----> now using fl.client.start_client
Apart from this, there are no network or configuration changes.