adap / flower

Flower: A Friendly Federated Learning Framework
https://flower.ai
Apache License 2.0

Client cannot connect via gRPC #2962

Open PaulKMandal opened 4 months ago

PaulKMandal commented 4 months ago

Describe the bug

Client gets a gRPC error when trying to connect on flwr version 0.17.0

Steps/Code to Reproduce

My code is available here: https://github.com/PaulKMandal/flower_cv/tree/main

Expected Results

The model should begin training

Actual Results

I get the following error:

Traceback (most recent call last):
  File "/home/paul/Research/flower_cv/client.py", line 31, in <module>
    fl.client.start_client(server_address="[::]:8080", client=ObjectDetectionClient())
  File "/home/paul/Research/flower_cv/venv/lib/python3.11/site-packages/flwr/client/app.py", line 248, in start_client
    _start_client_internal(
  File "/home/paul/Research/flower_cv/venv/lib/python3.11/site-packages/flwr/client/app.py", line 361, in _start_client_internal
    message = receive()
              ^^^^^^^^^
  File "/home/paul/Research/flower_cv/venv/lib/python3.11/site-packages/flwr/client/grpc_client/connection.py", line 132, in receive
    proto = next(server_message_iterator)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/paul/Research/flower_cv/venv/lib/python3.11/site-packages/grpc/_channel.py", line 540, in __next__
    return self._next()
           ^^^^^^^^^^^^
  File "/home/paul/Research/flower_cv/venv/lib/python3.11/site-packages/grpc/_channel.py", line 966, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses; last error: UNKNOWN: ipv6:%5B::%5D:8080: Failed to connect to remote host: Connection refused"
        debug_error_string = "UNKNOWN:Error received from peer  {created_time:"2024-02-15T12:42:51.805677812-06:00", grpc_status:14, grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv6:%5B::%5D:8080: Failed to connect to remote host: Connection refused"}"

danieljanes commented 4 months ago

Hi @PaulKMandal, are the server and (at least) two clients running?

I'd strongly recommend upgrading to the latest Flower release (1.7 as of today). I had to check the release log to be sure: Flower 0.17.0 was released in 2021.

This guide helps with upgrading from pre-1.0 to 1.0+ releases: https://flower.ai/docs/framework/how-to-upgrade-to-flower-1.0.html
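The "Connection refused" in the first traceback means nothing accepted the TCP connection at that address and port when the client dialed. As a minimal sketch using only the Python standard library, a client script could poll the port before calling fl.client.start_client. The helper name wait_for_port and the host and port values are hypothetical and are not part of Flower:

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on that port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Covers connection refused, timeouts, and name resolution errors.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    # Placeholder address: replace with the host and port the server listens on.
    if not wait_for_port("127.0.0.1", 8080, timeout=5.0):
        print("No listener reachable on 127.0.0.1:8080")
```

This only checks that the port accepts TCP connections. It says nothing about whether the process behind it is a Flower server or whether enough clients have joined.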

PaulKMandal commented 4 months ago

I was only testing with one client running. I have tested it with two clients and I now get the following error:

TypeError: ObjectDetectionClient.get_parameters() takes 1 positional argument but 2 were given
DEBUG flwr 2024-02-16 12:07:29,243 | connection.py:220 | gRPC channel closed
Traceback (most recent call last):
  File "/home/paul/Research/flower_cv/client.py", line 31, in <module>
    fl.client.start_client(server_address="[::]:8080", client=ObjectDetectionClient())
  File "/home/paul/Research/flower_cv/venv/lib/python3.11/site-packages/flwr/client/app.py", line 248, in start_client
    _start_client_internal(
  File "/home/paul/Research/flower_cv/venv/lib/python3.11/site-packages/flwr/client/app.py", line 361, in _start_client_internal
    message = receive()
              ^^^^^^^^^
  File "/home/paul/Research/flower_cv/venv/lib/python3.11/site-packages/flwr/client/grpc_client/connection.py", line 132, in receive
    proto = next(server_message_iterator)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/paul/Research/flower_cv/venv/lib/python3.11/site-packages/grpc/_channel.py", line 540, in __next__
    return self._next()
           ^^^^^^^^^^^^
  File "/home/paul/Research/flower_cv/venv/lib/python3.11/site-packages/grpc/_channel.py", line 966, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "Socket closed"
        debug_error_string = "UNKNOWN:Error received from peer  {created_time:"2024-02-16T12:07:29.037668125-06:00", grpc_status:14, grpc_message:"Socket closed"}"

I will try upgrading later, but I don't want to rewrite my entire implementation yet.
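The TypeError above says get_parameters was called with one more argument than the method accepts. In Flower 1.0+ the client's get_parameters receives a config dictionary, which is one of the changes covered in the upgrade guide linked earlier. The sketch below reproduces the mismatch in plain Python without importing flwr. The class names and call_like_framework are hypothetical stand-ins, not Flower APIs:

```python
from typing import Any, Dict, List


class OldStyleClient:
    # Pre-1.0 style signature: no config argument.
    def get_parameters(self) -> List[float]:
        return [0.0]


class NewStyleClient:
    # 1.0+ style signature: accepts the config dict passed by the caller.
    def get_parameters(self, config: Dict[str, Any]) -> List[float]:
        return [0.0]


def call_like_framework(client: Any) -> List[float]:
    # Stand-in for the framework side, which passes a config dict.
    return client.get_parameters({})


if __name__ == "__main__":
    print(call_like_framework(NewStyleClient()))
    try:
        call_like_framework(OldStyleClient())
    except TypeError as err:
        # Same kind of message as in the traceback above.
        print(f"TypeError: {err}")
```

If this is the cause, adding the config parameter to ObjectDetectionClient.get_parameters should clear this particular error. Other signature changes may still apply.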

raminduw200 commented 3 months ago

I'm using version 1.7.0 of flwr, but I'm still encountering this error. It works fine locally, but when the server is hosted on an AWS EC2 cluster, I get this error on each client running on the same machine. I've opened ports 8080, 9091, 9092, and 9093 on EC2. Clients connect and train successfully, but this error occurs at the end of training.

Traceback (most recent call last):
  File "client_cyclegan.py", line 131, in <module>
    fl.client.start_client(server_address="<EC2 Public IP>:8080", client=FlwrClient(opt).to_client())
  File "/home/ramindu/miniconda3/envs/FedCycleGAN/lib/python3.8/site-packages/flwr/client/app.py", line 248, in start_client
    _start_client_internal(
  File "/home/ramindu/miniconda3/envs/FedCycleGAN/lib/python3.8/site-packages/flwr/client/app.py", line 361, in _start_client_internal
    message = receive()
  File "/home/ramindu/miniconda3/envs/FedCycleGAN/lib/python3.8/site-packages/flwr/client/grpc_client/connection.py", line 132, in receive
    proto = next(server_message_iterator)
  File "/home/ramindu/miniconda3/envs/FedCycleGAN/lib/python3.8/site-packages/grpc/_channel.py", line 542, in __next__
    return self._next()
  File "/home/ramindu/miniconda3/envs/FedCycleGAN/lib/python3.8/site-packages/grpc/_channel.py", line 968, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "Socket closed"
    debug_error_string = "UNKNOWN:Error received from peer  {created_time:"2024-03-19T08:17:21.62228519+05:30", grpc_status:14, grpc_message:"Socket closed"}"
>

GabriJP commented 3 months ago

I believe I had the same problem. For some reason, the server only creates an IPv6 socket.

For me the solution was to completely disable IPv6 system-wide on the server machine.

You can easily check whether this is your problem by running something like netstat -tulpn and seeing whether the server port is bound only to an IPv6 address.
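A similar check can be sketched in Python. Run on the server machine, it tries a TCP connection to the loopback address of each address family separately. The function name probe_families is hypothetical, and port 8080 is taken from the tracebacks above:

```python
import socket
from typing import Dict


def probe_families(port: int, timeout: float = 1.0) -> Dict[str, bool]:
    """Try a TCP connect to the loopback address of each family on `port`."""
    targets = {
        "ipv4": (socket.AF_INET, ("127.0.0.1", port)),
        "ipv6": (socket.AF_INET6, ("::1", port)),
    }
    results: Dict[str, bool] = {}
    for name, (family, addr) in targets.items():
        try:
            with socket.socket(family, socket.SOCK_STREAM) as sock:
                sock.settimeout(timeout)
                sock.connect(addr)
            results[name] = True
        except OSError:
            # Refused, timed out, or this address family is unavailable.
            results[name] = False
    return results


if __name__ == "__main__":
    print(probe_families(8080))
```

If only the ipv6 entry is True, clients that dial an IPv4 address will likely be refused. On some systems an IPv6 socket bound to [::] also accepts IPv4 connections, so treat the result as a hint and not a diagnosis.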