NVIDIA / NVFlare

NVIDIA Federated Learning Application Runtime Environment
https://nvidia.github.io/NVFlare/
Apache License 2.0

Docker container hello world error socket name resolution [BUG] #2377

Open rachelglenn opened 7 months ago

rachelglenn commented 7 months ago

I am trying to use a Docker container to run the NVFlare examples. I built the container by editing the Dockerfile provided in the main branch of NVFlare:

https://github.com/NVIDIA/NVFlare/blob/main/docker/Dockerfile

I built the container and started it with the command below. I am trying to get the hello-pt example to run inside it.

podman run --rm -it --security-opt label=disable --gpus all -p 8888:8888 --ulimit stack=67108864 --device nvidia.com/gpu=all -v /workspace/:/workspace localhost/nvflare/nvflare /bin/bash

nvflare simulator -w /tmp/nvflare/test -n 2 -t 2 /workspace/NVFlare_example/jobs/hello-pt

2024-02-23 12:22:37,484 - SimulatorRunner - INFO - Create the Simulator Server.
2024-02-23 12:22:37,486 - CoreCell - INFO - server: creating listener on tcp://0:38967
2024-02-23 12:22:37,510 - CoreCell - INFO - server: created backbone external listener for tcp://0:38967
2024-02-23 12:22:37,510 - ConnectorManager - INFO - 66: Try start_listener Listener resources: {'secure': False, 'host': 'localhost'}
2024-02-23 12:22:37,511 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00002 PASSIVE tcp://0:51937] is starting
2024-02-23 12:22:38,012 - CoreCell - INFO - server: created backbone internal listener for tcp://localhost:51937
2024-02-23 12:22:38,013 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00001 PASSIVE tcp://0:38967] is starting
2024-02-23 12:22:38,092 - nvflare.fuel.hci.server.hci - INFO - Starting Admin Server localhost on Port 53769
2024-02-23 12:22:38,092 - SimulatorRunner - INFO - Deploy the Apps.
2024-02-23 12:22:38,101 - SimulatorRunner - INFO - Create the simulate clients.
2024-02-23 12:22:38,105 - ClientManager - INFO - Client: New client site-1@10.0.2.100 joined. Sent token: 30d9d61e-a6d2-4892-8706-11ed31417cb7.  Total clients: 1
2024-02-23 12:22:38,105 - FederatedClient - INFO - Successfully registered client:site-1 for project simulator_server. Token:30d9d61e-a6d2-4892-8706-11ed31417cb7 SSID:
2024-02-23 12:22:38,106 - ClientManager - INFO - Client: New client site-2@10.0.2.100 joined. Sent token: 826ff296-60ba-4b85-91e3-8fec007dcf20.  Total clients: 2
2024-02-23 12:22:38,106 - FederatedClient - INFO - Successfully registered client:site-2 for project simulator_server. Token:826ff296-60ba-4b85-91e3-8fec007dcf20 SSID:
2024-02-23 12:22:38,106 - SimulatorRunner - INFO - Set the client status ready.
2024-02-23 12:22:38,106 - SimulatorRunner - INFO - Deploy and start the Server App.
2024-02-23 12:22:38,107 - Cell - INFO - Register blob CB for channel='server_command', topic='*'
2024-02-23 12:22:38,108 - Cell - INFO - Register blob CB for channel='aux_communication', topic='*'
2024-02-23 12:22:38,108 - ServerCommandAgent - INFO - ServerCommandAgent cell register_request_cb: server.simulate_job
2024-02-23 12:22:40,378 - matplotlib.font_manager - INFO - generated new fontManager
2024-02-23 12:22:41,672 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job]: Server runner starting ...
2024-02-23 12:22:41,673 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job]: starting workflow pre_train (<class 'nvflare.app_common.workflows.initialize_global_weights.InitializeGlobalWeights'>) ...
2024-02-23 12:22:41,673 - InitializeGlobalWeights - INFO - [identity=simulator_server, run=simulate_job, wf=pre_train]: Initializing BroadcastAndProcess.
2024-02-23 12:22:41,673 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=pre_train]: Workflow pre_train (<class 'nvflare.app_common.workflows.initialize_global_weights.InitializeGlobalWeights'>) started
2024-02-23 12:22:41,674 - InitializeGlobalWeights - INFO - [identity=simulator_server, run=simulate_job, wf=pre_train]: scheduled task get_weights
2024-02-23 12:22:42,112 - SimulatorClientRunner - INFO - Start the clients run simulation.
2024-02-23 12:22:43,114 - SimulatorClientRunner - INFO - Simulate Run client: site-1 on GPU group: None
2024-02-23 12:22:43,114 - SimulatorClientRunner - INFO - Simulate Run client: site-2 on GPU group: None
2024-02-23 12:22:44,138 - ClientTaskWorker - INFO - ClientTaskWorker started to run
2024-02-23 12:22:44,145 - ClientTaskWorker - INFO - ClientTaskWorker started to run
2024-02-23 12:22:44,193 - CoreCell - INFO - site-1.simulate_job: created backbone external connector to tcp://localhost:38967
2024-02-23 12:22:44,194 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00001 ACTIVE tcp://localhost:38967] is starting
2024-02-23 12:22:44,194 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00002 127.0.0.1:46358 => 127.0.0.1:38967] is created: PID: 89
2024-02-23 12:22:44,195 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00005 127.0.0.1:38967 <= 127.0.0.1:46358] is created: PID: 66
2024-02-23 12:22:44,200 - CoreCell - INFO - site-2.simulate_job: created backbone external connector to tcp://localhost:38967
2024-02-23 12:22:44,200 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00001 ACTIVE tcp://localhost:38967] is starting
2024-02-23 12:22:44,201 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00002 127.0.0.1:46370 => 127.0.0.1:38967] is created: PID: 90
2024-02-23 12:22:44,201 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00006 127.0.0.1:38967 <= 127.0.0.1:46370] is created: PID: 66
2024-02-23 12:22:47,375 - JsonScanner - ERROR - Traceback (most recent call last):
  File "/usr/local/lib/python3.8/urllib/request.py", line 1354, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/usr/local/lib/python3.8/http/client.py", line 1256, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/lib/python3.8/http/client.py", line 1302, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.8/http/client.py", line 1251, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.8/http/client.py", line 1011, in _send_output
    self.send(msg)
  File "/usr/local/lib/python3.8/http/client.py", line 951, in send
    self.connect()
  File "/usr/local/lib/python3.8/http/client.py", line 1418, in connect
    super().connect()
  File "/usr/local/lib/python3.8/http/client.py", line 922, in connect
    self.sock = self._create_connection(
  File "/usr/local/lib/python3.8/socket.py", line 787, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
  File "/usr/local/lib/python3.8/socket.py", line 918, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -3] Temporary failure in name resolution


YuanTingHsieh commented 7 months ago

Hi @rachelglenn, thanks for your interest!

Did you run prepare_data.sh first? (bash ./prepare_data.sh)

If your Docker container can't connect to the outside network, you can download the data before you start the container and then mount the data directory into it.
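One quick way to confirm whether name resolution works inside the container is a small check like the one below. This is only a sketch: `can_resolve` is a hypothetical helper, not part of NVFlare, and it wraps the same `socket.getaddrinfo` call that appears at the bottom of the traceback.

```python
import socket


def can_resolve(host, port=443):
    """Return True if `host` resolves to at least one address, else False.

    Hypothetical helper for illustration; it is not part of NVFlare.
    """
    try:
        # Same stdlib call that raised socket.gaierror in the traceback.
        results = socket.getaddrinfo(host, port, 0, socket.SOCK_STREAM)
    except (socket.gaierror, UnicodeError):
        # gaierror: lookup failed; UnicodeError: malformed hostname.
        return False
    return len(results) > 0
```

Running `can_resolve("www.cs.toronto.edu")` inside the container (assuming that is the host the dataset is fetched from) should return `False` if DNS is unavailable there, which would point to a container networking problem rather than an NVFlare one.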

Be sure to modify the data root in https://github.com/NVIDIA/NVFlare/blob/main/examples/hello-world/hello-pt/jobs/hello-pt/app/custom/cifar10trainer.py#L40 and https://github.com/NVIDIA/NVFlare/blob/main/examples/hello-world/hello-pt/jobs/hello-pt/app/custom/cifar10validator.py#L31
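As a rough illustration of the data root change, the sketch below reads the root from an environment variable and checks whether CIFAR-10 is already extracted there. The variable name `NVFLARE_DATA_ROOT`, the `~/data` default, and both helper functions are assumptions made for this example and do not come from the NVFlare code; `cifar-10-batches-py` is the folder name torchvision uses for the extracted archive.

```python
import os
from pathlib import Path

# Folder torchvision creates when it extracts the CIFAR-10 archive.
CIFAR10_FOLDER = "cifar-10-batches-py"


def resolve_data_root(env_var="NVFLARE_DATA_ROOT", default="~/data"):
    """Return the dataset root as a Path.

    The environment variable name and the default are hypothetical;
    adjust them to match wherever the data directory is mounted.
    """
    return Path(os.environ.get(env_var, default)).expanduser()


def cifar10_is_present(data_root):
    """Return True if the extracted CIFAR-10 folder exists under data_root."""
    return (Path(data_root) / CIFAR10_FOLDER).is_dir()
```

If `cifar10_is_present(resolve_data_root())` returns `True` inside the container, the mounted directory is visible and the trainer should not need to reach the network for the dataset, provided it is pointed at the same root.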

I would actually suggest you go through these examples first: https://github.com/NVIDIA/NVFlare/tree/main/examples/hello-world/ml-to-fl/pt

chesterxgchen commented 7 months ago

@IsaacYangSLA, can you help with some insight?

YuanTingHsieh commented 5 months ago

@rachelglenn Can you try running the hello-numpy-sag example inside the container and see whether it works?