Open hwpang opened 3 months ago
I was able to get this to work after switching to the default provision file generated by the newest nvflare. However, I have now run into another problem. I am following step 4 of the federated learning demo to prepare the clients: https://github.com/hwpang/medical-imaging/blob/main/federated-learning/README.md#4-prepare-clients. Starting a client produces the error below. I would appreciate any advice on how to resolve it.
PYTHONPATH is /local/custom:
start fl because of no pid.fl
new pid 7883
Waiting for SP....
2024-08-20 18:10:01,351 - CoreCell - INFO - FL-Asia-Hospital: created backbone external connector to grpc://server1:8002
2024-08-20 18:10:01,354 - ConnectorManager - INFO - 7883: Try start_listener Listener resources: {'secure': False, 'host': 'localhost'}
2024-08-20 18:10:01,377 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00002 PASSIVE tcp://0:55919] is starting
2024-08-20 18:10:01,706 - Communicator - INFO - Waiting for the client cell to be created.
2024-08-20 18:10:01,886 - CoreCell - INFO - FL-Asia-Hospital: created backbone internal listener for tcp://localhost:55919
2024-08-20 18:10:01,893 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00001 ACTIVE grpc://server1:8002] is starting
2024-08-20 18:10:01,899 - FederatedClient - INFO - Wait for engine to be created.
2024-08-20 18:10:01,905 - nvflare.fuel.f3.drivers.grpc_driver.GrpcDriver - INFO - created secure channel at server1:8002
2024-08-20 18:10:01,912 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00002 N/A => server1:8002] is created: PID: 7883
2024-08-20 18:10:01,972 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00002 Not Connected] is closed PID: 7883
2024-08-20 18:10:01,979 - nvflare.fuel.f3.drivers.grpc_driver.GrpcDriver - INFO - CLIENT: finished connection [CN00002 Not Connected]
2024-08-20 18:10:03,018 - nvflare.fuel.f3.drivers.grpc_driver.GrpcDriver - INFO - created secure channel at server1:8002
2024-08-20 18:10:03,024 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00003 N/A => server1:8002] is created: PID: 7883
2024-08-20 18:10:03,081 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00003 Not Connected] is closed PID: 7883
2024-08-20 18:10:03,088 - nvflare.fuel.f3.drivers.grpc_driver.GrpcDriver - INFO - CLIENT: finished connection [CN00003 Not Connected]
2024-08-20 18:10:05,128 - nvflare.fuel.f3.drivers.grpc_driver.GrpcDriver - INFO - created secure channel at server1:8002
2024-08-20 18:10:05,135 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00004 N/A => server1:8002] is created: PID: 7883
2024-08-20 18:10:05,196 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00004 Not Connected] is closed PID: 7883
2024-08-20 18:10:05,202 - nvflare.fuel.f3.drivers.grpc_driver.GrpcDriver - INFO - CLIENT: finished connection [CN00004 Not Connected]
2024-08-20 18:10:09,267 - nvflare.fuel.f3.drivers.grpc_driver.GrpcDriver - INFO - created secure channel at server1:8002
2024-08-20 18:10:09,273 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00005 N/A => server1:8002] is created: PID: 7883
2024-08-20 18:10:09,286 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00005 Not Connected] is closed PID: 7883
2024-08-20 18:10:09,292 - nvflare.fuel.f3.drivers.grpc_driver.GrpcDriver - INFO - CLIENT: finished connection [CN00005 Not Connected]
2024-08-20 18:10:09,298 - nvflare.fuel.f3.sfm.conn_manager - INFO - Retrying [CH00001 ACTIVE grpc://server1:8002] in 8 seconds
Exception in thread Thread-1 (_rnq_worker):
Traceback (most recent call last):
  File "/anaconda/envs/nvflare_env/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/anaconda/envs/nvflare_env/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/azureuser/NVFlare/nvflare/ha/dummy_overseer_agent.py", line 112, in _rnq_worker
    self._do_callback()
  File "/home/azureuser/NVFlare/nvflare/ha/dummy_overseer_agent.py", line 106, in _do_callback
    self._update_callback(self)
  File "/home/azureuser/NVFlare/nvflare/private/fed/client/fed_client_base.py", line 147, in overseer_callback
    self.set_primary_sp(sp)
  File "/home/azureuser/NVFlare/nvflare/private/fed/client/fed_client_base.py", line 362, in set_primary_sp
    return self.set_sp(self._get_project_name(), sp)
  File "/home/azureuser/NVFlare/nvflare/private/fed/client/fed_client_base.py", line 162, in set_sp
    self._create_cell(location, scheme)
  File "/home/azureuser/NVFlare/nvflare/private/fed/client/fed_client_base.py", line 220, in _create_cell
    raise RuntimeError(f"Failed to get engine after {time.time()-start} seconds")
RuntimeError: Failed to get engine after 15.000278234481812 seconds
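In the log above, every connection to grpc://server1:8002 is created and then closed as "Not Connected" before the engine times out. One possible explanation (an assumption, not something confirmed in this thread) is that the client machine cannot resolve or reach `server1` on port 8002. A minimal sketch for checking that from the client machine follows; `check_endpoint` is a hypothetical helper written for this post, not part of NVFlare, and it only tests plain TCP reachability, not TLS or gRPC.

```python
import socket


def check_endpoint(host: str, port: int, timeout: float = 3.0) -> tuple[bool, str]:
    """Return (reachable, detail) for a plain TCP connection attempt.

    Hypothetical diagnostic helper; it does not use any NVFlare API.
    """
    # Step 1: can the hostname be resolved at all (DNS or /etc/hosts)?
    try:
        address = socket.gethostbyname(host)
    except socket.gaierror as exc:
        return False, f"cannot resolve {host!r}: {exc}"

    # Step 2: does anything accept TCP connections on that port?
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True, f"{host}:{port} accepts TCP connections (resolved to {address})"
    except OSError as exc:
        return False, f"{host} resolved to {address}, but port {port} is unreachable: {exc}"
```

Calling `check_endpoint("server1", 8002)` on the client machine and printing the returned detail string would show whether the name resolves and whether the port is open. If both checks pass, the cause is more likely elsewhere, for example in the certificates or the provisioned startup kit.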
Hi,
Thanks for the great demo! I am following the federated learning instructions at https://github.com/hwpang/medical-imaging/blob/main/federated-learning/README.md.
I was able to follow along until the
provision -p project.yml
step, where I encountered the following error:
I would appreciate any advice on how to modify the config file to make it work with newer versions of NVFLARE. Thanks!
Relevant information: