NVIDIA / NVFlare

NVIDIA Federated Learning Application Runtime Environment
https://nvidia.github.io/NVFlare/
Apache License 2.0

[BUG] POC prepare doesn't use overseer_agent sp_end_point ports from input project.yml #2783

Closed: parkeraddison closed this issue 2 months ago

parkeraddison commented 2 months ago

Describe the bug
When running nvflare poc prepare -i project.yml, the builders.args.overseer_agent.args.sp_end_point value for a DummyOverseerAgent is not reflected in the provisioned fed_server.json, fed_client.json, and fed_admin.json files. This means that even if you change the admin and fed_learn ports in both the server participant and the overseer_agent sp_end_point of the project yaml, the POC processes still try to connect on the default 8003/8002 ports.
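
For reference, my understanding (inferred from the defaults rather than quoted from the docs) is that sp_end_point is "<host>:<fed_learn_port>:<admin_port>", which is why the default value seen in the generated files below is localhost:8002:8003. A tiny illustrative parse, just to make the expected mapping explicit:

    # Illustrative only: assumes sp_end_point is "<host>:<fed_learn_port>:<admin_port>",
    # matching the default "localhost:8002:8003" written into the startup JSON files.
    def parse_sp_end_point(sp_end_point: str) -> dict:
        host, fl_port, admin_port = sp_end_point.split(":")
        return {"host": host, "fed_learn_port": int(fl_port), "admin_port": int(admin_port)}

    print(parse_sp_end_point("server:8004:8005"))
    # -> {'host': 'server', 'fed_learn_port': 8004, 'admin_port': 8005}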

To Reproduce
Steps to reproduce the behavior:

  1. Create a project.yml based off of the default POC config, but change the admin and fed_learn ports to 8005 and 8004:
    - sp_end_point: server:8002:8003
    + sp_end_point: server:8004:8005
    - admin_port: 8003
    + admin_port: 8005
    - fed_learn_port: 8002
    + fed_learn_port: 8004
    api_version: 3
    builders:
    - args:
        template_file:
        - master_template.yml
        - aws_template.yml
        - azure_template.yml
      path: nvflare.lighter.impl.workspace.WorkspaceBuilder
    - path: nvflare.lighter.impl.template.TemplateBuilder
    - args:
        config_folder: config
        overseer_agent:
          args:
            sp_end_point: server:8004:8005
          overseer_exists: false
          path: nvflare.ha.dummy_overseer_agent.DummyOverseerAgent
      path: nvflare.lighter.impl.static_file.StaticFileBuilder
    - path: nvflare.lighter.impl.cert.CertBuilder
    - path: nvflare.lighter.impl.signature.SignatureBuilder
    description: NVIDIA FLARE sample project yaml file
    name: example_project
    participants:
    - admin_port: 8005
      fed_learn_port: 8004
      name: server
      org: nvidia
      type: server
    - name: admin@nvidia.com
      org: nvidia
      role: project_admin
      type: admin
    - name: site-1
      org: nvidia
      type: client
    - name: site-2
      org: nvidia
      type: client
  2. Run nvflare poc prepare -i project.yml with that file
  3. Go to the provisioned file poc/example_project/prod_00/server/startup/fed_server.json and notice that the target and admin ports are properly set to 8004 and 8005, but the overseer_agent args still use sp_end_point: "localhost:8002:8003".
    {
      "format_version": 2,
      "servers": [
        {
          "name": "example_project",
          "service": {
            "target": "localhost:8004",
            "scheme": "grpc"
          },
          "admin_host": "localhost",
          "admin_port": 8005,
          "ssl_private_key": "server.key",
          "ssl_cert": "server.crt",
          "ssl_root_cert": "rootCA.pem"
        }
      ],
      "overseer_agent": {
        "args": {
          "sp_end_point": "localhost:8002:8003"
        },
        "path": "nvflare.ha.dummy_overseer_agent.DummyOverseerAgent"
      }
    }
  4. The same overseer_agent sp_end_point can be seen in admin/startup/fed_admin.json and site/startup/fed_client.json (a quick check for this is sketched after these steps).
  5. If you continue and launch the POC with nvflare poc start, the participants try, and fail, to connect over the old 8002/8003 ports. This leads to a login error and the following logs:
    # nvflare poc start
    WORKSPACE set to /Users/paddison/repos/FedRAG/outputs/poc/example_project/prod_00/server/startup/..
    PYTHONPATH is /local/custom:
    WORKSPACE set to /Users/paddison/repos/FedRAG/outputs/poc/example_project/prod_00/site-1/startup/..
    PYTHONPATH is /local/custom:
    WORKSPACE set to /Users/paddison/repos/FedRAG/outputs/poc/example_project/prod_00/site-2/startup/..
    PYTHONPATH is /local/custom:
    start fl because of no pid.fl
    new pid 34115
    Trying to obtain server address
    Obtained server address: localhost:8003
    Trying to login, please wait ...
    start fl because of no pid.fl
    new pid 34133
    2024-08-09 11:34:46,011 - nvflare.private.fed.app.deployer.server_deployer.ServerDeployer - INFO - server heartbeat timeout set to 600
    2024-08-09 11:34:46,155 - CoreCell - INFO - server: creating listener on grpc://0:8004
    2024-08-09 11:34:46,186 - CoreCell - INFO - server: created backbone external listener for grpc://0:8004
    2024-08-09 11:34:46,187 - ConnectorManager - INFO - 34115: Try start_listener Listener resources: {'secure': False, 'host': 'localhost'}
    2024-08-09 11:34:46,188 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00002 PASSIVE tcp://0:11825] is starting
    start fl because of no pid.fl
    new pid 34142
    Trying to login, please wait ...
    Waiting for SP....
    2024-08-09 11:34:46,693 - CoreCell - INFO - server: created backbone internal listener for tcp://localhost:11825
    2024-08-09 11:34:46,693 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00001 PASSIVE grpc://0:8004] is starting
    2024-08-09 11:34:46,694 - nvflare.private.fed.app.deployer.server_deployer.ServerDeployer - INFO - deployed FLARE Server.
    2024-08-09 11:34:46,706 - nvflare.fuel.hci.server.hci - INFO - Starting Admin Server localhost on Port 8005
    2024-08-09 11:34:46,706 - root - INFO - Server started
    2024-08-09 11:34:46,709 - nvflare.fuel.f3.drivers.grpc_driver.Server - INFO - added secure port at 0.0.0.0:8004
    2024-08-09 11:34:46,909 - CoreCell - INFO - site-1: created backbone external connector to grpc://localhost:8002
    2024-08-09 11:34:46,909 - ConnectorManager - INFO - 34133: Try start_listener Listener resources: {'secure': False, 'host': 'localhost'}
    2024-08-09 11:34:46,912 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00002 PASSIVE tcp://0:25585] is starting
    2024-08-09 11:34:47,415 - CoreCell - INFO - site-1: created backbone internal listener for tcp://localhost:25585
    2024-08-09 11:34:47,416 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00001 ACTIVE grpc://localhost:8002] is starting
    2024-08-09 11:34:47,416 - FederatedClient - INFO - Wait for engine to be created.
    2024-08-09 11:34:47,424 - nvflare.fuel.f3.drivers.grpc_driver.GrpcDriver - INFO - created secure channel at localhost:8002
    2024-08-09 11:34:47,424 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00002 N/A => localhost:8002] is created: PID: 34133
    2024-08-09 11:34:47,434 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00002 Not Connected] is closed PID: 34133
    2024-08-09 11:34:47,434 - nvflare.fuel.f3.drivers.grpc_driver.GrpcDriver - INFO - CLIENT: finished connection [CN00002 Not Connected]
    Waiting for SP....
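
For anyone who wants to confirm the mismatch programmatically, here is a minimal sketch (assuming PyYAML is available and the poc/example_project/prod_00 workspace layout shown above) that compares the sp_end_point requested in project.yml with what ends up in each generated startup config:

    import json
    from pathlib import Path

    import yaml  # PyYAML, assumed available (the provisioning tool itself reads YAML)

    project_yml = Path("project.yml")
    prod_dir = Path("poc/example_project/prod_00")  # workspace layout from the repro above

    # Find the sp_end_point requested via the StaticFileBuilder's overseer_agent args.
    requested = None
    for builder in yaml.safe_load(project_yml.read_text()).get("builders", []):
        agent = (builder.get("args") or {}).get("overseer_agent")
        if agent:
            requested = agent["args"]["sp_end_point"]

    # Compare it with what was written into each generated startup config.
    for startup_json in sorted(prod_dir.glob("*/startup/fed_*.json")):
        config = json.loads(startup_json.read_text())
        agent = config.get("overseer_agent")
        if not agent:
            continue
        actual = agent["args"]["sp_end_point"]
        status = "OK" if actual == requested else "MISMATCH"
        print(f"{status}: {startup_json} has {actual}, project.yml requested {requested}")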

Expected behavior
I would like to be able to change the POC overseer ports so that multiple developers can run separate POCs on the same machine, each with its own project yaml and non-conflicting ports.
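
Until the builder honors these values, a possible stopgap (a rough sketch, not an official workflow; it assumes the prod_00 layout above and that only the overseer_agent block needs correcting) is to patch sp_end_point in the generated startup files after nvflare poc prepare and before nvflare poc start:

    import json
    from pathlib import Path

    prod_dir = Path("poc/example_project/prod_00")  # workspace layout from the repro above
    desired = "localhost:8004:8005"  # assumed "<host>:<fed_learn_port>:<admin_port>" for this example

    # Rewrite the overseer_agent endpoint in every generated startup config
    # (fed_server.json, fed_client.json, fed_admin.json).
    for startup_json in sorted(prod_dir.glob("*/startup/fed_*.json")):
        config = json.loads(startup_json.read_text())
        agent = config.get("overseer_agent")
        if not agent:
            continue
        agent["args"]["sp_end_point"] = desired
        startup_json.write_text(json.dumps(config, indent=2))
        print(f"patched {startup_json}")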

Screenshots
See the files/logs pasted above.

Additional context
N/A

chesterxgchen commented 2 months ago

Good catch @parkeraddison, I will fix this.