FederatedAI / KubeFATE

Manage federated learning workload using cloud native technologies.
Apache License 2.0

v1.9.0 (FATE on Spark), deployment issues #778

Closed FranisiL closed 2 years ago

FranisiL commented 2 years ago

What deployment mode are you using? docker-compose

What KubeFATE and FATE versions are you using? KubeFATE V1.9.0

MUST: Please state the KubeFATE and FATE versions in which you found the issue: KubeFATE V1.9.0, FATE V1.9.0

What OS are you using for docker-compose or Kubernetes? Please also state the OS version.


To Reproduce

  1. Download KubeFATE (V1.9.0): Download URL

  2. Extract kubefate-docker-compose-v1.9.0.tar.gz and generate the deployment package:

    • .env

      RegistryURI=
      TAG=1.9.0-release
      SERVING_TAG=2.1.6-release
      SSH_PORT=22
    • parties.conf

      #!/bin/bash
      
      user=fate
      dir=/home/work/fate/fate_1.9.0
      party_list=(10000 9999)
      party_ip_list=(192.168.1.1 192.168.1.2)
      serving_ip_list=(192.168.1.1 192.168.1.2)
      
      # Engines:
      # Computing : Eggroll, Spark, Spark_local
      computing=Spark
      # Federation: Eggroll(computing: Eggroll), Pulsar/RabbitMQ(computing: Spark/Spark_local)
      federation=RabbitMQ
      # Storage: Eggroll(computing: Eggroll), HDFS(computing: Spark), LocalFS(computing: Spark_local)
      storage=HDFS
      # Algorithm: Basic, NN
      algorithm=Basic
      # Device: IPCL, CPU
      device=CPU
      
      # spark and eggroll
      compute_core=4
      
      # default
      exchangeip=
      
      # modify if you are going to use an external db
      mysql_ip=mysql
      mysql_user=fate
      mysql_password=fate_dev
      mysql_db=fate_flow
      
      name_node=hdfs://namenode:9000
      
      # Define fateboard login information
      fateboard_username=admin
      fateboard_password=admin@2022
      
      # Define serving admin login information
      serving_admin_username=admin
      serving_admin_password=admin@2022
  3. Transfer the generated installation package to the corresponding server, extract it, and enter the confs-10000 directory

  4. Start party 10000's FATE services with: docker-compose up -d
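The steps above can be sketched as a shell session. Paths, hosts, and the archive name are from this report; the generate_config.sh script name is an assumption based on the KubeFATE docker-deploy directory:

```shell
# On the build host: unpack the v1.9.0 release
tar -xzf kubefate-docker-compose-v1.9.0.tar.gz
cd docker-deploy
# Edit .env and parties.conf as shown above, then generate the
# per-party packages (e.g. confs-10000.tar):
bash generate_config.sh

# On the party-10000 server: extract the package and start the services
tar -xf confs-10000.tar
cd confs-10000
docker-compose up -d
```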

What happened?

ERROR: Duplicate mount points: [/home/work/fate/fate_1.9.0/confs-10000/shared_dir/data/datanode:/hadoop/dfs/data:rw, /home/work/fate/fate_1.9.0/confs-10000/shared_dir/data/datanode-2:/hadoop/dfs/data:rw]

Additional context

  1. Attempt 1: delete the datanode-1 and datanode-2 services from the docker-compose.yml file and start again: docker-compose up -d

    1. A new error occurred:

      ERROR: client version 1.38 is too new. Maximum supported API version is 1.37
  2. Attempt 2: in docker-compose.yml, change version: "3.7" to version: "3"

    1. A new error occurred:

      ERROR: The Compose file './docker-compose.yml' is invalid because:services.fateflow.healthcheck value Additional properties are not allowed ('start_period' was unexpected)
    2. Deleting start_period: 40s from services.fateflow.healthcheck in docker-compose.yml produced another error:

      ERROR: Duplicate mount points: [/home/work/fate/fate_1.9.0/confs-10000/shared_dir/data/datanode:/hadoop/dfs/data:rw, /home/work/fate/fate_1.9.0/confs-10000/shared_dir/data/datanode-2:/hadoop/dfs/data:rw]
    3. After also deleting the datanode-1 and datanode-2 services from docker-compose.yml, the services started successfully.
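For context on attempt 2: start_period under healthcheck is only valid in compose file format 3.4 and later, which is why downgrading to version: "3" rejects it. A minimal sketch of the relevant fragment (the service name is from the error message; the health-check command itself is a placeholder, not the shipped one):

```yaml
version: "3.6"   # start_period requires file format >= 3.4
services:
  fateflow:
    healthcheck:
      test: ["CMD", "true"]   # placeholder check
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s       # rejected under version "3"
```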

JingChen23 commented 2 years ago
Transfer the generated installation package to the corresponding server, extract it, and enter the confs-10000 directory

Start party 10000's FATE services with: docker-compose up -d

Are you doing this manually? Our script https://github.com/FederatedAI/KubeFATE/blob/master/docker-deploy/docker_deploy.sh should take care of the other party.

Please check this https://github.com/FederatedAI/KubeFATE/blob/master/docker-deploy/README.md

What is your docker compose version? My verified one is:

11:14:04 root@example ~ → docker-compose version
docker-compose version 1.23.2, build 1110ad01
docker-py version: 3.6.0
CPython version: 3.6.7
OpenSSL version: OpenSSL 1.1.0f  25 May 2017

FranisiL commented 2 years ago

$ docker-compose version

docker-compose version 1.26.0, build d4451659
docker-py version: 4.2.1
CPython version: 3.7.7
OpenSSL version: OpenSSL 1.1.0l  10 Sep 2019

JingChen23 commented 2 years ago

ERROR: client version 1.38 is too new. Maximum supported API version is 1.37

https://stackoverflow.com/a/59387918/7262146
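The linked answer pins the docker client's API version through an environment variable instead of downgrading the compose file schema; a sketch of that workaround:

```shell
# Pin the client to the API version the (older) docker daemon supports,
# so the compose file can keep version "3.7" as shipped.
export DOCKER_API_VERSION=1.37
docker-compose up -d
```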

FranisiL commented 2 years ago

ERROR: client version 1.38 is too new. Maximum supported API version is 1.37

https://stackoverflow.com/a/59387918/7262146

Yes, changing version: "3.7" to version: "3.6" in docker-compose.yml resolved the ERROR: client version 1.38 is too new. Maximum supported API version is 1.37. This also suggests the deployment script may need to be modified. Although this problem is solved, other problems remain.

JingChen23 commented 2 years ago

This also proves that the deployment script may need to be modified.

I think documenting the minimum Docker version would work too.

FranisiL commented 2 years ago

I think documenting the minimum Docker version would work too.

This problem still exists:

ERROR: Duplicate mount points: [/home/work/fate/fate_1.9.0/confs-10000/shared_dir/data/datanode:/hadoop/dfs/data:rw, /home/work/fate/fate_1.9.0/confs-10000/shared_dir/data/datanode-2:/hadoop/dfs/data:rw]

JingChen23 commented 2 years ago

You need to clean up everything between two docker-compose deployments. You can check this comment: https://github.com/FederatedAI/KubeFATE/issues/752#issuecomment-1250478671
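A hedged sketch of that cleanup, assuming the confs-10000 layout shown in the error messages above (note: removing shared_dir destroys all persisted FATE data):

```shell
cd /home/work/fate/fate_1.9.0/confs-10000
# Stop and remove the previous deployment's containers and networks
docker-compose down
# Remove persisted state so stale data does not collide on the next run
rm -rf ./shared_dir
```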