FederatedAI / KubeFATE

Manage federated learning workload using cloud native technologies.

Running the SecureBoost pipeline gets stuck and never finishes #899

Open huocun-ant opened 1 year ago

huocun-ant commented 1 year ago

What deployment mode are you using?

  1. docker-compose;

What KubeFATE and FATE version are you using? kubefate-docker-compose-v1.11.1.tar.gz


What OS are you using for docker-compose or Kubernetes? Please also state the OS version. CentOS 7


To Reproduce

Steps to reproduce:

  1. Deploy two parties with docker-compose: party 9999 and party 10000.
  2. Upload the data on each party separately (a quick way to verify the uploads is sketched after the training script below):
from pipeline.backend.pipeline import PipeLine

pipeline_upload = PipeLine().set_initiator(role='host', party_id=10000).set_roles(guest=10000)

partition = 4

dense_data_guest = {"name": "breast_hetero_guest", "namespace": f"experiment"}
dense_data_host = {"name": "breast_hetero_host", "namespace": f"experiment"}

import os

data_base = "/data/projects/fate/"

pipeline_upload.add_upload_data(file=os.path.join(data_base, "examples/data/breast_hetero_host.csv"),
                                table_name=dense_data_host["name"],
                                namespace=dense_data_host["namespace"],
                                head=1, partition=partition)

pipeline_upload.upload(drop=1)
from pipeline.backend.pipeline import PipeLine

pipeline_upload = PipeLine().set_initiator(role='guest', party_id=9999).set_roles(guest=9999)

partition = 4

dense_data_guest = {"name": "breast_hetero_guest", "namespace": f"experiment"}
dense_data_host = {"name": "breast_hetero_host", "namespace": f"experiment"}

import os

data_base = "/data/projects/fate/"
pipeline_upload.add_upload_data(file=os.path.join(data_base, "examples/data/breast_hetero_guest.csv"),
                                table_name=dense_data_guest["name"],             # table name
                                namespace=dense_data_guest["namespace"],         # namespace
                                head=1, partition=partition)               # data info

pipeline_upload.upload(drop=1)
  3. With party 9999 as the guest, run the training script:
from pipeline.backend.pipeline import PipeLine
from pipeline.component import DataTransform
from pipeline.component import HeteroSecureBoost
from pipeline.component import Intersection
from pipeline.component import Reader
from pipeline.interface import Data
from pipeline.component import Evaluation
from pipeline.interface import Model
from pipeline.runtime.entity import JobParameters

job_parameters = JobParameters(task_cores=16, task_parallelism=1, computing_partitions=8)

namespace = ''
guest=9999
host=10000

# data sets
guest_train_data = {"name": "breast_hetero_guest", "namespace": f"experiment{namespace}"}
host_train_data = {"name": "breast_hetero_host", "namespace": f"experiment{namespace}"}

guest_validate_data = {"name": "breast_hetero_guest", "namespace": f"experiment{namespace}"}
host_validate_data = {"name": "breast_hetero_host", "namespace": f"experiment{namespace}"}

# init pipeline
pipeline = PipeLine().set_initiator(role="guest", party_id=guest).set_roles(guest=guest, host=host,)

# set data reader and data-io

reader_0, reader_1 = Reader(name="reader_0"), Reader(name="reader_1")
reader_0.get_party_instance(role="guest", party_id=guest).component_param(table=guest_train_data)
reader_0.get_party_instance(role="host", party_id=host).component_param(table=host_train_data)
reader_1.get_party_instance(role="guest", party_id=guest).component_param(table=guest_validate_data)
reader_1.get_party_instance(role="host", party_id=host).component_param(table=host_validate_data)

data_transform_0, data_transform_1 = DataTransform(name="data_transform_0"), DataTransform(name="data_transform_1")

data_transform_0.get_party_instance(
    role="guest", party_id=guest).component_param(
    with_label=True, output_format="dense")
data_transform_0.get_party_instance(role="host", party_id=host).component_param(with_label=False)
data_transform_1.get_party_instance(
    role="guest", party_id=guest).component_param(
    with_label=True, output_format="dense")
data_transform_1.get_party_instance(role="host", party_id=host).component_param(with_label=False)

# data intersect component
intersect_0 = Intersection(name="intersection_0")
intersect_1 = Intersection(name="intersection_1")

# secure boost component
hetero_secure_boost_0 = HeteroSecureBoost(name="hetero_secure_boost_0",
                                          num_trees=3,
                                          task_type="classification",
                                          objective_param={"objective": "cross_entropy"},
                                          encrypt_param={"method": "Paillier"},
                                          tree_param={"max_depth": 3},
                                          complete_secure=True,
                                          validation_freqs=1)

# evaluation component
evaluation_0 = Evaluation(name="evaluation_0", eval_type="binary")

pipeline.add_component(reader_0)
pipeline.add_component(reader_1)
pipeline.add_component(data_transform_0, data=Data(data=reader_0.output.data))
pipeline.add_component(
    data_transform_1, data=Data(
        data=reader_1.output.data), model=Model(
        data_transform_0.output.model))
pipeline.add_component(intersect_0, data=Data(data=data_transform_0.output.data))
pipeline.add_component(intersect_1, data=Data(data=data_transform_1.output.data))
pipeline.add_component(hetero_secure_boost_0, data=Data(train_data=intersect_0.output.data,
                                                        validate_data=intersect_1.output.data))
pipeline.add_component(evaluation_0, data=Data(data=hetero_secure_boost_0.output.data))

pipeline.compile()
pipeline.fit(job_parameters)

print("fitting hetero secureboost done, result:")
print(pipeline.get_component("hetero_secure_boost_0").get_summary())
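
For reference, before submitting the training job it can help to confirm that both tables were actually registered by the uploads above; a minimal sketch, assuming fate_client's flow_sdk is installed on each party and that its FlowClient/table API matches this usage (the address and port are deployment-specific):

# Sketch: check that the uploaded tables are registered in FATE-Flow before training.
# Assumes flow_sdk from fate_client, and FATE-Flow listening on 127.0.0.1:9380
# inside the party's container; adjust the address/port to your deployment.
from flow_sdk.client import FlowClient

client = FlowClient(ip="127.0.0.1", port=9380, version="v1")

for table_name in ("breast_hetero_guest", "breast_hetero_host"):
    info = client.table.info(table_name=table_name, namespace="experiment")
    print(table_name, info)  # the owning party should report a non-zero record count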

What happened?

The script keeps running and never stops. The logs show:

  1. The last line printed by the guest's hetero_secure_boost_0:
    [INFO] [2023-06-28 10:33:48,033] [202306281032286493510] [27039:140568298440512] - [hetero_boosting.sync_stop_flag] [line:116]: sync stop flag to host, boosting_core round is 2
  2. The last line printed by the host's hetero_secure_boost_0:
    [INFO] [2023-06-28 10:33:48,084] [202306281032286493510] [1733:140250989434688] - [hetero_secureboost_host.predict] [line:192]: running prediction

Screenshots: three screenshots were attached (not reproduced here).
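
For reference, these two log lines appear to show the guest syncing the stop flag for boosting round 2 while the host is still inside a prediction step, so one way to narrow the hang down is to resubmit the job without per-round validation; a minimal diagnostic sketch that only changes the HeteroSecureBoost component and its wiring in the script above (a guess at where it hangs, not a fix):

# Diagnostic sketch: same SecureBoost settings as above, but without
# validation_freqs/validate_data, to check whether the hang is tied to the
# validation (predict) phase that appears in the host log.
hetero_secure_boost_0 = HeteroSecureBoost(name="hetero_secure_boost_0",
                                          num_trees=3,
                                          task_type="classification",
                                          objective_param={"objective": "cross_entropy"},
                                          encrypt_param={"method": "Paillier"},
                                          tree_param={"max_depth": 3},
                                          complete_secure=True)
pipeline.add_component(hetero_secure_boost_0,
                       data=Data(train_data=intersect_0.output.data))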



owlet42 commented 1 year ago

I ran your task on my own environment and it ran without issue.

(Screenshot of the successful run attached.)

Could you check whether your environment has enough CPU and memory resources, and whether compute_core=16 was set to an appropriate value when deploying?
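
For reference, a quick way to check the first part from inside the party's FATE container (the exact container name depends on the docker-compose deployment and is only illustrative here):

# Quick resource check, meant to be run inside the party's FATE container,
# e.g. via `docker exec -it <fateflow container> python3`.
import os

requested_task_cores = 16  # the value passed via JobParameters(task_cores=16)
visible_cpus = os.cpu_count()
print(f"visible CPUs: {visible_cpus}, requested task_cores: {requested_task_cores}")

# Rough memory check via /proc/meminfo (Linux only).
with open("/proc/meminfo") as f:
    meminfo = {key.strip(): value.strip()
               for key, value in (line.split(":", 1) for line in f if ":" in line)}
print("MemTotal:", meminfo.get("MemTotal"), "| MemAvailable:", meminfo.get("MemAvailable"))

if visible_cpus is not None and visible_cpus < requested_task_cores:
    print("task_cores exceeds the CPUs visible here; the job may queue or stall waiting for resources")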

huocun-ant commented 1 year ago

Could you check whether your environment has enough CPU and memory resources, and whether compute_core=16 was set to an appropriate value when deploying?

Yes, both of those are correct.

Which version are you running? I reinstalled v1.11.1 and the problem still exists. I will try other versions.

owlet42 commented 1 year ago

I am using v1.11.1. Please check the detailed job log in fateflow and see whether there are problems such as OOM.
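
For reference, a minimal sketch for pulling the job status and the full job log out of FATE-Flow, assuming fate_client's flow_sdk is installed and that its FlowClient/job API matches this usage; the job id below is the one printed when pipeline.fit() submits the job (e.g. 202306281032286493510 in the logs above):

# Sketch: query the stuck job and download its full logs to look for OOM or
# other task failures. Assumes flow_sdk from fate_client and FATE-Flow
# listening on 127.0.0.1:9380; adjust the address/port to your deployment.
from flow_sdk.client import FlowClient

client = FlowClient(ip="127.0.0.1", port=9380, version="v1")

job_id = "202306281032286493510"  # replace with the job id of the stuck run

print(client.job.query(job_id=job_id))                    # per-party job/task status
client.job.log(job_id=job_id, output_path="./job_logs")   # full logs for inspection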