FederatedAI / KubeFATE

Manage federated learning workload using cloud native technologies.
Apache License 2.0

ModuleNotFoundError: No module named 'federatedml' with docker-deploy #863

Open 0kuang opened 1 year ago

0kuang commented 1 year ago

I deployed FATE following the guide "Deploying FATE with Docker Compose" (使用Docker Compose 部署 FATE).

After deployment, I use the following command to enter the client container: docker exec -it confs-10000_client_1 bash

But when executing ./examples/benchmark_quality/homo_nn/fate-homo_nn.py the following error was reported:

Traceback (most recent call last):
  File "./fate-homo_nn.py", line 25, in <module>
    from federatedml.evaluation.metrics import classification_metric
ModuleNotFoundError: No module named 'federatedml'

How do I import the federatedml package in the client container?

Also, I am a beginner and not familiar with the FATE framework. With a Docker deployment, how can I use Python or Jupyter to develop federated learning code (for example, running the ResNet example or building a custom dataset) instead of using the flow command?

Thanks!

zhihuiwan commented 1 year ago

The environment needs to be loaded before use:

source /data/projects/fate/bin/init_env.sh

0kuang commented 1 year ago
root@bf1b603f8015:/data/projects/fate# cd bin
bash: cd: bin: No such file or directory

My FATE version is v1.10.0

It seems that there is no such script.

owlet42 commented 1 year ago

I ran a test and got the same error. This looks like a bug in the client image: the examples were not fully tested against it, and dependent packages such as federatedml and fate_test are not included.

0kuang commented 1 year ago

How can I install these two packages manually?

zhihuiwan commented 1 year ago

You can try to set pythonpath and run it:

export PYTHONPATH=/data/projects/fate/fate/python

0kuang commented 1 year ago
root@ff9d37a0afb0:/data/projects/fate# cd /data/projects/fate/fate/python
bash: cd: /data/projects/fate/fate/python: No such file or directory

It seems that in the client container, the federatedml & python related folders are missing.
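
Before changing anything, it can help to confirm from inside the container what Python can actually see. The following is a minimal sketch using only the standard library; the helper name `describe_module` is made up for illustration and is not part of FATE.

```python
import importlib.util
import sys


def describe_module(name):
    """Return a short report on whether `name` can be imported and from where."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        spec = None
    if spec is None:
        return f"{name}: not found on sys.path"
    # Namespace packages have no origin, so fall back to their search locations.
    location = spec.origin or ", ".join(spec.submodule_search_locations or [])
    return f"{name}: found at {location}"


if __name__ == "__main__":
    # The two packages reported missing earlier in this thread.
    for package in ("federatedml", "fate_test"):
        print(describe_module(package))
    print("sys.path entries:")
    for entry in sys.path:
        print("  ", entry)
```

If both packages are reported as not found and none of the listed `sys.path` entries exist on disk, the packages are absent from the image rather than merely missing from `PYTHONPATH`.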

0kuang commented 1 year ago

@owlet42

Sorry to bother you again. Is there a way for me to manually install federatedml? I hope to continue my studies.

Thanks.

owlet42 commented 1 year ago

@0kuang

A simple way is to add a volume mount for federatedml, and add the federatedml path to the PYTHONPATH environment variable.

[Image]

After I tried it, I found that there are other dependencies that need to be resolved.
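
The `PYTHONPATH` half of this suggestion can also be applied from inside a script or notebook, which is convenient when the container's environment cannot easily be changed. This is a sketch under assumptions: `add_package_root` is a hypothetical helper, `FATE_PYTHON_ROOT` is an invented environment variable, and the default path is the one suggested earlier in this thread, which only exists if a volume is mounted there.

```python
import importlib.util
import os
import sys


def add_package_root(root):
    """Prepend `root` to sys.path if it is a directory and not already listed."""
    root = os.path.abspath(root)
    if not os.path.isdir(root):
        raise FileNotFoundError(f"{root} does not exist in this container")
    if root not in sys.path:
        sys.path.insert(0, root)
    return root


if __name__ == "__main__":
    # Hypothetical: FATE_PYTHON_ROOT is not a FATE setting, just a convenience here.
    mounted_root = os.environ.get("FATE_PYTHON_ROOT", "/data/projects/fate/fate/python")
    try:
        add_package_root(mounted_root)
    except FileNotFoundError as err:
        print(err)
    else:
        found = importlib.util.find_spec("federatedml") is not None
        print("federatedml importable:", found)
```

This only affects the current Python process. Other dependencies, as noted above, still have to be resolved separately.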

0kuang commented 1 year ago

I solved the dependency problem as you said:

  1. Set the PYTHONPATH
  2. Clone the code of the missing packages from the GitHub repo
  3. Copy a service_conf.yaml

Now I have a new problem: an error occurs when executing pipeline.fit():

ValueError: job submit failed, err msg: {'jobId': '202303062227458326320', 'retcode': 103, 'retmsg': 'Traceback (most recent call last):
  File "/data/projects/fate/fateflow/python/fate_flow/scheduler/dag_scheduler.py", line 142, in submit
    raise Exception("create job failed", response)
Exception: (\'create job failed\', {\'guest\': {9999: {\'retcode\': <RetCode.FEDERATED_ERROR: 104>, \'retmsg\': \'Federated schedule error, <_InactiveRpcError of RPC that terminated with:\
\\tstatus = StatusCode.UNKNOWN\
\\tdetails = "\
[Roll Site Error TransInfo] \
 location msg=java.lang.String cannot be cast to java.lang.Integer \
 stack info=java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer\
\\tat scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)\
\\tat com.webank.eggroll.rollsite.Router$.query(Router.scala:80)\
\\tat com.webank.eggroll.rollsite.EggSiteServicer.unaryCall(EggSiteServicer.scala:80)\
\\tat com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$MethodHandlers.invoke(DataTransferServiceGrpc.java:406)\
\\tat io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:180)\
\\tat io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)\
\\tat io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)\
\\tat io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)\
\\tat io.grpc.Contexts$ContextualizedServerCallListener.onHalfClose(Contexts.java:86)\
\\tat io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)\
\\tat io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:814)\
\\tat io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)\
\\tat io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)\
\\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\
\\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\
\\tat java.lang.Thread.run(Thread.java:750)\
 \
\
exception trans path: rollsite(10000)"\
\\tdebug_error_string = "{"created":"@1678112871.934791845","description":"Error received from peer ipv4:192.167.0.5:9370","file":"src/core/lib/surface/call.cc","file_line":952,"grpc_message":"\\\
[Roll Site Error TransInfo] \\\
 location msg=java.lang.String cannot be cast to java.lang.Integer \\\
 stack info=java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer\\\
\\\\tat scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)\\\
\\\\tat com.webank.eggroll.rollsite.Router$.query(Router.scala:80)\\\
\\\\tat com.webank.eggroll.rollsite.EggSiteServicer.unaryCall(EggSiteServicer.scala:80)\\\
\\\\tat com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$MethodHandlers.invoke(DataTransferServiceGrpc.java:406)\\\
\\\\tat io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:180)\\\
\\\\tat io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)\\\
\\\\tat io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)\\\
\\\\tat io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)\\\
\\\\tat io.grpc.Contexts$ContextualizedServerCallListener.onHalfClose(Contexts.java:86)\\\
\\\\tat io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)\\\
\\\\tat io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:814)\\\
\\\\tat io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)\\\
\\\\tat io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)\\\
\\\\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\\\
\\\\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\\\
\\\\tat java.lang.Thread.run(Thread.java:750)\\\
 \\\
\\\
exception trans path: rollsite(10000)","grpc_status":2}"\
>\'}}, \'host\': {10000: {\'data\': {\'components\': {\'eval_0\': {\'need_run\': True}, \'nn_0\': {\'need_run\': True}, \'reader_0\': {\'need_run\': True}, \'reader_1\': {\'need_run\': True}}}, \'retcode\': 0, \'retmsg\': \'success\'}}, \'arbiter\': {10000: {\'data\': {\'components\': {\'eval_0\': {\'need_run\': True}, \'nn_0\': {\'need_run\': True}, \'reader_0\': {\'need_run\': False}, \'reader_1\': {\'need_run\': False}}}, \'retcode\': 0, \'retmsg\': \'success\'}}})
'}

I think the key lies in the rollsite error. I'm not sure whether it helps you diagnose the problem.

# key
exception trans path: rollsite(10000)"\
\\tdebug_error_string = "{"created":"@1678112871.934791845","description":"Error received from peer ipv4:192.167.0.5:9370","file":"src/core/lib/surface/call.cc","file_line":952,"grpc_message":"\\\
[Roll Site Error TransInfo] \\\
 location msg=java.lang.String cannot be cast to java.lang.Integer \\\
 stack info=java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer\\\

Thank you for your reply.
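
Picking the useful lines out of an error message this long is easier with a small script. The sketch below is illustrative only: the field labels are copied from the log above, while `summarize_rollsite_error` and the keys `cause`, `path`, and `peer` are invented names. It extracts the text and does not diagnose the underlying problem.

```python
import re

# Label as it appears in the log -> key used in the returned summary.
FIELDS = {
    "location msg=": "cause",
    "exception trans path: ": "path",
    "Error received from peer ": "peer",
}


def summarize_rollsite_error(message):
    """Return the first value of each known field found in `message`."""
    summary = {}
    for label, key in FIELDS.items():
        # A value runs until a quote, a backslash, or the end of the line.
        match = re.search(re.escape(label) + r'([^"\\\n]+)', message)
        if match:
            summary[key] = match.group(1).strip()
    return summary


if __name__ == "__main__":
    # A shortened excerpt of the message quoted above.
    sample = (
        'exception trans path: rollsite(10000)"\\\n'
        '\\tdebug_error_string = "{"description":"Error received from peer '
        'ipv4:192.167.0.5:9370","grpc_message":"\\\n'
        " location msg=java.lang.String cannot be cast to java.lang.Integer \\\n"
    )
    for key, value in summarize_rollsite_error(sample).items():
        print(f"{key}: {value}")
```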

owlet42 commented 1 year ago

Please make sure that all components of your FATE are working properly and can complete unilateral and multilateral toy tests.

flow test toy -gid 9999 -hid 9999    # unilateral
flow test toy -gid 9999 -hid 10000   # multilateral
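
The two checks above can be scripted so they always run in the same order. This is a minimal sketch that only wraps the commands shown above; the function names are invented, and it makes no assumption about the CLI's exit codes, so the printed output still needs to be read.

```python
import shutil
import subprocess


def toy_test_command(guest_id, host_id):
    """Build the argument list for `flow test toy` between guest and host."""
    return ["flow", "test", "toy", "-gid", str(guest_id), "-hid", str(host_id)]


def toy_test_plan(guest_id, host_id):
    """Unilateral run first (the guest is its own host), then the multilateral run."""
    return [
        toy_test_command(guest_id, guest_id),
        toy_test_command(guest_id, host_id),
    ]


if __name__ == "__main__":
    for argv in toy_test_plan(9999, 10000):
        print("$", " ".join(argv))
        if shutil.which("flow") is None:
            print("  flow CLI not found on PATH; skipping")
            continue
        subprocess.run(argv, check=False)
```
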

0kuang commented 1 year ago

I can now run the example code for Resnet with homo-nn correctly.

I would like to ask how to use GPU to accelerate training in FATE deployed by docker. Do you have any recommended tutorials?

In addition, which container does a task submitted through Jupyter on confs-10000_client_1 eventually run on?

Thanks for your answer.

owlet42 commented 1 year ago

GPU deployment is currently not supported. FATE tasks mainly run in fateflow; for the detailed process, see https://federatedai.github.io/FATE-Flow/latest/fate_flow/

0kuang commented 1 year ago

Which deployment method supports GPU?

The FedAvgTrainer in the FATE framework supports cuda=True. Is this parameter useful?

owlet42 commented 1 year ago

FedAvgTrainer has this configuration, and you can try setting cuda=True to use GPU.
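
Setting cuda=True can only help if PyTorch can see a GPU in the environment where the training task actually runs (the earlier reply points to fateflow). A quick way to check is sketched below; `cuda_status` is a made-up helper that uses standard PyTorch calls, and it says nothing about how FedAvgTrainer itself handles the flag.

```python
def cuda_status():
    """Describe whether PyTorch in this environment can see a CUDA device."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"
    if torch.cuda.is_available():
        count = torch.cuda.device_count()
        name = torch.cuda.get_device_name(0)
        return f"CUDA available: {count} device(s), first is {name}"
    return "PyTorch is installed but no CUDA device is visible"


if __name__ == "__main__":
    print(cuda_status())
```

If this reports that no CUDA device is visible, the container was probably started without GPU access, and the flag alone is unlikely to change where training runs.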