HewlettPackard / swarm-learning

A simplified library for decentralized, privacy preserving machine learning
Apache License 2.0

MNIST example issues #178

Closed renepetermann289 closed 1 year ago

renepetermann289 commented 1 year ago

Issue description

I am working on a virtual Ubuntu 20.04.2 LTS server and trying to run the MNIST example, but the run does not succeed. Every time, after the user-env-tf image has been created, the run aborts with an error, and I cannot find the cause. It also creates a strange volume that I cannot explain and cannot do anything with. I also don't know why I have the hpe_eval repo as well as hpe. Does anyone have an idea what the problem could be?

SWCI logs

SWCI:0 > # Assumption : SWOP is already running
SWCI:0 > 
SWCI:0 > # SWCI context setup
SWCI:0 > EXIT ON FAILURE
SWCI:0 > EXIT ON FAILURE IS TURNED ON
SWCI:1 > wait for ip sn1
API Server is UP!
SWCI:2 > create context test-mnist with ip sn1
API Server is UP!
CONTEXT CREATED : test-mnist
/usr/lib/python3.8/site-packages/urllib3/connection.py:455: SubjectAltNameWarning: Certificate for sn1 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/urllib3/urllib3/issues/497 for details.) 
warnings.warn(
SWCI:3 > switch context test-mnist
DEFAULT CONTEXT SET TO : test-mnist
SWCI:4 > EXIT ON FAILURE OFF
SWCI:4 > EXIT ON FAILURE IS TURNED OFF
SWCI:5 > 
SWCI:5 > #Change to the directory where we are mounting the host
SWCI:5 > cd /platform/swarm/usr
SWCI:5 > Current Directory : /platform/swarm/usr
SWCI:6 > 
SWCI:6 > # Create and finalize build task
SWCI:6 > EXIT ON FAILURE
SWCI:6 > EXIT ON FAILURE IS TURNED ON
SWCI:7 > create task from taskdefs/user_env_tf_build_task.yaml
Task definition is valid
Task Registered : user_env_tf_build_task
Appending Task Body
batch start : 1 , len : 4 Successful
batch start : 5 , len : 4 Successful
batch start : 9 , len : 3 Successful
Task creation Successful
WARNING: Task should be finalized by user explicitly
SWCI:8 > finalize task user_env_tf_build_task
Task Finalized
SWCI:9 > get task info user_env_tf_build_task
NAME         : user_env_tf_build_task
TASKTYPE     : MAKE_USER_CONTAINER
CREATETIME   : 2023-06-20 10:20:33
AUTHOR       : HPESwarm
CONTENTLINES : 12
PREREQ       : ROOTTASK
OUTCOME      : user-env-tf2.7.0-swop
FINALIZED    : True
SWCI:10 > get task body user_env_tf_build_task
0000: ---
0001: BuildContext : sl-cli-lib
0002: BuildSteps   : 
0003:     - FROM tensorflow/tensorflow:2.7.0
0004:     -  
0005:     - RUN pip3 install --upgrade pip && pip3 install \
0006:     -    keras matplotlib opencv-python pandas protobuf==3.15.6 
0007:     -  
0008:     - RUN mkdir -p /tmp/hpe-swarmcli-pkg
0009:     - COPY swarmlearning-client-py3-none-manylinux_2_24_x86_64.whl /tmp/hpe-swarmcli-pkg/swarmlearning-client-py3-none-manylinux_2_24_x86_64.whl
0010:     - RUN pip3 install /tmp/hpe-swarmcli-pkg/swarmlearning-client-py3-none-manylinux_2_24_x86_64.whl
0011: BuildType : INLINE
SWCI:11 > list tasks
ROOTTASK
user_env_tf_build_task
SWCI:12 > EXIT ON FAILURE OFF
SWCI:12 > EXIT ON FAILURE IS TURNED OFF
SWCI:13 > 
SWCI:13 > # Assign build task to taskrunner
SWCI:13 > EXIT ON FAILURE
SWCI:13 > EXIT ON FAILURE IS TURNED ON
SWCI:14 > RESET TASKRUNNER defaulttaskbb.taskdb.sml.hpe
TaskRunner Reset
SWCI:15 > ASSIGN TASK user_env_tf_build_task TO defaulttaskbb.taskdb.sml.hpe WITH 2 PEERS
Task assigned to TaskRunner
SWCI:16 > WAIT FOR TASKRUNNER defaulttaskbb.taskdb.sml.hpe
WAITING FOR TASKRUNNER TO COMPLETE - Maximum wait time is : 120 mins
######################     
TASKRUNNER FINISHED
  STATE : COMPLETE
  TIME  : 2023-06-20 10:22:43
SWCI:17 > RESET TASKRUNNER defaulttaskbb.taskdb.sml.hpe
TaskRunner Reset
SWCI:18 > EXIT ON FAILURE OFF
SWCI:18 > EXIT ON FAILURE IS TURNED OFF
SWCI:19 > 
SWCI:19 > # Build task was already run. Now build and run swarm run tasks
SWCI:19 > 
SWCI:19 > # Create and finalize swarm run task
SWCI:19 > EXIT ON FAILURE
SWCI:19 > EXIT ON FAILURE IS TURNED ON
SWCI:20 > create task from taskdefs/swarm_mnist_task.yaml
Task definition is valid
Task Registered : swarm_mnist_task
Appending Task Body
batch start : 1 , len : 4 Successful
batch start : 5 , len : 4 Successful
batch start : 9 , len : 4 Successful
Task creation Successful
WARNING: Task should be finalized by user explicitly
SWCI:21 > finalize task swarm_mnist_task
Task Finalized
SWCI:22 > get task info swarm_mnist_task
NAME         : swarm_mnist_task
TASKTYPE     : RUN_SWARM
CREATETIME   : 2023-06-20 10:22:47
AUTHOR       : HPESwarm
CONTENTLINES : 13
PREREQ       : user_env_tf_build_task
OUTCOME      : swarm_mnist_task
FINALIZED    : True
SWCI:23 > get task body swarm_mnist_task
0000: ---
0001: Command : model/mnist_tf.py
0002: Entrypoint : python3
0003: WorkingDir : /tmp/test
0004: PrivateContent : /tmp/test/
0005: SharedContent : 
0006:   - Src   : /opt/hpe/swarm-learning/workspace/mnist/model
0007:     Tgt   : /tmp/test/model
0008:     MType : BIND
0009: Envvars : 
0010:   - MODEL_DIR : model
0011:   - MAX_EPOCHS : 2
0012:   - MIN_PEERS : 2
SWCI:24 > list tasks
ROOTTASK
user_env_tf_build_task
swarm_mnist_task
SWCI:25 > EXIT ON FAILURE OFF
SWCI:25 > EXIT ON FAILURE IS TURNED OFF
SWCI:26 > 
SWCI:26 > # Assign run task
SWCI:26 > EXIT ON FAILURE
SWCI:26 > EXIT ON FAILURE IS TURNED ON
SWCI:27 > RESET TASKRUNNER defaulttaskbb.taskdb.sml.hpe
TaskRunner Reset
SWCI:28 > ASSIGN TASK swarm_mnist_task TO defaulttaskbb.taskdb.sml.hpe WITH 2 PEERS
Task assigned to TaskRunner
SWCI:29 > WAIT FOR TASKRUNNER defaulttaskbb.taskdb.sml.hpe
WAITING FOR TASKRUNNER TO COMPLETE - Maximum wait time is : 120 mins
####     
TASKRUNNER FINISHED
  STATE : ERROR
  TIME  : 2023-06-20 10:23:25
SWCI:29 > ERROR : Task has failed, check TASKRUNNER PEER STATUS for Error description
SWCI:30 > EXIT ON ERROR
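
The transcript above is long, and the only failing step is the second WAIT FOR TASKRUNNER. A minimal sketch, assuming the transcript has been saved as plain text in the format shown here, that pairs each ASSIGN TASK command with the STATE line that follows it. The function name `taskrunner_outcomes` is hypothetical and not part of SWCI; it only reads text and does not talk to Swarm Learning.

```python
import re

# Matches the SWCI command that hands a task to a taskrunner.
ASSIGN_RE = re.compile(r"ASSIGN TASK (\S+) TO (\S+)", re.IGNORECASE)
# Matches the "  STATE : COMPLETE" / "  STATE : ERROR" line printed after a wait.
STATE_RE = re.compile(r"^\s*STATE\s*:\s*(\w+)")


def taskrunner_outcomes(transcript):
    """Return [(task, taskrunner, state)] in the order the tasks were assigned."""
    outcomes = []
    current = None  # (task, taskrunner) from the most recent ASSIGN TASK
    for line in transcript.splitlines():
        assign = ASSIGN_RE.search(line)
        if assign:
            current = (assign.group(1), assign.group(2))
            continue
        state = STATE_RE.match(line)
        if state and current is not None:
            outcomes.append((*current, state.group(1)))
            current = None
    return outcomes


if __name__ == "__main__":
    # Lines copied from the SWCI log in this issue.
    sample = "\n".join([
        "SWCI:15 > ASSIGN TASK user_env_tf_build_task TO defaulttaskbb.taskdb.sml.hpe WITH 2 PEERS",
        "Task assigned to TaskRunner",
        "TASKRUNNER FINISHED",
        "  STATE : COMPLETE",
        "SWCI:28 > ASSIGN TASK swarm_mnist_task TO defaulttaskbb.taskdb.sml.hpe WITH 2 PEERS",
        "Task assigned to TaskRunner",
        "TASKRUNNER FINISHED",
        "  STATE : ERROR",
    ])
    for task, runner, state in taskrunner_outcomes(sample):
        print(f"{task} on {runner}: {state}")
```

For this log it shows that the build task completed and only swarm_mnist_task ended in ERROR, so the failure is in the run step rather than in the image build.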

SWOP logs

swarm.swop : INFO : Installing collected packages: networkx, swarmlearning
swarm.swop : INFO : Successfully installed networkx-3.1 swarmlearning-1.2.0
swarm.swop : INFO : WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
swarm.swop : INFO : Removing intermediate container b0250e3bd960
swarm.swop : INFO :  ---> 5231a1f350f7
swarm.swop : INFO : ID: sha256:5231a1f350f720bc3700646507d0aa099208dd542a1b6ba85c0136213608d8dc
swarm.swop : INFO : Successfully built 5231a1f350f7
swarm.swop : INFO : Successfully tagged user-env-tf2.7.0-swop:latest
swarm.swop : INFO : SWOPBuildTask: build task completed
swarm.swop : INFO : SWOPBuildTask: Stopping Task
swarm.swop : INFO : SWOPExecutor: Total Tasks: 1 , Current Task : user_env_tf_build_task , opId : 10585997127626414346 Done
swarm.swop : INFO : SWOPExecutor : Ready for Task Execution!
swarm.swop : INFO : SWOPExecutor: ENROLL TO Task : swarm_mnist_task , opId : 11405451597466441051 - Begins
swarm.swop : INFO : SWOPExecutor: ENROLL TO Task : swarm_mnist_task , opId : 11405451597466441051 - Ends
swarm.swop : INFO : Extracted container id and image info from /tmp/container_info_file file
swarm.swop : INFO : SWOPRunTask: Stopping Task
swarm.swop : INFO : SWOPExecutor: Total Tasks: 2 , Current Task : swarm_mnist_task , opId : 11405451597466441051 Done
swarm.swop : INFO : SWOPExecutor : Ready for Task Execution!
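
Every SWOP line above is logged at INFO level, even though SWCI reported STATE : ERROR for the run task. A minimal sketch, assuming each line follows the `logger : LEVEL : message` layout seen in this excerpt, that counts lines per level so a log can be checked quickly for anything other than INFO. The helper names are hypothetical, and it is an assumption that other levels would appear in the same layout.

```python
import re
from collections import Counter

# "swarm.swop : INFO : message" -> logger, level, message
LINE_RE = re.compile(r"^(?P<logger>\S+)\s*:\s*(?P<level>[A-Z]+)\s*:\s*(?P<message>.*)$")


def parse_swop_line(line):
    """Split one log line into (logger, level, message), or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return match["logger"], match["level"], match["message"]


def level_counts(log_text):
    """Count how many parsed lines were logged at each level."""
    counts = Counter()
    for line in log_text.splitlines():
        parsed = parse_swop_line(line)
        if parsed is not None:
            counts[parsed[1]] += 1
    return counts


if __name__ == "__main__":
    # Lines copied from the SWOP log in this issue.
    sample = "\n".join([
        "swarm.swop : INFO : SWOPBuildTask: build task completed",
        "swarm.swop : INFO : WARNING: Running pip as the 'root' user can result in broken permissions",
        "swarm.swop : INFO : SWOPRunTask: Stopping Task",
    ])
    print(dict(level_counts(sample)))
```

Note that the pip warning is counted as INFO because SWOP relays the build output at that level. If only INFO shows up, the SWOP log alone probably does not contain the failure reason.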

Repos

user-env-tf2.7.0-swop                                          latest    fca57948f407   20 minutes ago   1.74GB
hello-world                                                    latest    9c7a54a9a43c   6 weeks ago      13.3kB
hub.myenterpriselicense.hpe.com/hpe/swarm-learning/sn          2.0.0     3a8d96c5a618   2 months ago     1.32GB
hub.myenterpriselicense.hpe.com/hpe/swarm-learning/swop        2.0.0     832fa5593648   2 months ago     995MB
hub.myenterpriselicense.hpe.com/hpe/swarm-learning/swci        2.0.0     cfae4ad6f140   2 months ago     1.11GB
hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning/sn     1.2.0     cdd30100a28a   6 months ago     1.25GB
hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning/sl     1.2.0     22c827268131   6 months ago     1.22GB
hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning/swci   1.2.0     01d14d886b16   6 months ago     1.11GB
hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning/swop   1.2.0     228c57a1a270   6 months ago     992MB
tensorflow/tensorflow                                          2.7.0     b51f642475ab   19 months ago    1.31GB
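
The listing above contains Swarm Learning images from two registry namespaces, hpe (tag 2.0.0) and hpe_eval (tag 1.2.0). A minimal sketch, assuming the listing is pasted as text in the column layout shown here, that groups the images by component to show which versions are present for each one. The function name `group_swarm_images` is hypothetical, and whether the mix of versions is related to the failure is not established in this thread.

```python
from collections import defaultdict


def group_swarm_images(listing):
    """Map component name (sn, sl, swop, swci) to a list of (namespace, tag)."""
    groups = defaultdict(list)
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        repository, tag = fields[0], fields[1]
        parts = repository.split("/")
        # Expect <registry>/<namespace>/swarm-learning/<component>.
        if len(parts) < 4 or parts[-2] != "swarm-learning":
            continue
        namespace, component = parts[1], parts[-1]
        groups[component].append((namespace, tag))
    return dict(groups)


if __name__ == "__main__":
    # Lines copied from the image listing in this issue.
    listing = "\n".join([
        "hub.myenterpriselicense.hpe.com/hpe/swarm-learning/sn          2.0.0     3a8d96c5a618   2 months ago     1.32GB",
        "hub.myenterpriselicense.hpe.com/hpe/swarm-learning/swop        2.0.0     832fa5593648   2 months ago     995MB",
        "hub.myenterpriselicense.hpe.com/hpe/swarm-learning/swci        2.0.0     cfae4ad6f140   2 months ago     1.11GB",
        "hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning/sn     1.2.0     cdd30100a28a   6 months ago     1.25GB",
        "hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning/sl     1.2.0     22c827268131   6 months ago     1.22GB",
        "hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning/swci   1.2.0     01d14d886b16   6 months ago     1.11GB",
        "hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning/swop   1.2.0     228c57a1a270   6 months ago     992MB",
    ])
    for component, versions in sorted(group_swarm_images(listing).items()):
        print(component, versions)
```

Applied to this listing, sn, swop and swci each appear under both namespaces, while sl appears only as hpe_eval 1.2.0.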

Volumes

local     2fb992160a45b01ffda7371085e8d9024bac0cc542303f025097c7e7b2a92924
local     sl-cli-lib

iArpanPatel commented 1 year ago

Hi, please provide the SL and ML logs as well. Refer to the Swarm Learning Log Collector to collect the logs.

renepetermann289 commented 1 year ago

Hey, I think I solved it. It was a problem with the mnist_tf.py file. I got a "saved_model.pb" at the end without an error message.

a1847979164 commented 1 year ago

Hi, can I ask why my image pull failed? And which HPE products do I need to buy so that I can continue to run the program?

iArpanPatel commented 1 year ago

@a1847979164 please open a new issue with full details about the error.

a1847979164 commented 1 year ago

I have an issue: Unable to find image 'hub.myenterpriselicense.hpe.com/hpe/swarm-learning/sn:2.0.0' locally 2.0.0: Pulling from hpe/swarm-learning/sn. Also, I did not download apls-9.14.zip.

iArpanPatel commented 1 year ago

Closing this issue as the author @renepetermann289 has resolved the issue.

Elenmu commented 11 months ago

Hey, I think I solved it. It was a problem with the mnist_tf.py file. I got a "saved_model.pb" at the end without an error message.

How did you solve the problem? What exactly should be done? I may have the same problem.