HewlettPackard / swarm-learning

A simplified library for decentralized, privacy preserving machine learning
Apache License 2.0

Error: Unable to extract container id (with cgroup v1 on CentOS 8) #114

Closed maestro4 closed 2 years ago

maestro4 commented 2 years ago

Issue description

/usr/lib/python3.8/site-packages/urllib3/connection.py:460: SubjectAltNameWarning: Certificate for has no subjectAltName, falling back to check for a commonName for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/urllib3/urllib3/issues/497 for details.)
warnings.warn(
2022-07-28 12:51:39,838 : swarm.swop : INFO : SWOPExecutor: ENROLL TO Task : user_env_pyt_build_task , opId : 9632004996807340828 - Begins
2022-07-28 12:51:42,856 : swarm.swop : INFO : SWOPExecutor: ENROLL TO Task : user_env_pyt_build_task , opId : 9632004996807340828 - Ends
2022-07-28 12:51:48,884 : swarm.swop : INFO : SWOPBuildTask: Validating profile
2022-07-28 12:51:55,063 : swarm.swop : ERROR : Unable to extract container id
2022-07-28 12:51:58,078 : swarm.swop : INFO : SWOPExecutor: Total Tasks: 1 , Current Task : user_env_pyt_build_task , opId : 9632004996807340828 Done
2022-07-28 12:52:24,177 : swarm.swop : INFO : SWOPExecutor: ENROLL TO Task : swarm_mnist_task , opId : 11113303237863304723 - Begins
2022-07-28 12:52:27,196 : swarm.swop : INFO : SWOPExecutor: ENROLL TO Task : swarm_mnist_task , opId : 11113303237863304723 - Ends
2022-07-28 12:52:30,382 : swarm.swop : INFO : SWOPExecutor: Total Tasks: 2 , Current Task : swarm_mnist_task , opId : 11113303237863304723 Done


SWCI:

SWCI:0 > ######################################################################
SWCI:0 > # (C)Copyright 2021,2022 Hewlett Packard Enterprise Development LP
SWCI:0 > ######################################################################
SWCI:0 >
SWCI:0 > # Assumption : SWOP is already running
SWCI:0 >
SWCI:0 > # SWCI context setup
SWCI:0 > EXIT ON FAILURE
SWCI:0 > EXIT ON FAILURE IS TURNED ON
SWCI:1 > wait for API Server is UP!
SWCI:2 > create context test-mnist
API Server is UP!
CONTEXT CREATED : test-mnist
/usr/lib/python3.8/site-packages/urllib3/connection.py:455: SubjectAltNameWarning: Certificate for has no subjectAltName, falling back to check for a commonName for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/urllib3/urllib3/issues/497 for details.)
warnings.warn(
SWCI:3 > switch context test-mnist
DEFAULT CONTEXT SET TO : test-mnist
SWCI:4 > EXIT ON FAILURE OFF
SWCI:4 > EXIT ON FAILURE IS TURNED OFF
SWCI:5 >
SWCI:5 > #Change to the directory where we are mounting the host
SWCI:5 > cd /platform/swarm/usr
SWCI:5 > Current Directory : /platform/swarm/usr
SWCI:6 >
SWCI:6 > # Create and finalize build task
SWCI:6 > EXIT ON FAILURE
SWCI:6 > EXIT ON FAILURE IS TURNED ON
SWCI:7 > create task from taskdefs/user_env_pyt_build_task.yaml
Task definition is valid
Task Registered : user_env_pyt_build_task
Appending Task Body
batch start : 1 , len : 4 Successful
batch start : 5 , len : 4 Successful
batch start : 9 , len : 4 Successful
batch start : 13 , len : 4 Successful
batch start : 17 , len : 1 Successful
Task creation Successful
WARNING: Task should be finalized by user explicitly
SWCI:8 > finalize task user_env_pyt_build_task
Task Finalized
SWCI:9 > get task info user_env_pyt_build_task
NAME : user_env_pyt_build_task
TASKTYPE : MAKE_USER_CONTAINER
CREATETIME : 2022-07-28 12:51:12
AUTHOR : HPESwarm
CONTENTLINES : 18
PREREQ : ROOTTASK
OUTCOME : user-env-pyt1.5-swop
FINALIZED : True
SWCI:10 > get task body user_env_pyt_build_task
0000: ---
0001: BuildContext : sl-cli-lib
0002: BuildSteps :
0003: - FROM docker.io/pytorch/pytorch:1.5-cuda10.1-cudnn7-runtime
0004: -
0005: - RUN apt-get update && apt-get install \
0006: - build-essential python3-dev python3-pip \
0007: - python3-setuptools --no-install-recommends -y
0008: -
0009: - RUN conda install pip ruamel.yaml
0010: -
0011: - RUN pip3 install --upgrade pip protobuf && pip3 install \
0012: - matplotlib opencv-python pandas sklearn future
0013: -
0014: - RUN mkdir -p /tmp/hpe-swarmcli-pkg
0015: - COPY swarmlearning-client-py3-none-manylinux_2_24_x86_64.whl /tmp/hpe-swarmcli-pkg/swarmlearning-client-py3-none-manylinux_2_24_x86_64.whl
0016: - RUN pip3 install /tmp/hpe-swarmcli-pkg/swarmlearning-client-py3-none-manylinux_2_24_x86_64.whl
0017: BuildType : INLINE
SWCI:11 > list tasks
ROOTTASK
user_env_pyt_build_task
SWCI:12 > EXIT ON FAILURE OFF
SWCI:12 > EXIT ON FAILURE IS TURNED OFF
SWCI:13 >
SWCI:13 > # Assign build task to taskrunner
SWCI:13 > EXIT ON FAILURE
SWCI:13 > EXIT ON FAILURE IS TURNED ON
SWCI:14 > RESET TASKRUNNER defaulttaskbb.taskdb.sml.hpe
TaskRunner Reset
SWCI:15 > ASSIGN TASK user_env_pyt_build_task TO defaulttaskbb.taskdb.sml.hpe WITH 2 PEERS
Task assigned to TaskRunner
SWCI:16 > WAIT FOR TASKRUNNER defaulttaskbb.taskdb.sml.hpe
WAITING FOR TASKRUNNER TO COMPLETE
WAITING FOR TASKRUNNER TO COMPLETE
WAITING FOR TASKRUNNER TO COMPLETE
WAITING FOR TASKRUNNER TO COMPLETE
TASKRUNNER FINISHED STATE : ERROR TIME : 2022-07-28 12:51:55
SWCI:17 > EXIT ON FAILURE OFF
SWCI:17 > EXIT ON FAILURE IS TURNED OFF
SWCI:18 >
SWCI:18 > # Build task was already run. Now build and run swarm run tasks
SWCI:18 >
SWCI:18 > # Create and finalize swarm run task
SWCI:18 > EXIT ON FAILURE
SWCI:18 > EXIT ON FAILURE IS TURNED ON
SWCI:19 > create task from taskdefs/swarm_mnist_task.yaml
Task definition is valid
Task Registered : swarm_mnist_task
Appending Task Body
batch start : 1 , len : 4 Successful
batch start : 5 , len : 4 Successful
batch start : 9 , len : 4 Successful
batch start : 13 , len : 2 Successful
Task creation Successful
WARNING: Task should be finalized by user explicitly
SWCI:20 > finalize task swarm_mnist_task
Task Finalized
SWCI:21 > get task info swarm_mnist_task
NAME : swarm_mnist_task
TASKTYPE : RUN_SWARM
CREATETIME : 2022-07-28 12:52:00
AUTHOR : HPESwarm
CONTENTLINES : 15
PREREQ : user_env_pyt_build_task
OUTCOME : swarm_mnist_task
FINALIZED : True
SWCI:22 > get task body swarm_mnist_task
0000: ---
0001: Command : model/mnist_pyt.py
0002: Entrypoint : python3
0003: WorkingDir : /tmp/test
0004: PrivateContent : /tmp/test/data-and-scratch
0005: SharedContent :
0006: - Src : /home/smadan/git/swarm-learning/workspace/mnist-pyt/model
0007: Tgt : /tmp/test/model
0008: MType : BIND
0009: Envvars :
0010: - DATA_DIR : data-and-scratch/app-data
0011: - SCRATCH_DIR : data-and-scratch/scratch
0012: - MODEL_DIR : model
0013: - MAX_EPOCHS : 2
0014: - MIN_PEERS : 4
SWCI:23 > list tasks
ROOTTASK
user_env_pyt_build_task
swarm_mnist_task
SWCI:24 > EXIT ON FAILURE OFF
SWCI:24 > EXIT ON FAILURE IS TURNED OFF
SWCI:25 >
SWCI:25 > # Assign run task
SWCI:25 > EXIT ON FAILURE
SWCI:25 > EXIT ON FAILURE IS TURNED ON
SWCI:26 > RESET TASKRUNNER defaulttaskbb.taskdb.sml.hpe
TaskRunner Reset
SWCI:27 > ASSIGN TASK swarm_mnist_task TO defaulttaskbb.taskdb.sml.hpe WITH 4 PEERS
Task assigned to TaskRunner
SWCI:28 > WAIT FOR TASKRUNNER defaulttaskbb.taskdb.sml.hpe
WAITING FOR TASKRUNNER TO COMPLETE
WAITING FOR TASKRUNNER TO COMPLETE
TASKRUNNER FINISHED STATE : ERROR TIME : 2022-07-28 12:52:29
SWCI:29 > RESET TASKRUNNER defaulttaskbb.taskdb.sml.hpe
TaskRunner Reset
SWCI:30 > EXIT ON FAILURE OFF
SWCI:30 > EXIT ON FAILURE IS TURNED OFF
SWCI:31 >
SWCI:31 > # List and reset training contract
SWCI:31 > EXIT ON FAILURE
SWCI:31 > EXIT ON FAILURE IS TURNED ON
SWCI:32 > LIST CONTRACTS
defaultbb.cqdb.sml.hpe
SWCI:33 > RESET CONTRACT defaultbb.cqdb.sml.hpe
Contract Reset
SWCI:34 > EXIT ON FAILURE OFF
SWCI:34 > EXIT ON FAILURE IS TURNED OFF
SWCI:35 >
SWCI:35 > # Exit
SWCI:35 > EXIT
SWCI:35 > EXITING


# Swarm Learning Version:
- Find the docker tag of the Swarm images ( $ docker images | grep hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning )

docker images | grep hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning
hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning/sn 1.0.0 0fbeb1e14459 3 months ago 1.23 GB
hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning/swci 1.0.0 3c76a7bb4f87 3 months ago 1.07 GB
hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning/swop 1.0.0 f0d463e98f17 3 months ago 953 MB
hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning/sl 1.0.0 d1c9f233521e 3 months ago 1.2 GB


# OS and ML Platform
- details of host OS:

cat /etc/centos-release
CentOS Linux release 8.5.2111


- details of ML platform used: pytorch
- details of Swarm learning Cluster (Number of machines, SL nodes, SN nodes): 2 machines, 2 SL nodes, 2 SN nodes
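
Since the issue title refers to cgroup v1, here is a minimal sketch of how one might confirm which cgroup layout a host or container exposes by reading `/proc/self/cgroup`. It relies on the format documented in cgroups(7): v1 hierarchies appear as `<id>:<controllers>:<path>`, while the v2 unified hierarchy appears as `0::<path>`. The function name `cgroup_layout` is made up for illustration and is not part of Swarm Learning.

```python
def cgroup_layout(cgroup_text: str) -> str:
    """Classify the contents of /proc/self/cgroup as 'v1', 'v2', 'hybrid' or 'unknown'."""
    has_v1 = False
    has_v2 = False
    for line in cgroup_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        hierarchy_id, controllers, _path = parts
        if hierarchy_id == "0" and controllers == "":
            has_v2 = True  # unified (v2) hierarchy entry: "0::<path>"
        else:
            has_v1 = True  # numbered v1 hierarchy with controllers or a name
    if has_v1 and has_v2:
        return "hybrid"
    if has_v2:
        return "v2"
    if has_v1:
        return "v1"
    return "unknown"
```

To classify the current process, pass in the file contents, for example `cgroup_layout(open("/proc/self/cgroup").read())`.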

# Quick Checklist: Respond [Yes/No]
- APLS server web GUI shows available Licenses? Yes
- If Multiple systems are used, can each system access every other system? Yes
- Is Password-less SSH configuration setup for all the systems? Yes
- If GPU or other protected resources are used, does the account have sufficient privileges to access and use them? Yes
- Is the user id a member of the docker group? yes

# Additional notes
- Are you running documented example without any modification? Almost. I modified the IPs in the SWOP profiles, added the SWARM_LOG_LEVEL=DEBUG environment variable to the run_swop script, and applied the workaround from https://github.com/HewlettPackard/swarm-learning/issues/103.
h-ahmad commented 2 years ago

Thanks to Yoshio Sugiyama (IMOKURI). This problem has already been resolved in #103. I solved the same problem using that solution. Please close this issue to prioritize the pending ones. Thanks.

maestro4 commented 2 years ago

> Thanks to Yoshio Sugiyama (IMOKURI). This problem has already been resolved in #103. I solved the same problem using that solution. Please close this issue to prioritize the pending ones. Thanks.

Actually, Yoshio Sugiyama (IMOKURI) asked me to create a new issue as the workaround from #103 doesn't work for me.

IMOKURI commented 2 years ago

I also tried on CentOS Stream 8 and could not reproduce the issue. (I did not use the #103 workaround.)

My SWOP log: [image]

Are you using CentOS 8 instead of CentOS Stream 8? (CentOS 8 is already EOL, so you might want to use another OS.)

What would be the result of the following command?

docker exec <Container Name of SWOP> cat /proc/self/cgroup 
maestro4 commented 2 years ago
$ docker exec swop1 cat /proc/self/cgroup
12:hugetlb:/
11:net_cls,net_prio:/
10:rdma:/
9:pids:/user.slice/user-1361.slice/session-2653.scope
8:blkio:/system.slice/sshd.service
7:cpuset:/
6:memory:/user.slice/user-1361.slice/session-2653.scope
5:perf_event:/
4:cpu,cpuacct:/
3:devices:/user.slice
2:freezer:/
1:name=systemd:/user.slice/user-1361.slice/user@1361.service/user.slice/podman-688920.scope/29a15e1074e18656d30438dd4acffe05f7da56d90a87e356929001d856bfab34
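
How SWOP itself derives the container id is not shown in this thread. As an illustration only, the sketch below scans cgroup lines like the ones above for a 64-character hex id and reports which controller line carries it. On this output only the `name=systemd` line matches, so any parser that inspects a different controller line would come up empty. The helper name `find_container_ids` is hypothetical.

```python
import re

# A full container id is 64 lowercase hex characters, not embedded in a longer hex run.
_ID_RE = re.compile(r"(?<![0-9a-f])[0-9a-f]{64}(?![0-9a-f])")


def find_container_ids(cgroup_text):
    """Return {controllers: container_id} for each cgroup line whose path holds a 64-hex id."""
    found = {}
    for line in cgroup_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        _hierarchy_id, controllers, path = parts
        match = _ID_RE.search(path)
        if match:
            found[controllers] = match.group(0)
    return found


SAMPLE = """\
9:pids:/user.slice/user-1361.slice/session-2653.scope
6:memory:/user.slice/user-1361.slice/session-2653.scope
1:name=systemd:/user.slice/user-1361.slice/user@1361.service/user.slice/podman-688920.scope/29a15e1074e18656d30438dd4acffe05f7da56d90a87e356929001d856bfab34
"""

print(find_container_ids(SAMPLE))
```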

We are actually using podman, not docker, on our systems. /var/run/docker.sock is available inside the containers, and I successfully tested creating containers through the socket with curl.
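
For reference, a similar socket check can be done from Python with only the standard library. This is a rough sketch, not what SWOP does: it sends `GET /_ping` (an endpoint of the Docker Engine API) over a Unix socket. The class and function names are invented for illustration, and the default path is simply the one mentioned above.

```python
import http.client
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP connection that talks to a Unix domain socket instead of TCP."""

    def __init__(self, socket_path, timeout=5.0):
        # The host name is only used for the Host header.
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock


def ping_engine(socket_path="/var/run/docker.sock"):
    """Send GET /_ping over the socket and return (status, body)."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request("GET", "/_ping")
        response = conn.getresponse()
        return response.status, response.read().decode()
    finally:
        conn.close()
```

Calling `ping_engine()` inside the SWOP container would show whether the service behind the socket answers at all. It says nothing about whether that service reports container ids the way SWOP expects.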

We have also tried to use pull_image task with swarm_mnist_task. pull_image works successfully but swarm_mnist_task fails with "Unable to extract container id", even though the image is pulled correctly:

2022-07-29 12:35:50,762 : swarm.swop : INFO : SWOPExecutor: ENROLL TO Task : user_env_pyt_build_task , opId : 11808671601640825250 - Begins
2022-07-29 12:35:53,782 : swarm.swop : INFO : SWOPExecutor: ENROLL TO Task : user_env_pyt_build_task , opId : 11808671601640825250 - Ends
2022-07-29 12:35:53,798 : swarm.swop : INFO : SWOPDockerPullTask: Validating profile
2022-07-29 12:35:53,948 : swarm.swop : INFO : SWOPDockerPullTask: Profile validated
2022-07-29 12:35:56,961 : swarm.swop : INFO : SWOPDockerPullTask: Using Default login credentials
2022-07-29 12:35:59,976 : swarm.swop : INFO : SWOPDockerPullTask: Docker pull started
2022-07-29 12:36:07,994 : swarm.swop : INFO : SWOPDockerPullTask: Docker Pull Successful
2022-07-29 12:36:11,008 : swarm.swop : INFO : SWOPExecutor: Total Tasks: 1 , Current Task : user_env_pyt_build_task , opId : 11808671601640825250 Done
2022-07-29 12:36:36,087 : swarm.swop : INFO : SWOPExecutor: ENROLL TO Task : swarm_mnist_task , opId : 11739075930308445596 - Begins
2022-07-29 12:36:39,105 : swarm.swop : INFO : SWOPExecutor: ENROLL TO Task : swarm_mnist_task , opId : 11739075930308445596 - Ends
2022-07-29 12:36:39,275 : swarm.swop : ERROR : Unable to extract container id
2022-07-29 12:36:42,289 : swarm.swop : INFO : SWOPExecutor: Total Tasks: 2 , Current Task : swarm_mnist_task , opId : 11739075930308445596 Done
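
To see which task was active when an error was logged, the log lines above can be parsed mechanically. The sketch below assumes the `timestamp : logger : level : message` layout visible in these logs and tracks the most recent `ENROLL TO Task` entry. The function names are illustrative, not part of Swarm Learning.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional


class LogRecord(NamedTuple):
    timestamp: datetime
    logger: str
    level: str
    message: str


TASK_RE = re.compile(r"ENROLL TO Task : (\S+)")


def parse_line(line: str) -> Optional[LogRecord]:
    """Parse one 'timestamp : logger : level : message' line, or return None."""
    parts = line.strip().split(" : ", 3)
    if len(parts) != 4:
        return None
    stamp, logger, level, message = parts
    try:
        timestamp = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")
    except ValueError:
        return None
    return LogRecord(timestamp, logger, level, message)


def errors_with_task(lines):
    """Return (task_name, message) for each ERROR record, using the last enrolled task."""
    current_task = None
    errors = []
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue
        match = TASK_RE.search(record.message)
        if match:
            current_task = match.group(1)
        if record.level == "ERROR":
            errors.append((current_task, record.message))
    return errors
```

Run over the log above, this attributes the "Unable to extract container id" error to swarm_mnist_task, not to the pull task.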
IMOKURI commented 2 years ago

Thanks for the logs.

I think swarm learning does not work with podman at this time.

If possible, could you please install docker and try swarm learning? (I think you can uninstall podman and buildah and install docker)

maestro4 commented 2 years ago

Unfortunately, all (GPU) systems in our organization are multi-user. Docker is not considered safe on such systems, so our IT only allows podman.

Can I do something to make the swarm-learning library compatible with podman?

RadhakrishnaJ commented 2 years ago

Currently Swarm learning is not qualified on podman.

RadhakrishnaJ commented 2 years ago

Closing this issue, as the actual issue of extracting the container ID is resolved in the latest 1.1.0 release.