deephealthproject / docker-compss-runtime


ERROR:ssh: Could not resolve hostname deephealth_compss-worker_2: Name or service not known #2

Open kinow opened 2 years ago

kinow commented 2 years ago

Hi,

I followed the instructions in the README.md file and got the Docker Compose cluster working after #1.

However, running the example fails with the error below.

kinow@ranma:/tmp/docker-compss-runtime$ docker-compose exec compss-master bash
(eddl_onnx_last) root@c6e014895899:~# cd pyeddl/third_party/compss_runtime/
(eddl_onnx_last) root@c6e014895899:~/pyeddl/third_party/compss_runtime# runcompss --lang=python --python_interpreter=python3 --project=linux-based/project.xml --resources=linux-based/resources.xml eddl_train_batch_compss.py
[  INFO] Using default execution type: compss

----------------- Executing eddl_train_batch_compss.py --------------------------

WARNING: COMPSs Properties file is null. Setting default values
[(778)    API]  -  Starting COMPSs Runtime v2.6.rc2003 (build 20200408-1126.rcbac84bafe556637e165de38764868ac68a8a75e)
Sleeping 30 seconds...
E:  uname_result(system='Linux', node='c6e014895899', release='5.4.0-120-generic', version='#136-Ubuntu SMP Fri Jun 10 13:40:48 UTC 2022', machine='x86_64', processor='x86_64')
Generating Random Table
---------------------------------------------
---------------------------------------------

None
CS with low memory setup
Model training...
Number of epochs:  1
Number of epochs for parameter syncronization:  1
Training epochs [ 1  -  1 ] ...
Num workers:  4
Num images per worker:  15000
Workers batch size:  250
[ERRMGR]  -  WARNING: There was an exception when initiating worker deephealth_compss-worker_4.
[ERRMGR]  -  WARNING: There was an exception when initiating worker deephealth_compss-worker_2.
                      Stack trace:
                      Stack trace:
                      es.bsc.compss.exceptions.InitNodeException: [START_CMD_ERROR]: Could not start the NIO worker in resource deephealth_compss-worker_4 through user .
                      es.bsc.compss.exceptions.InitNodeException: [START_CMD_ERROR]: Could not start the NIO worker in resource deephealth_compss-worker_2 through user .
                      OUTPUT:
                      OUTPUT:
                      ERROR:ssh: Could not resolve hostname deephealth_compss-worker_2: Name or service not known

                        at es.bsc.compss.nio.master.starters.WorkerStarter.startWorker(WorkerStarter.java:90)
                        at es.bsc.compss.nio.master.starters.WorkerStarter.startWorker(WorkerStarter.java:142)
                        at es.bsc.compss.nio.master.NIOWorkerNode.start(NIOWorkerNode.java:153)
                        at es.bsc.compss.types.resources.ResourceImpl.start(ResourceImpl.java:119)
                        at es.bsc.compss.scheduler.types.allocatableactions.StartWorkerAction$1.run(StartWorkerAction.java:109)
[ERRMGR]  -  ERROR:   [START_CMD_ERROR]: Could not start the NIO worker in resource deephealth_compss-worker_2 through user .
                      OUTPUT:
                      ERROR:ssh: Could not resolve hostname deephealth_compss-worker_2: Name or service not known
[ERRMGR]  -  Shutting down COMPSs...
                      ERROR:ssh: Could not resolve hostname deephealth_compss-worker_4: Name or service not known

                        at es.bsc.compss.nio.master.starters.WorkerStarter.startWorker(WorkerStarter.java:90)
                        at es.bsc.compss.nio.master.starters.WorkerStarter.startWorker(WorkerStarter.java:142)
                        at es.bsc.compss.nio.master.NIOWorkerNode.start(NIOWorkerNode.java:153)
                        at es.bsc.compss.types.resources.ResourceImpl.start(ResourceImpl.java:119)
                        at es.bsc.compss.scheduler.types.allocatableactions.StartWorkerAction$1.run(StartWorkerAction.java:109)
[(163161)    API]  -  Execution Finished
Shutting down the running process

Error running application

(eddl_onnx_last) root@c6e014895899:~/pyeddl/third_party/compss_runtime#

Thanks! -Bruno

kinow commented 2 years ago

The last time I used the --scale argument was a long time ago, with the PBS Torque Docker image. It looks like Docker Compose now adds a slug (a random hash appended to the container name).

That makes --scale harder to use as described in the documentation, since the master cannot resolve the worker hostnames.
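To confirm which worker names the master can actually resolve, something like the following could be run from inside the master container. This is only a rough sketch: the `unresolved_hosts` helper and the `WORKERS` list are hypothetical names, and the hostnames are simply the ones that appear in the error messages above.

```python
import socket

# Hostnames copied from the COMPSs error output; adjust if the
# resources.xml in use lists different workers (assumption).
WORKERS = [f"deephealth_compss-worker_{i}" for i in range(1, 5)]


def unresolved_hosts(names, resolve=socket.gethostbyname):
    """Return the names that the given resolver cannot look up."""
    missing = []
    for name in names:
        try:
            resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            missing.append(name)
    return missing
```

Calling `unresolved_hosts(WORKERS)` returns the list of names that fail to resolve; an empty list would mean the master can see every worker by name.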

Here's a diff that made the README instructions work (it could serve as a replacement for #1):

diff --git a/docker-compose.yaml b/docker-compose.yaml
index 59bedae..ba31cc6 100644
--- a/docker-compose.yaml
+++ b/docker-compose.yaml
@@ -1,12 +1,29 @@
 version: '3.7'
 services:
-  compss-worker:
+  compss-worker_1:
     image: "bscppc/compss-deephealth-demo"
     command: ["-c", "/usr/sbin/sshd -D"]
-       
+    container_name: deephealth_compss-worker_1
+  compss-worker_2:
+    image: "bscppc/compss-deephealth-demo"
+    command: ["-c", "/usr/sbin/sshd -D"]
+    container_name: deephealth_compss-worker_2
+  compss-worker_3:
+    image: "bscppc/compss-deephealth-demo"
+    command: ["-c", "/usr/sbin/sshd -D"]
+    container_name: deephealth_compss-worker_3
+  compss-worker_4:
+    image: "bscppc/compss-deephealth-demo"
+    command: ["-c", "/usr/sbin/sshd -D"]
+    container_name: deephealth_compss-worker_4
+
   compss-master:
     image: "bscppc/compss-deephealth-demo"
     stdin_open: true
     tty: true
     depends_on:
-      - compss-worker
+      - compss-worker_1
+      - compss-worker_2
+      - compss-worker_3
+      - compss-worker_4
+    container_name: deephealth_compss-master_1

I tried using a single worker, but I think the master configuration is set to 4 workers, so it seemed easier to just define the four workers directly in docker-compose.yaml.
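If the number of workers ever needs to change, the repeated service entries could be generated rather than written by hand. A minimal sketch, assuming the same image, command, and naming scheme as in the diff above; the `worker_services` function and its `project` parameter are hypothetical and not part of the repository:

```python
def worker_services(count, project="deephealth",
                    image="bscppc/compss-deephealth-demo"):
    """Build docker-compose service entries for statically named workers."""
    lines = []
    for i in range(1, count + 1):
        lines += [
            f"  compss-worker_{i}:",
            f'    image: "{image}"',
            '    command: ["-c", "/usr/sbin/sshd -D"]',
            # A fixed container_name keeps the hostname predictable,
            # so the master can reach the worker over ssh.
            f"    container_name: {project}_compss-worker_{i}",
        ]
    return "\n".join(lines)
```

Printing `worker_services(4)` gives the four worker entries to paste under `services:`. The `depends_on` list of the master still has to be kept in sync by hand.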

Thanks!