alces-software / gridware

tool for compilation and installation of applications and libraries from the Alces Gridware software library

Docker MPI fails to launch (permission denied) #18

Open ColonelPanics opened 7 years ago

ColonelPanics commented 7 years ago

On a Flight instance with feature/configure-docker enabled, the following is seen (after having created a memtester image):

[alces@login1(singularity-cluster) ~]$ alces gridware docker run --mpi=2 apps-memtester-4.3.0 memtester 1G 1
Executing 'alces/gridware-apps-memtester-4.3.0' with arguments 'memtester 1G 1'...

  >>> open /opt/gridware/docker/exports/.docker_temp_518603817: permission denied

alces gridware docker run: Job failed (exit status 1)

Permissions of the directory:

[alces@login1(singularity-cluster) ~]$ ls -ld /opt/gridware/docker/exports/
drwxrwsr-x 2 root root 6 Oct 12 09:13 /opt/gridware/docker/exports/

If these permissions are changed to:

[alces@login1(singularity-cluster) ~]$ ls -ld /opt/gridware/docker/exports/
drwxrwsrwx 2 root root 168 Oct 12 14:55 /opt/gridware/docker/exports/

then the command seems to loop indefinitely, waiting for slaves to be ready:

[alces@login1(singularity-cluster) ~]$ alces gridware docker run --mpi=2 apps-memtester-4.3.0 memtester 1G 1
Executing 'alces/gridware-apps-memtester-4.3.0' with arguments 'memtester 1G 1'...

  >>> Loaded image: 79a6dcc8-af5b-11e7-8208-0245577679b0:latest
  >>> image 79a6dcc8-af5b-11e7-8208-0245577679b0 could not be accessed on a registry to record
  >>> its digest. Each node will access 79a6dcc8-af5b-11e7-8208-0245577679b0 independently,
  >>> possibly leading to different nodes running different
  >>> versions of the image.
  >>>
  >>> Since --detach=false was not specified, tasks will be created in the background.
  >>> In a future release, --detach=false will become the default.
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 1 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 1 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 1 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready
  >>> 0 of 2 slaves ready

jamesremuscat commented 7 years ago

You need to build your own copy of the base image (alces gridware docker build --base) and share it with the cluster (alces gridware docker share base) in order for the MPI enhancements to be present in images.

We should probably push a new version of the base image to docker.io - not sure of the credentials / process for that though!

We should also consider what the correct solution to the permissions issue is - /opt/gridware used to have a chgrp -R gridware run on it, but we no longer do that (as part of adding userspace Gridware).

ColonelPanics commented 7 years ago

Pushed a rebuild of gridware-base: https://hub.docker.com/r/alces/gridware-base/

If the ownership change is only needed for the docker exports, could it be added as part of feature/configure-docker?

jamesremuscat commented 7 years ago

Presumably we want all cluster users - not just administrators, who traditionally would make up the gridware group - to be able to launch Gridware MPI jobs? If so, we need another solution, as chgrp alone is insufficient.

The easy and quick fix is to make the /opt/gridware/docker/exports directory world-writable.
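
A minimal sketch of what that change does to the mode bits, run against a scratch directory rather than the real path. The scratch directory and the variable names are illustrative assumptions, and `stat -c` assumes GNU coreutils:

```shell
# Scratch directory standing in for /opt/gridware/docker/exports (assumption).
demo="$(mktemp -d)"

# Reproduce the reported mode, drwxrwsr-x: setgid directory, no write for "other".
chmod 2775 "$demo"
before="$(stat -c '%a' "$demo")"

# The quick fix: add write permission for "other", giving drwxrwsrwx.
chmod o+w "$demo"
after="$(stat -c '%a' "$demo")"

echo "before: $before, after: $after"
ls -ld "$demo"
```

Note that a world-writable directory without the sticky bit also lets any user delete or rename files that other users created there, which is part of why this is only the quick fix.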

A more correct approach could be to leave the directory permissions as-is, add a sudoers.d entry allowing all users to run sudo docker save -o ${cw_GRIDWARE_root}/docker/exports/* without a password, and modify the docker_share command in clusterware-services accordingly.
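
As a rough sketch of what such an entry might look like, the snippet below writes a candidate fragment to a scratch file for review instead of installing it. The file name, the docker path `/usr/bin/docker`, and the literal `/opt/gridware` root (sudoers does not expand `${cw_GRIDWARE_root}`) are all assumptions, not taken from clusterware-services:

```shell
# Write the candidate fragment to a scratch file; nothing is installed.
fragment="$(mktemp)"

cat > "$fragment" <<'EOF'
# Hypothetical /etc/sudoers.d/gridware-docker-exports (name is an assumption).
# Lets any user run "docker save" into the exports directory without a password.
ALL ALL=(root) NOPASSWD: /usr/bin/docker save -o /opt/gridware/docker/exports/*
EOF

cat "$fragment"
```

In sudoers, a wildcard in command arguments matches across word boundaries. That is what lets the trailing image name through here, but it also makes the rule broader than it looks, so it would need tightening and a `visudo -cf` check before being deployed.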