Sage-Bionetworks / SynapseWorkflowHook

Code for linking a workflow engine to a Synapse evaluation queue
Apache License 2.0

Permission Error when running docker-compose #45

Open trberg opened 5 years ago

trberg commented 5 years ago

So we are running into an issue where the command "docker-compose --verbose up" runs into a permissions issue, even when running as sudo:

workflow-hook_1  | [INFO] BUILD FAILURE
workflow-hook_1  | [INFO] ------------------------------------------------------------------------
workflow-hook_1  | [INFO] Total time:  4.398 s
workflow-hook_1  | [INFO] Finished at: 2019-06-14T23:25:23Z
workflow-hook_1  | [INFO] ------------------------------------------------------------------------
workflow-hook_1  | [ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.2.1:java (default-cli) on project WorkflowHook: An exception occured while executing the Java class. null: InvocationTargetException: org.newsclub.net.unix.AFUNIXSocketException: Permission denied (socket: /run/docker.sock) -> [Help 1]

We find we can bypass this error by running docker-compose in a privileged state. However, we then run into another permission error further down the CWL pipeline when trying to pull in Docker containers.

STDERR: 2019-06-13T21:33:19.533538167Z WARNING:toil.leader:d/T/jobIXsDkh    Traceback (most recent call last):
STDERR: 2019-06-13T21:33:19.533545557Z WARNING:toil.leader:d/T/jobIXsDkh      File "runDocker.py", line 157, in <module>
STDERR: 2019-06-13T21:33:19.533553710Z WARNING:toil.leader:d/T/jobIXsDkh        main(args)
STDERR: 2019-06-13T21:33:19.533561110Z WARNING:toil.leader:d/T/jobIXsDkh      File "runDocker.py", line 54, in main
STDERR: 2019-06-13T21:33:19.533568944Z WARNING:toil.leader:d/T/jobIXsDkh        for cont in client.containers.list(all=True):
STDERR: 2019-06-13T21:33:19.533576527Z WARNING:toil.leader:d/T/jobIXsDkh      File "/usr/local/lib/python2.7/site-packages/docker/models/containers.py", line 824, in list
STDERR: 2019-06-13T21:33:19.533586174Z WARNING:toil.leader:d/T/jobIXsDkh        since=since)
STDERR: 2019-06-13T21:33:19.533593970Z WARNING:toil.leader:d/T/jobIXsDkh      File "/usr/local/lib/python2.7/site-packages/docker/api/container.py", line 191, in containers
STDERR: 2019-06-13T21:33:19.533611794Z WARNING:toil.leader:d/T/jobIXsDkh        res = self._result(self._get(u, params=params), True)
STDERR: 2019-06-13T21:33:19.533620087Z WARNING:toil.leader:d/T/jobIXsDkh      File "/usr/local/lib/python2.7/site-packages/docker/utils/decorators.py", line 46, in inner
STDERR: 2019-06-13T21:33:19.533627987Z WARNING:toil.leader:d/T/jobIXsDkh        return f(self, *args, **kwargs)
STDERR: 2019-06-13T21:33:19.533635597Z WARNING:toil.leader:d/T/jobIXsDkh      File "/usr/local/lib/python2.7/site-packages/docker/api/client.py", line 189, in _get
STDERR: 2019-06-13T21:33:19.533643460Z WARNING:toil.leader:d/T/jobIXsDkh        return self.get(url, **self._set_request_timeout(kwargs))
STDERR: 2019-06-13T21:33:19.533651040Z WARNING:toil.leader:d/T/jobIXsDkh      File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 546, in get
STDERR: 2019-06-13T21:33:19.533658917Z WARNING:toil.leader:d/T/jobIXsDkh        return self.request('GET', url, **kwargs)
STDERR: 2019-06-13T21:33:19.533666414Z WARNING:toil.leader:d/T/jobIXsDkh      File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 533, in request
STDERR: 2019-06-13T21:33:19.533674287Z WARNING:toil.leader:d/T/jobIXsDkh        resp = self.send(prep, **send_kwargs)
STDERR: 2019-06-13T21:33:19.533681737Z WARNING:toil.leader:d/T/jobIXsDkh      File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 646, in send
STDERR: 2019-06-13T21:33:19.533689548Z WARNING:toil.leader:d/T/jobIXsDkh        r = adapter.send(request, **kwargs)
STDERR: 2019-06-13T21:33:19.533697004Z WARNING:toil.leader:d/T/jobIXsDkh      File "/usr/local/lib/python2.7/site-packages/requests/adapters.py", line 498, in send
STDERR: 2019-06-13T21:33:19.533705168Z WARNING:toil.leader:d/T/jobIXsDkh        raise ConnectionError(err, request=request)
STDERR: 2019-06-13T21:33:19.533712621Z WARNING:toil.leader:d/T/jobIXsDkh    requests.exceptions.ConnectionError: ('Connection aborted.', error(13, 'Permission denied'))

We are using Red Hat (which doesn't support docker-compose) as our OS and are running Docker version 1.13.1.
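
For reference, a few host-side checks that can be useful when the Docker socket reports permission denied on a Red Hat host (a hypothetical diagnostic sketch, not commands from the original report; it assumes a standard RHEL install with auditd):

ls -l /var/run/docker.sock        # who owns the socket, and what are its group and mode?
id                                # is the invoking user in the group that owns the socket (often 'docker' or 'dockerroot')?
getenforce                        # is SELinux enforcing?
sudo ausearch -m avc -ts recent   # any recent SELinux denials mentioning docker.sock?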

Our reference evaluation pipeline is located here: https://github.com/Sage-Bionetworks/EHR-challenge and is correctly being pulled into the running pipeline.

We had this pipeline up and running at one point but had to restart the VM and now it's broken. The restart updated the OS and docker version but didn't radically change anything.

Any insight would be helpful to troubleshoot this issue.

Thank you

trberg commented 5 years ago

That is correct: the workflow_shared volume was created in a privileged state, and yet the non-privileged ubuntu container was able to access it.

I'm not sure what to do with that....

jprosser commented 5 years ago

There's a bit of confusion over the wording here too: when you say "workflow is creating directories in the container", does that mean creating files in the mounted volume, from within a running container?

If you're creating files via the user running compose (trberg), that is going to run into problems.

brucehoff commented 5 years ago

Is there any way, in your environment, to have two containers share files through a mounted volume, with one of the containers also able to access the Docker Engine? Or is that simply not possible?

If it's not possible then we can stop trying to get it to work and start pursuing other alternatives.

jprosser commented 5 years ago

Volumes should make that possible; I would expect running containers to be able to share volumes.

brucehoff commented 5 years ago

Putting aside the challenge and the workflow hook, can you create a working demo with two generic containers and a volume through which they share data? Again, one container must be able to access the Docker Engine.
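
For example, something along these lines would qualify (a minimal sketch with illustrative names, not a prescribed test; the second container reaches the Docker Engine through the host's socket and assumes the official docker:cli image is available):

docker volume create demo_shared
docker run --rm -v demo_shared:/shared alpine sh -c 'echo hello > /shared/hello.txt'
docker run --rm -v demo_shared:/shared \
    -v /var/run/docker.sock:/var/run/docker.sock \
    docker:cli sh -c 'cat /shared/hello.txt && docker ps'   # reads the shared file and lists containers via the host engine

On an SELinux-enforcing host the last step may still be blocked at the socket, which is part of what we are trying to pin down here.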

jprosser commented 5 years ago

OK, I'm not sure of the best way to show this, but I just ran through a simple test using a volume shared between a number of containers, and it worked just fine. Maybe I'll just toss in my terminal output here. Let me know if I should expand on this.

term.txt

brucehoff commented 5 years ago

@jprosser I read through term.txt but did not see your demonstration that the container was able to access the Docker engine. Could you please explain how you demonstrated this?

jprosser commented 5 years ago

Here's a run testing with a privileged container, and then also with the user set to "bob" to show that scenario. User root is id=0, which is the same everywhere, but if you toss in a user, that user will have some other id. If you want to share files and you're not running as user root (id=0), these ids need to match and the permissions also need to allow it. Hope this helps! priv.txt

I'm sure you'll notice, but just to call it out: there are --privileged flags tossed into the various runs too. I hurried through, looking for anything unexpected as I went, but did not notice anything amiss; everything went as I expected based on Unix permissions, and there were no SELinux denials since I didn't use my home dir for anything but building an image.
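
To spell out the uid-matching point with a minimal sketch (illustrative commands, not the exact runs in priv.txt):

docker volume create uid_demo
docker run --rm -v uid_demo:/data alpine sh -c 'echo hi > /data/f && chown 1000:1000 /data/f && chmod 600 /data/f'   # written as root (uid 0)
docker run --rm --user 1000:1000 -v uid_demo:/data alpine cat /data/f   # succeeds: uid matches the file owner
docker run --rm --user 1001:1001 -v uid_demo:/data alpine cat /data/f   # fails: Permission denied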

brucehoff commented 5 years ago

@jprosser Thanks for doing this suite of tests. Do the results suggest what you can change when running the workflow hook to allow file sharing between containers to work? If not, how shall we proceed to investigate the issue?

jprosser commented 5 years ago

Yes. First of all, using root as the user within the container will make things simpler when using one volume with multiple containers. The downside is that the user/dev (my term here for the developer's user login) who runs docker is not root (in our environment), so they don't have direct access to the volume(s), though they can build images and run containers that do have access; that's something for the user/dev to be aware of.

A Dockerfile that copies data from the user/dev's home dir, cwd, or some other path into the image is fine during docker build, but the result is of course read-only, being baked into an image. Adding a volume gives a persistent area to write to and share between containers, provided that all the activity is done as user root (id=0); otherwise user and permission management must be handled explicitly.

For the user/dev to copy data into that volume, the "docker cp" command can be used, but as far as I know this requires a running container with that volume mounted.
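
For instance (a rough sketch with made-up file names):

docker volume create shared_data
docker run -d --name copy_helper -v shared_data:/shared alpine sleep infinity   # throwaway container that just keeps the volume mounted
docker cp ./inputs.csv copy_helper:/shared/inputs.csv     # host -> volume
docker cp copy_helper:/shared/results.csv ./results.csv   # volume -> host
docker rm -f copy_helper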

jprosser commented 5 years ago

Also, I haven't actually looked at this project yet and have just been helping Tim out. I hope to check it out and perhaps offer some suggestions or PRs if time allows and you have interest.

trberg commented 5 years ago

@jprosser could we try to figure out a way to run docker-compose in a non-privileged state? I believe that would involve expanding the permissions on docker.sock, but I'm not sure.

jprosser commented 5 years ago

If there's a need for a container to orchestrate, then I believe that privileged would be required, basically running docker in docker.

If we're just hung up on getting data from the dev/user file system space into the container world, that could be solved with a Dockerfile that creates a data container carrying data copied in directly during docker build. Or, perhaps better, that container could be used as the copy tool for moving data into and out of a volume shared among the various containers here, as sketched below.
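
A rough sketch of that idea (illustrative names; the Dockerfile is written inline only to keep the example self-contained, and ./challenge_data is assumed to exist in the build context):

printf 'FROM alpine:3.10\nCOPY challenge_data /staged\n' > Dockerfile.data
docker build -f Dockerfile.data -t challenge-data .                              # data is baked in, read-only, at build time
docker volume create shared_data
docker run --rm -v shared_data:/shared challenge-data cp -r /staged/. /shared/   # copy it into a writable shared volume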

brucehoff commented 5 years ago

running docker in docker

Commonly the phrase "running docker in docker" means running the docker Engine in a docker container, a practice that's advised against. Here we are merely running the Docker client in a container (the Docker Engine runs on the host), which is generally acceptable.
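
In other words (an illustrative sketch, not the Hook's actual code), the container only needs the Docker client and the host's socket; anything it starts becomes a sibling container on the host rather than a nested one:

docker run --rm -v /var/run/docker.sock:/var/run/docker.sock docker:cli \
    docker run -d --name sibling alpine sleep 60   # 'sibling' is created on the host engine
docker ps --filter name=sibling                    # visible from the host: nothing is nested
docker rm -f sibling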

If we're just hung-up on getting data from the dev/user file system space into the container world ...

The error we are addressing is not related to moving data files into a container volume but rather sharing files between containers via a volume.

using root as the user within the container will make things simpler when using one volume with multiple containers

Yes, that is clear from the results of your recent experiments. It's not clear to me what you did when running the Synapse Workflow Hook (when the second container failed to access a file written by the first one, as shown here https://github.com/Sage-Bionetworks/SynapseWorkflowHook/issues/45#issuecomment-509819602).

STDERR: 2019-07-09T21:31:28.334908559Z OSError: [Errno 13] Permission denied: '/var/lib/docker/volumes/synapseworkflowhook_shared/_data/182337f7-8533-4f42-8ecc-e4a5e3a3b3cc/EHR-challenge-master/docker_agent_workflow.cwl'

Were you not using 'root' as the user in the containers?

trberg commented 5 years ago

Were you not using 'root' as the user in the containers?

I went and checked on this. When we spin up the workflow container, here is the "top" output:

UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
root                5166                5147                1                   11:55               ?                   00:00:18            /usr/local/openjdk-11/bin/java -classpath /usr/share/maven/boot/plexus-classworlds-2.6.0.jar -Dclassworlds.conf=/usr/share/maven/bin/m2.conf -Dmaven.home=/usr/share/maven -Dlibrary.jansi.path=/usr/share/maven/lib/jansi-native -Dmaven.multiModuleProjectDirectory=/ org.codehaus.plexus.classworlds.launcher.Launcher exec:java -DentryPoint=org.sagebionetworks.WorkflowHook

It seems we are running as root in the container, at least here.

jprosser commented 5 years ago

That pid is running right now, on the host, and is not in a container. I don't know offhand how that is possible, though I can imagine that having access to docker permissions would give an avenue.

jprosser commented 5 years ago

Ok, sorry, that wasn't the case, I just got in a bit of a hurry there.

jprosser commented 5 years ago

Here's what I see right on that root process in the process tree of the host:

├─dockerd-current─┬─docker-containe─┬─docker-containe─┬─java───19*[{java}]
│                 │                 │                 └─9*[{docker-containe}]
│                 │                 └─12*[{docker-containe}]
│                 └─12*[{dockerd-current}]

trberg commented 5 years ago

@brucehoff I notice that at the end of the Dockerfile.Toil file we have the following:

WORKDIR /workdir

Since in the docker-compose.yaml file we are setting the volumes like this:

 volumes:
    - shared:/shared:rw
    - /var/run/docker.sock:/var/run/docker.sock

Could this be causing an issue? Should that WORKDIR be /shared?

brucehoff commented 5 years ago

Could this be causing an issue?

No: "The WORKDIR instruction sets the working directory for any RUN, CMD, ENTRYPOINT, COPY and ADD instructions that follow it in the Dockerfile." from: https://docs.docker.com/engine/reference/builder/#workdir

Since the WORKDIR line is the last line in Dockerfile.Toil it has no effect. We will remove it to avoid future confusion. Additionally when we run Toil (using this container image) we include the workdir option which indeed points inside the shared volume and which overrides the WORKDIR in the Dockerfile: https://docs.docker.com/engine/reference/run/#workdir

You can verify my claim if you have a Toil container remaining on your system from a previous run (even if it's stopped) by running docker inspect on the container and looking at the setting for the working directory.
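
For example (hypothetical container name; .Config.WorkingDir is the field docker inspect exposes for this):

docker inspect --format '{{.Config.WorkingDir}}' <toil-container-name>   # should print the .../_data/<uuid> path, not /workdir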

brucehoff commented 5 years ago

At this point it's not clear to me what the difference is between the manual experiment showing two containers sharing a file through a volume and the Workflow Hook failing to do the same thing in your environment. Is it clear to you what the next step is or do we need to put our heads together to decide what to do next?

trberg commented 5 years ago

Yeah, let's get together and brainstorm; we're out of ideas on our end.

brucehoff commented 5 years ago

Note: To run the workflow hook without Docker Compose:

export DOCKER_ENGINE_URL=unix:///var/run/docker.sock
export SYNAPSE_USERNAME=xxxxx
export SYNAPSE_PASSWORD=xxxxx
export WORKFLOW_OUTPUT_ROOT_ENTITY_ID=synXXXXX
export TOIL_CLI_OPTIONS="--defaultMemory 100M --retryCount 0 --defaultDisk 1000000"
export EVALUATION_TEMPLATES='{"xxxxx":"synXXXXX"}'
export MAX_CONCURRENT_WORKFLOWS=2
export SUBMITTER_NOTIFICATION_MASK=28
export COMPOSE_PROJECT_NAME=workflow_orchestrator

docker volume create ${COMPOSE_PROJECT_NAME}_shared

docker pull sagebionetworks/synapseworkflowhook

docker run -v ${COMPOSE_PROJECT_NAME}_shared:/shared:rw -v /var/run/docker.sock:/var/run/docker.sock:rw \
-e DOCKER_ENGINE_URL=${DOCKER_ENGINE_URL} \
-e SYNAPSE_USERNAME=${SYNAPSE_USERNAME} \
-e SYNAPSE_PASSWORD=${SYNAPSE_PASSWORD} \
-e WORKFLOW_OUTPUT_ROOT_ENTITY_ID=${WORKFLOW_OUTPUT_ROOT_ENTITY_ID} \
-e EVALUATION_TEMPLATES=${EVALUATION_TEMPLATES} \
-e NOTIFICATION_PRINCIPAL_ID=${NOTIFICATION_PRINCIPAL_ID} \
-e SHARE_RESULTS_IMMEDIATELY=${SHARE_RESULTS_IMMEDIATELY} \
-e DATA_UNLOCK_SYNAPSE_PRINCIPAL_ID=${DATA_UNLOCK_SYNAPSE_PRINCIPAL_ID} \
-e TOIL_CLI_OPTIONS="${TOIL_CLI_OPTIONS}" \
-e MAX_CONCURRENT_WORKFLOWS=${MAX_CONCURRENT_WORKFLOWS} \
-e SUBMITTER_NOTIFICATION_MASK=${SUBMITTER_NOTIFICATION_MASK} \
-e COMPOSE_PROJECT_NAME=${COMPOSE_PROJECT_NAME} \
--privileged \
sagebionetworks/synapseworkflowhook

brucehoff commented 5 years ago

Running the above (and submitting a job) I am able to replicate what the UW folks encountered, as shown below. The workflow hook runs, downloads the workflow and starts the Toil container, but Toil is not able to see the workflow it needs to run:

STDERR: 2019-07-18T14:26:18.134200065Z Traceback (most recent call last):
STDERR: 2019-07-18T14:26:18.134273484Z   File "/usr/local/bin/toil-cwl-runner", line 10, in <module>
STDERR: 2019-07-18T14:26:18.134286088Z     sys.exit(main())
STDERR: 2019-07-18T14:26:18.134295174Z   File "/usr/local/lib/python2.7/site-packages/toil/cwl/cwltoil.py", line 1200, in main
STDERR: 2019-07-18T14:26:18.135198206Z     loading_context.fetcher_constructor)
STDERR: 2019-07-18T14:26:18.135248397Z   File "/usr/local/lib/python2.7/site-packages/cwltool/load_tool.py", line 86, in resolve_tool_uri
STDERR: 2019-07-18T14:26:18.135769949Z     uri = resolver(document_loader, argsworkflow)
STDERR: 2019-07-18T14:26:18.135818886Z   File "/usr/local/lib/python2.7/site-packages/cwltool/resolver.py", line 44, in tool_resolver
STDERR: 2019-07-18T14:26:18.136356452Z     ret = r(document_loader, uri)
STDERR: 2019-07-18T14:26:18.136411953Z   File "/usr/local/lib/python2.7/site-packages/cwltool/resolver.py", line 21, in resolve_local
STDERR: 2019-07-18T14:26:18.136456183Z     if pathobj.is_file():
STDERR: 2019-07-18T14:26:18.136474956Z   File "/usr/local/lib/python2.7/site-packages/pathlib2/__init__.py", line 1575, in is_file
STDERR: 2019-07-18T14:26:18.137362524Z     return S_ISREG(self.stat().st_mode)
STDERR: 2019-07-18T14:26:18.137392060Z   File "/usr/local/lib/python2.7/site-packages/pathlib2/__init__.py", line 1356, in stat
STDERR: 2019-07-18T14:26:18.137570073Z     return self._accessor.stat(self)
STDERR: 2019-07-18T14:26:18.137583994Z   File "/usr/local/lib/python2.7/site-packages/pathlib2/__init__.py", line 541, in wrapped
STDERR: 2019-07-18T14:26:18.137684245Z     return strfunc(str(pathobj), *args)
STDERR: 2019-07-18T14:26:18.137697512Z OSError: [Errno 13] Permission denied: '/var/lib/docker/volumes/workflow_orchestrator_shared/_data/215e5e3f-7768-490c-a00b-419944dfa066/SynapseWorkflowExample-master/workflow-entrypoint.cwl'

brucehoff commented 5 years ago

Here's an interesting finding: I am able to see the mounted file in another container. After the failure the shared volume remains and the downloaded workflow is still there. I ran a simple 'ubuntu' container, mounting the shared volume, and can see the workflow. This tells me there is nothing inherent in UW's environment that precludes sharing files between containers:

docker run -it --rm -v workflow_orchestrator_shared:/shared ubuntu bash
root@d2ac7be48c26:/# cat /shared/215e5e3f-7768-490c-a00b-419944dfa066/SynapseWorkflowExample-master/workflow-entrypoint.cwl
#!/usr/bin/env cwl-runner
#
# Sample workflow
# Inputs:
#   submissionId: ID of the Synapse submission to process
#   adminUploadSynId: ID of a folder accessible only to the submission queue administrator
#   submitterUploadSynId: ID of a folder accessible to the submitter
#   workflowSynapseId:  ID of the Synapse entity containing a reference to the workflow file(s)
#   synapseConfig: configuration file for Synapse client, including login credentials
#
cwlVersion: v1.0
class: Workflow
...
brucehoff commented 5 years ago

Why can the 'ubuntu' container see the file but the Toil container cannot? To investigate, let's see how the Toil container is started up:

docker inspect --format "$(<run.tpl)" workflow_job.1b4617b6-23dc-4727-b904-dc904da47aa8
docker run \
    --name=/workflow_job.1b4617b6-23dc-4727-b904-dc904da47aa8 \
    --env="TMPDIR=/var/lib/docker/volumes/workflow_orchestrator_shared/_data/fd3eb6a1-395d-4815-82c2-7a8b37aff108" \
    --env="TEMP=/var/lib/docker/volumes/workflow_orchestrator_shared/_data/fd3eb6a1-395d-4815-82c2-7a8b37aff108" \
    --env="TMP=/var/lib/docker/volumes/workflow_orchestrator_shared/_data/fd3eb6a1-395d-4815-82c2-7a8b37aff108" \
    --env="DOCKER_HOST=unix:///var/run/docker.sock" \
    --env="PATH=/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" \
    --env="LANG=C.UTF-8" \
    --env="PYTHONIOENCODING=UTF-8" \
    --env="GPG_KEY=C01E1CAD5EA2C4F0B8E3571504C367C218ADD4FF" \
    --env="PYTHON_VERSION=2.7.16" \
    --env="PYTHON_PIP_VERSION=19.1.1" \
    --network "bridge" \
     \
    --volume="/var/lib/docker/volumes/workflow_orchestrator_shared/_data/fd3eb6a1-395d-4815-82c2-7a8b37aff108:/var/lib/docker/volumes/workflow_orchestrator_shared/_data/fd3eb6a1-395d-4815-82c2-7a8b37aff108:rw" \
    --volume="/var/run/docker.sock:/var/run/docker.sock:rw" \
    --log-driver="json-file" \
    --log-opt max-file="2" \
    --log-opt max-size="1g" \
    --restart="" \
    --detach=true \
    "sagebionetworks/synapseworkflowhook-toil" \
    "toil-cwl-runner" "--defaultMemory" "100M" "--retryCount" "0" "--defaultDisk" "1000000" "--workDir" "/var/lib/docker/volumes/workflow_orchestrator_shared/_data/fd3eb6a1-395d-4815-82c2-7a8b37aff108" "--noLinkImports" "SynapseWorkflowExample-master/workflow-entrypoint.cwl" "/var/lib/docker/volumes/workflow_orchestrator_shared/_data/fd3eb6a1-395d-4815-82c2-7a8b37aff108/TMP2051815944166861985.yaml" 

We can clean this up a lot, to leave:

docker run -it --rm \
--volume="/var/lib/docker/volumes/workflow_orchestrator_shared/_data/fd3eb6a1-395d-4815-82c2-7a8b37aff108:/var/lib/docker/volumes/workflow_orchestrator_shared/_data/fd3eb6a1-395d-4815-82c2-7a8b37aff108:rw" \
sagebionetworks/synapseworkflowhook-toil bash

(We add in '-it' so we can use it interactively and '--rm' to clean it up when we're done.) Result:

docker run -it --rm --volume="/var/lib/docker/volumes/workflow_orchestrator_shared/_data/fd3eb6a1-395d-4815-82c2-7a8b37aff108:/var/lib/docker/volumes/workflow_orchestrator_shared/_data/fd3eb6a1-395d-4815-82c2-7a8b37aff108:rw" sagebionetworks/synapseworkflowhook-toil bash
root@eeba063cb894:/# more /var/lib/docker/volumes/workflow_orchestrator_shared/_data/fd3eb6a1-395d-4815-82c2-7a8b37aff108/SynapseWorkflowExample-master/workflow-entrypoint.cwl
#!/usr/bin/env cwl-runner
#
# Sample workflow
# Inputs:
#   submissionId: ID of the Synapse submission to process
#   adminUploadSynId: ID of a folder accessible only to the submission queue administrator
#   submitterUploadSynId: ID of a folder accessible to the submitter
#   workflowSynapseId:  ID of the Synapse entity containing a reference to the workflow file(s)
#   synapseConfig: configuration file for Synapse client, including login credentials
#
...

Why does it work!?!?! Perhaps my clean up of the docker run command omitted some key element. Restoring as much as possible of the original command did not change anything.

 docker run \
>     --name=/mytest \
>     --env="TMPDIR=/var/lib/docker/volumes/workflow_orchestrator_shared/_data/fd3eb6a1-395d-4815-82c2-7a8b37aff108" \
>     --env="TEMP=/var/lib/docker/volumes/workflow_orchestrator_shared/_data/fd3eb6a1-395d-4815-82c2-7a8b37aff108" \
>     --env="TMP=/var/lib/docker/volumes/workflow_orchestrator_shared/_data/fd3eb6a1-395d-4815-82c2-7a8b37aff108" \
>     --env="DOCKER_HOST=unix:///var/run/docker.sock" \
>     --env="PATH=/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" \
>     --env="LANG=C.UTF-8" \
>     --env="PYTHONIOENCODING=UTF-8" \
>     --env="GPG_KEY=C01E1CAD5EA2C4F0B8E3571504C367C218ADD4FF" \
>     --env="PYTHON_VERSION=2.7.16" \
>     --env="PYTHON_PIP_VERSION=19.1.1" \
>     --network "bridge" \
>     --volume="/var/lib/docker/volumes/workflow_orchestrator_shared/_data/fd3eb6a1-395d-4815-82c2-7a8b37aff108:/var/lib/docker/volumes/workflow_orchestrator_shared/_data/fd3eb6a1-395d-4815-82c2-7a8b37aff108:rw" \
>     --volume="/var/run/docker.sock:/var/run/docker.sock:rw" \
>     --log-driver="json-file" \
>     --log-opt max-file="2" \
>     --log-opt max-size="1g" \
>     --restart="" \
>     -it --rm \
>     "sagebionetworks/synapseworkflowhook-toil" bash
root@1b7f68c37403:/#    more /var/lib/docker/volumes/workflow_orchestrator_shared/_data/fd3eb6a1-395d-4815-82c2-7a8b37aff108/SynapseWorkflowExample-master/workflow-entrypoint.cwl
#!/usr/bin/env cwl-runner
#
# Sample workflow
# Inputs:
#   submissionId: ID of the Synapse submission to process
#   adminUploadSynId: ID of a folder accessible only to the submission queue administrator
#   submitterUploadSynId: ID of a folder accessible to the submitter
#   workflowSynapseId:  ID of the Synapse entity containing a reference to the workflow file(s)
#   synapseConfig: configuration file for Synapse client, including login credentials
#
cwlVersion: v1.0
...

brucehoff commented 5 years ago

OK, then, let's try running the workflow itself:

docker run \
>     --name=/workflow_job.MANUAL \
>     --env="TMPDIR=/var/lib/docker/volumes/workflow_orchestrator_shared/_data/fd3eb6a1-395d-4815-82c2-7a8b37aff108" \
>     --env="TEMP=/var/lib/docker/volumes/workflow_orchestrator_shared/_data/fd3eb6a1-395d-4815-82c2-7a8b37aff108" \
>     --env="TMP=/var/lib/docker/volumes/workflow_orchestrator_shared/_data/fd3eb6a1-395d-4815-82c2-7a8b37aff108" \
>     --env="DOCKER_HOST=unix:///var/run/docker.sock" \
>     --env="PATH=/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" \
>     --env="LANG=C.UTF-8" \
>     --env="PYTHONIOENCODING=UTF-8" \
>     --env="GPG_KEY=C01E1CAD5EA2C4F0B8E3571504C367C218ADD4FF" \
>     --env="PYTHON_VERSION=2.7.16" \
>     --env="PYTHON_PIP_VERSION=19.1.1" \
>     --network "bridge" \
>     --volume="/var/lib/docker/volumes/workflow_orchestrator_shared/_data/fd3eb6a1-395d-4815-82c2-7a8b37aff108:/var/lib/docker/volumes/workflow_orchestrator_shared/_data/fd3eb6a1-395d-4815-82c2-7a8b37aff108:rw" \
>     --volume="/var/run/docker.sock:/var/run/docker.sock:rw" \
>     --log-driver="json-file" \
>     --log-opt max-file="2" \
>     --log-opt max-size="1g" \
>     --restart="" \
>     --detach=true \
>     "sagebionetworks/synapseworkflowhook-toil" \
>     "toil-cwl-runner" "--defaultMemory" "100M" "--retryCount" "0" "--defaultDisk" "1000000" "--workDir" "/var/lib/docker/volumes/workflow_orchestrator_shared/_data/fd3eb6a1-395d-4815-82c2-7a8b37aff108" "--noLinkImports" "SynapseWorkflowExample-master/workflow-entrypoint.cwl" "/var/lib/docker/volumes/workflow_orchestrator_shared/_data/fd3eb6a1-395d-4815-82c2-7a8b37aff108/TMP2051815944166861985.yaml"  

It kicks off, no problem. Let's look at the logs:

docker logs workflow_job.MANUAL
Traceback (most recent call last):
  File "/usr/local/bin/toil-cwl-runner", line 10, in <module>
    sys.exit(main())
  File "/usr/local/lib/python2.7/site-packages/toil/cwl/cwltoil.py", line 1200, in main
    loading_context.fetcher_constructor)
  File "/usr/local/lib/python2.7/site-packages/cwltool/load_tool.py", line 89, in resolve_tool_uri
    raise ValidationException("Not found: '%s'" % argsworkflow)
schema_salad.validate.ValidationException: Not found: 'SynapseWorkflowExample-master/workflow-entrypoint.cwl'

As when run from the workflow hook, it cannot find the workflow file(s). The odd thing is that the error is different: instead of "Permission denied" we get "Not found".

jprosser commented 5 years ago

The host file paths are being used here, whereas with volumes I would expect them to be named and referenced for use within a container's file system, mounted at the appropriate spot.

-Justin

brucehoff commented 5 years ago

I modified the previous command to make the path to the .cwl file absolute, not relative. The workflow appears to run:

docker run \
    --name=/workflow_job.MANUAL \
>     --env="TMPDIR=/var/lib/docker/volumes/workflow_orchestrator_shared/_data/fd3eb6a1-395d-4815-82c2-7a8b37aff108" \
>     --env="TEMP=/var/lib/docker/volumes/workflow_orchestrator_shared/_data/fd3eb6a1-395d-4815-82c2-7a8b37aff108" \
>     --env="TMP=/var/lib/docker/volumes/workflow_orchestrator_shared/_data/fd3eb6a1-395d-4815-82c2-7a8b37aff108" \
>     --env="DOCKER_HOST=unix:///var/run/docker.sock" \
>     --env="PATH=/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" \
>     --env="LANG=C.UTF-8" \
>     --env="PYTHONIOENCODING=UTF-8" \
>     --env="GPG_KEY=C01E1CAD5EA2C4F0B8E3571504C367C218ADD4FF" \
>     --env="PYTHON_VERSION=2.7.16" \
>     --env="PYTHON_PIP_VERSION=19.1.1" \
>     --network "bridge" \
>     --volume="workflow_orchestrator_shared:/var/lib/docker/volumes/workflow_orchestrator_shared/_data:rw" \
>     --volume="/var/run/docker.sock:/var/run/docker.sock:rw" \
>     --log-driver="json-file" \
>     --log-opt max-file="2" \
>     --log-opt max-size="1g" \
>     --restart="" \
>     --detach=true \
>     "sagebionetworks/synapseworkflowhook-toil" \
>     "toil-cwl-runner" "--defaultMemory" "100M" "--retryCount" "0" "--defaultDisk" "1000000" "--workDir" "/var/lib/docker/volumes/workflow_orchestrator_shared/_data/fd3eb6a1-395d-4815-82c2-7a8b37aff108" "--noLinkImports" "/var/lib/docker/volumes/workflow_orchestrator_shared/_data/fd3eb6a1-395d-4815-82c2-7a8b37aff108/SynapseWorkflowExample-master/workflow-entrypoint.cwl" "/var/lib/docker/volumes/workflow_orchestrator_shared/_data/fd3eb6a1-395d-4815-82c2-7a8b37aff108/TMP2051815944166861985.yaml" 

I can even see the result uploaded to Synapse: https://www.synapse.org/#!Synapse:syn20540092

So we are not able to replicate the problem seen with the workflow hook by running containers manually.

brucehoff commented 5 years ago

The host file paths are being used here, whereas with volumes I would expect them to be named and referenced for use within a container's file system, mounted at the appropriate spot.

There's a reason for that which I am happy to explain, but I'm 99.9% sure it's irrelevant to the problem we are sleuthing.

brucehoff commented 5 years ago

Perhaps the issue is somehow related to the use of a relative path to the workflow entry point. I have made it absolute and will try rerunning the workflow with this change to the Hook: https://github.com/Sage-Bionetworks/SynapseWorkflowHook/commit/e96857224f41ec100d6247ce205ebb0c654a7f5a

Result: It worked!

thomasyu888 commented 5 years ago

Thanks for the sleuthing and fix @brucehoff. I will run a workflow tomorrow that will take in a docker submission to see if it works.

brucehoff commented 5 years ago

To summarize, after making a small change to the workflow hook and rerunning it as described above I was able to submit to Synapse and have the workflow run in a Toil container.

@thomasyu888, @jprosser, and @trberg, as a next step would you like to try running the updated hook? Please note that in the UW environment I don't expect the Hook to be able to run workflows which themselves run containers, because the Toil container would have to be run in privileged mode. If this is a requirement we can add a parameter to the Hook to run Toil in privileged mode. Let me know.

brucehoff commented 5 years ago

@thomasyu888 I think you will immediately hit the issue of Toil not being in privileged mode so I added the necessary parameter. Please see https://github.com/Sage-Bionetworks/SynapseWorkflowHook/commit/b291a761ed06d5bca0d14b5d23c328d7195f374f

The updated instructions for non-compose execution:

export DOCKER_ENGINE_URL=unix:///var/run/docker.sock
export SYNAPSE_USERNAME=xxxxx
export SYNAPSE_PASSWORD=xxxxx
export WORKFLOW_OUTPUT_ROOT_ENTITY_ID=synXXXXX
export TOIL_CLI_OPTIONS="--defaultMemory 100M --retryCount 0 --defaultDisk 1000000"
export EVALUATION_TEMPLATES='{"xxxxx":"synXXXXX"}'
export MAX_CONCURRENT_WORKFLOWS=2
export SUBMITTER_NOTIFICATION_MASK=28
export COMPOSE_PROJECT_NAME=workflow_orchestrator
export RUN_WORKFLOW_CONTAINER_IN_PRIVILEGED_MODE=true

docker volume create ${COMPOSE_PROJECT_NAME}_shared

docker pull sagebionetworks/synapseworkflowhook

docker run -v ${COMPOSE_PROJECT_NAME}_shared:/shared:rw -v /var/run/docker.sock:/var/run/docker.sock:rw \
-e DOCKER_ENGINE_URL=${DOCKER_ENGINE_URL} \
-e SYNAPSE_USERNAME=${SYNAPSE_USERNAME} \
-e SYNAPSE_PASSWORD=${SYNAPSE_PASSWORD} \
-e WORKFLOW_OUTPUT_ROOT_ENTITY_ID=${WORKFLOW_OUTPUT_ROOT_ENTITY_ID} \
-e EVALUATION_TEMPLATES=${EVALUATION_TEMPLATES} \
-e NOTIFICATION_PRINCIPAL_ID=${NOTIFICATION_PRINCIPAL_ID} \
-e SHARE_RESULTS_IMMEDIATELY=${SHARE_RESULTS_IMMEDIATELY} \
-e DATA_UNLOCK_SYNAPSE_PRINCIPAL_ID=${DATA_UNLOCK_SYNAPSE_PRINCIPAL_ID} \
-e TOIL_CLI_OPTIONS="${TOIL_CLI_OPTIONS}" \
-e MAX_CONCURRENT_WORKFLOWS=${MAX_CONCURRENT_WORKFLOWS} \
-e SUBMITTER_NOTIFICATION_MASK=${SUBMITTER_NOTIFICATION_MASK} \
-e COMPOSE_PROJECT_NAME=${COMPOSE_PROJECT_NAME} \
-e RUN_WORKFLOW_CONTAINER_IN_PRIVILEGED_MODE=${RUN_WORKFLOW_CONTAINER_IN_PRIVILEGED_MODE} \
--privileged \
sagebionetworks/synapseworkflowhook

trberg commented 5 years ago

Alright! This seems to have solved the issue. I ran the above command and am now getting errors related to the submitted Python script rather than the Toil workflow.

Thank you @brucehoff and @thomasyu888 for your help with this!

thomasyu888 commented 5 years ago

@brucehoff The fix you provided did not resolve the issue.

brucehoff commented 5 years ago

@thomasyu888, what are the symptoms?

Verified that the host has the latest images:

[bruce.hoffSAGE@con6 ~]$ docker pull sagebionetworks/synapseworkflowhook 
Using default tag: latest
Trying to pull repository docker.io/sagebionetworks/synapseworkflowhook ... 
latest: Pulling from docker.io/sagebionetworks/synapseworkflowhook
Digest: sha256:5485c7f30fb44d1242eec50d4a2036489c5125e9566fde42255faea2a8559efb
Status: Image is up to date for docker.io/sagebionetworks/synapseworkflowhook:latest
[bruce.hoffSAGE@con6 ~]$ docker pull sagebionetworks/synapseworkflowhook-toil 
Using default tag: latest
Trying to pull repository docker.io/sagebionetworks/synapseworkflowhook-toil ... 
latest: Pulling from docker.io/sagebionetworks/synapseworkflowhook-toil
Digest: sha256:8b6c0c13de69a8d599adbc7c923eb5fdda0ee914cd2911bca23b4f5f310baae4
Status: Image is up to date for docker.io/sagebionetworks/synapseworkflowhook-toil:latest

brucehoff commented 5 years ago

@thomasyu888 a more specific question: Do you have an example of a workflow working directory of the form:

/var/lib/docker/volumes/workflow_orchestrator_shared/_data/<uuid>/

that was created with the latest version of the workflow hook? Can we see the permissions on the folder (as well as the subfolder(s) created by Toil) to see if the 'umask' command produced the intended effect?
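
Output from something like the following would answer both questions (hypothetical commands; <uuid> is a placeholder as above):

sudo ls -ld /var/lib/docker/volumes/workflow_orchestrator_shared/_data/<uuid>    # permissions on the workflow's working directory
sudo ls -la /var/lib/docker/volumes/workflow_orchestrator_shared/_data/<uuid>    # permissions on the subfolders/files Toil created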

thomasyu888 commented 5 years ago

@brucehoff Actually, there is something I would like to try. Yesterday we worked out that adding the 'z' option to the docker run volume mount allowed bind mounts to work. So I wonder if the same z would work here. To be specific:

docker run -v /path/to/volume/:/output:z ....

jprosser commented 5 years ago

Please be careful with that option; it will auto-create labels and could really mess up the system. This is likely the cause of our problems before, where the whole /var/run got relabeled on the host, which basically trashed the host and led us to just recreate it rather than try to recover from that event.


thomasyu888 commented 5 years ago

Thanks @jprosser. I see this on the Docker site: https://docs.docker.com/storage/bind-mounts/. Should I use z or Z?

If you use selinux you can add the z or Z options to modify the selinux label of the host file or directory being mounted into the container. This affects the file or directory on the host machine itself and can have consequences outside of the scope of Docker.

The z option indicates that the bind mount content is shared among multiple containers. The Z option indicates that the bind mount content is private and unshared. Use extreme caution with these options. Bind-mounting a system directory such as /home or /usr with the Z option renders your host machine inoperable and you may need to relabel the host machine files by hand.

Important: When using bind mounts with services, selinux labels (:Z and :z), as well as :ro are ignored. See moby/moby #32579 for details.

This example sets the z option to specify that multiple containers can share the bind mount’s contents:

$ docker run -d \
  -it \
  --name devtest \
  -v "$(pwd)"/target:/app:z \
  nginx:latest

It is not possible to modify the selinux label using the --mount flag.

Also @brucehoff, specifying the z or Z with the output bind allows us to write to /output. See https://github.com/Sage-Bionetworks/ChallengeWorkflowTemplates/blob/temp/run_docker.cwl#L91-L92.

If we decide that the way we use z or Z is secure, you probably can revert the changes you made with umask?

thomasyu888 commented 5 years ago

https://www.projectatomic.io/blog/2015/06/using-volumes-with-docker-can-cause-problems-with-selinux/

trberg commented 5 years ago

So I've gotten my debug docker submission to run all the way through the pipeline using z and z,ro to mount the volumes. However, I only used these options when we mounted volumes in the training and inference script, and not with the docker.sock.

trberg commented 5 years ago

The debug docker submission includes reading and writing data to volumes, including writing a predictions.csv file to the output.

thomasyu888 commented 5 years ago

One other solution we discussed for when we bind the training data is to create a docker volume with the hosted data. So:

$ ls test
wowow
docker volume create --name testing -o device=/data/users/thomas.yuSAGE/test -o o=bind
docker run -ti -v testing:/input ubuntu bash
root@a9ce47dca371:/# ls input/
wowow

An interesting discovery: after I create this volume, I also don't experience the permission error if I mount the directory.

docker run -ti -v /data/users/thomas.yuSAGE/test:/input ubuntu bash
root@14aff9b181bc:/# ls input/
wowow

But... if I create a new directory and don't create a docker volume:

mkdir wow
touch wow/see
docker run -ti -v /data/users/thomas.yuSAGE/wow:/input ubuntu bash
root@aee71b11983e:/# ls input/
ls: cannot open directory 'input/': Permission denied

Does docker "relabel" the directory when a volume is explicitly created?

brucehoff commented 5 years ago

you probably can revert the changes you made with umask?

Once you have determined that everything works, let me know and we can revert the change and then test again (to make sure the reversion doesn't break anything).

thomasyu888 commented 5 years ago

I would like your (@jprosser, @brucehoff, @trberg) opinions on the Z and z mount options, as the extent of my knowledge is what I have read. The options currently are:

  1. Appending z or Z (not sure which is the correct one). This does indeed allow us to read and write to /output and read data from /train.
  2. umask / chmod 777 so that the mounted volumes have the correct permissions. (Haven't gotten this working completely, but we confirmed that changing permissions on the folder itself does give the docker container access.)
  3. Using the docker volume for the training data as I listed above in https://github.com/Sage-Bionetworks/SynapseWorkflowHook/issues/45#issuecomment-514312368. (This doesn't solve the issue of the /output directory.)
  4. Run the submission docker in privileged state - NOT GOOD.

Thanks for all the sleuthing.

brucehoff commented 5 years ago

Since there were concerns about using z, I suggest continuing to pursue the approach of changing the sharing permissions on the mounted directory (choice 2 above). To do so, please start by answering my earlier question: https://github.com/Sage-Bionetworks/SynapseWorkflowHook/issues/45#issuecomment-514266402
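
For illustration, choice 2 on the host could look something like this (a hedged sketch only; the exact mode bits, and whether the Hook's umask change already covers this, are still open questions):

sudo chmod -R a+rX /var/lib/docker/volumes/workflow_orchestrator_shared/_data/<uuid>   # let a non-root container user traverse and read the workflow's working directory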

jprosser commented 5 years ago

Ideally volumes would be used here to keep everything within the container world, at least for anything the containers touch. If there's a need to cross that container/host barrier, then permissions should be intentionally managed; if not, we end up with things like auto-labeling and 777 permissions, which drop that barrier in the most open way possible. I'd guess Docker bind mounts are probably the best way to punch through the container/host barrier, but you're still going to need permission management in the end.

-Justin

jprosser commented 5 years ago

@thomasyu888 a more specific question: Do you have an example of a workflow working directory of the form:

/var/lib/docker/volumes/workflow_orchestrator_shared/_data/<uuid>/

that was created with the latest version of the workflow hook? Can we see the permissions on the folder (as well as the subfolder(s) created by Toil) to see if the 'umask' command produced the intended effect?

In our non-root user scenario, this is not a location accessible to any user login. A user with Docker permission can certainly affect this location, but cannot access it directly.