Sage-Bionetworks / SynapseWorkflowHook

Code for linking a workflow engine to a Synapse evaluation queue
Apache License 2.0

Permission Error when running docker-compose #45

Open trberg opened 5 years ago

trberg commented 5 years ago

We are hitting a permissions error when running "docker-compose --verbose up", even with sudo:

workflow-hook_1  | [INFO] BUILD FAILURE
workflow-hook_1  | [INFO] ------------------------------------------------------------------------
workflow-hook_1  | [INFO] Total time:  4.398 s
workflow-hook_1  | [INFO] Finished at: 2019-06-14T23:25:23Z
workflow-hook_1  | [INFO] ------------------------------------------------------------------------
workflow-hook_1  | [ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.2.1:java (default-cli) on project WorkflowHook: An exception occured while executing the Java class. null: InvocationTargetException: org.newsclub.net.unix.AFUNIXSocketException: Permission denied (socket: /run/docker.sock) -> [Help 1]

We find we can bypass this error by running docker-compose in privileged mode. However, we then run into another permission error further down the CWL pipeline when trying to pull in Docker containers.

STDERR: 2019-06-13T21:33:19.533538167Z WARNING:toil.leader:d/T/jobIXsDkh    Traceback (most recent call last):
STDERR: 2019-06-13T21:33:19.533545557Z WARNING:toil.leader:d/T/jobIXsDkh      File "runDocker.py", line 157, in <module>
STDERR: 2019-06-13T21:33:19.533553710Z WARNING:toil.leader:d/T/jobIXsDkh        main(args)
STDERR: 2019-06-13T21:33:19.533561110Z WARNING:toil.leader:d/T/jobIXsDkh      File "runDocker.py", line 54, in main
STDERR: 2019-06-13T21:33:19.533568944Z WARNING:toil.leader:d/T/jobIXsDkh        for cont in client.containers.list(all=True):
STDERR: 2019-06-13T21:33:19.533576527Z WARNING:toil.leader:d/T/jobIXsDkh      File "/usr/local/lib/python2.7/site-packages/docker/models/containers.py", line 824, in list
STDERR: 2019-06-13T21:33:19.533586174Z WARNING:toil.leader:d/T/jobIXsDkh        since=since)
STDERR: 2019-06-13T21:33:19.533593970Z WARNING:toil.leader:d/T/jobIXsDkh      File "/usr/local/lib/python2.7/site-packages/docker/api/container.py", line 191, in containers
STDERR: 2019-06-13T21:33:19.533611794Z WARNING:toil.leader:d/T/jobIXsDkh        res = self._result(self._get(u, params=params), True)
STDERR: 2019-06-13T21:33:19.533620087Z WARNING:toil.leader:d/T/jobIXsDkh      File "/usr/local/lib/python2.7/site-packages/docker/utils/decorators.py", line 46, in inner
STDERR: 2019-06-13T21:33:19.533627987Z WARNING:toil.leader:d/T/jobIXsDkh        return f(self, *args, **kwargs)
STDERR: 2019-06-13T21:33:19.533635597Z WARNING:toil.leader:d/T/jobIXsDkh      File "/usr/local/lib/python2.7/site-packages/docker/api/client.py", line 189, in _get
STDERR: 2019-06-13T21:33:19.533643460Z WARNING:toil.leader:d/T/jobIXsDkh        return self.get(url, **self._set_request_timeout(kwargs))
STDERR: 2019-06-13T21:33:19.533651040Z WARNING:toil.leader:d/T/jobIXsDkh      File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 546, in get
STDERR: 2019-06-13T21:33:19.533658917Z WARNING:toil.leader:d/T/jobIXsDkh        return self.request('GET', url, **kwargs)
STDERR: 2019-06-13T21:33:19.533666414Z WARNING:toil.leader:d/T/jobIXsDkh      File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 533, in request
STDERR: 2019-06-13T21:33:19.533674287Z WARNING:toil.leader:d/T/jobIXsDkh        resp = self.send(prep, **send_kwargs)
STDERR: 2019-06-13T21:33:19.533681737Z WARNING:toil.leader:d/T/jobIXsDkh      File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 646, in send
STDERR: 2019-06-13T21:33:19.533689548Z WARNING:toil.leader:d/T/jobIXsDkh        r = adapter.send(request, **kwargs)
STDERR: 2019-06-13T21:33:19.533697004Z WARNING:toil.leader:d/T/jobIXsDkh      File "/usr/local/lib/python2.7/site-packages/requests/adapters.py", line 498, in send
STDERR: 2019-06-13T21:33:19.533705168Z WARNING:toil.leader:d/T/jobIXsDkh        raise ConnectionError(err, request=request)
STDERR: 2019-06-13T21:33:19.533712621Z WARNING:toil.leader:d/T/jobIXsDkh    requests.exceptions.ConnectionError: ('Connection aborted.', error(13, 'Permission denied'))
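
Both errors above come down to a process being refused access to the Docker socket. A minimal sketch for checking that from inside the affected container, assuming the socket is at /run/docker.sock as in the first log (the function name is hypothetical). It only looks at plain Unix ownership and mode bits, so an SELinux denial can still occur even when it reports the socket as writable:

```python
import os
import stat


def describe_socket_access(path="/run/docker.sock"):
    """Report ownership of `path` and whether this process may read/write it.

    Only classic Unix permissions are consulted here; SELinux policy is not.
    """
    try:
        info = os.stat(path)
    except OSError as err:
        # Covers a missing path as well as an unreadable parent directory.
        return {"path": path, "exists": False, "error": err.strerror}
    return {
        "path": path,
        "exists": True,
        "is_socket": stat.S_ISSOCK(info.st_mode),
        "owner_uid": info.st_uid,
        "group_gid": info.st_gid,
        "mode": oct(stat.S_IMODE(info.st_mode)),
        "readable": os.access(path, os.R_OK),
        "writable": os.access(path, os.W_OK),
    }


if __name__ == "__main__":
    print(describe_socket_access())
```

If this reports the socket as writable and the connection is still refused, that would point towards SELinux rather than file permissions.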

We are using Red Hat (which doesn't support docker-compose) as our OS and are running Docker version 1.13.1.

Our reference evaluation pipeline is located here: https://github.com/Sage-Bionetworks/EHR-challenge and is correctly being pulled into the running pipeline.

We had this pipeline up and running at one point but had to restart the VM and now it's broken. The restart updated the OS and docker version but didn't radically change anything.

Any insight would be helpful to troubleshoot this issue.

Thank you

thomasyu888 commented 5 years ago

Hmmm. I seem to be incorrect in my comment... Apologies! However, as z does "relabelling", I'm now unsure if the z option had an impact on the /var/lib/docker/volumes/workflow_orchestrator_shared/_data/ directory which allows for a normal :rw mount now. As the documentation said, using z does in fact change things on the host machine. Prior to using z, I was getting the same error (Permission denied).

/run.sh: 2: /run.sh: cannot create /output/predictions.csv: Permission denied

@brucehoff I unfortunately do not have a directory I could give you. It is also entirely possible that umask fixed the issue...

So @jprosser, based on your comment, you are suggesting #3 in my comment https://github.com/Sage-Bionetworks/SynapseWorkflowHook/issues/45#issuecomment-514325012, which does indeed work for the input training data.

However, I'm not sure how we would create the docker volume for /output on the fly (the issue here is that CWL needs to be able to link to the specified output). So the options for that are umask or the Z option.

jprosser commented 5 years ago

@thomasyu888 yep, I would prefer that.

So in our UW systems (RHEL-based), we have a label, "container_file_t", which allows container processes to write to that location (as policy exists for such an activity on that label). Users are basically unaffected by this in our environment (they could also be constrained, but we don't go that far), though they are still subject to Unix permissions of course.

So if a container knew the uid of the user controlling all this, user root in that container could change the owner of some file/dir to that external uid, and then the user on the outside would own that file/dir. Since we generally operate within /data for users and data, we could do this for, say, /data/common, and create a way for non-root host users to interact with containers that run as root and would otherwise generate only uid=0 files and directories (a 777 mode would achieve the same, but this really is a bad idea, especially with that execute bit set).
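
A rough sketch of that idea, assuming the container runs as root and is told the host user's uid and gid through environment variables (HOST_UID and HOST_GID are hypothetical names, as is the function). It is not something the hook does today:

```python
import os
import tempfile


def chown_tree(root, uid, gid):
    """Give uid:gid ownership of `root` and everything beneath it.

    Symbolic links are re-owned themselves rather than followed, and
    os.walk does not descend into linked directories by default.
    Returns the number of paths that were changed.
    """
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, name) for name in dirnames + filenames)
    for path in paths:
        os.chown(path, uid, gid, follow_symlinks=False)
    return len(paths)


if __name__ == "__main__":
    # Demo on a throwaway directory, using the current user's own ids.
    with tempfile.TemporaryDirectory() as base:
        open(os.path.join(base, "predictions.csv"), "w").close()
        print(chown_tree(base, os.getuid(), os.getgid()))
```

Inside a container this would be called at the end of a run as something like chown_tree("/output", int(os.environ["HOST_UID"]), int(os.environ["HOST_GID"])), so the files end up owned by the outside user without resorting to a 777 mode.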

brucehoff commented 5 years ago

@thomasyu888, you say:

It is also entirely possible that umask fixed the issue...

Can you explain? Earlier you said that the umask approach failed to fix the issue. Do you have evidence that the change allowed things to work?

thomasyu888 commented 5 years ago

@brucehoff, let me walk you through what I did:

  1. Last night you provided the umask approach, so I ran the workflow hook and received a permission denied error. It's possible that I wasn't using the newest version or that something else happened...
  2. After encountering the error, I thought to use z when mounting /output as well (since it worked for mounting the /train data). I did that after Step 1 failed and found that the workflow hook ran with no more permission issues.
  3. I removed the z option to test whether the umask approach works, BUT... I read that z actually changes the host, so I wonder if step 2 changed something that allows the hook to run with or without the umask approach.

Does that make sense? One way to test it is to remove the umask approach to see if the hook will run into permission issues.
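
For anyone following along, here is a small illustration of the mechanism the umask approach relies on. The hook's actual change is not shown in this thread, so the function name and the demo paths are assumptions; the sketch only shows how the umask decides the mode of a newly created directory:

```python
import os
import stat
import tempfile


def mkdir_with_umask(path, umask):
    """Create `path` requesting mode 0o777 under the given umask.

    The previous umask is restored afterwards. Returns the permission
    bits the directory actually received.
    """
    previous = os.umask(umask)
    try:
        os.mkdir(path, 0o777)
    finally:
        os.umask(previous)
    return stat.S_IMODE(os.stat(path).st_mode) & 0o777


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        # A typical default umask strips write access for group and others.
        print(oct(mkdir_with_umask(os.path.join(base, "default"), 0o022)))
        # A cleared umask leaves the directory writable by any uid.
        print(oct(mkdir_with_umask(os.path.join(base, "open"), 0o000)))
```

Checking the mode of the directory the hook creates would be one way to tell whether the umask change took effect, independently of any relabeling done by z.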

brucehoff commented 5 years ago

I read that z actually changes the host

Fascinating: I find that I am now unable to reproduce the original error. Could it be that you have modified the host somehow? Here's what I now see:

# create a volume
[bruce.hoffSAGE@con6 ~]$ docker volume create hoff_test1
hoff_test1

# mount the volume to a container running in privileged mode and create a subfolder
[bruce.hoffSAGE@con6 ~]$ docker run -it --rm -v hoff_test1:/test  --privileged  ubuntu bash
root@09b7da83d9ca:/# mkdir /test/privileged_subdir
root@09b7da83d9ca:/# exit
exit

# mount the volume to a container NOT running in privileged mode and write to the subfolder
[bruce.hoffSAGE@con6 ~]$ docker run -it --rm -v hoff_test1:/test  ubuntu bash
root@9ee8462bec15:/# touch /test/privileged_subdir/somefile.txt
root@9ee8462bec15:/# ls -l /test/privileged_subdir
total 0
-rw-r--r--. 1 root root 0 Jul 23 20:38 somefile.txt
root@9ee8462bec15:/# exit

Could you revert the change you made to the host so we are back in the original situation?

One way to test it is to remove the umask approach to see if the hook will run into permission issues.

The test above does that: When I create the folder it has the normal permissions, not '777':

[bruce.hoffSAGE@con6 ~]$ docker run -it --rm -v hoff_test1:/test  --privileged  ubuntu bash
root@86572f347d21:/# cd /test 
root@86572f347d21:/test# ls -al
total 8
drwxr-xr-x.  3 root root 4096 Jul 23 20:38 .
drwxr-xr-x. 22 root root  254 Jul 23 20:42 ..
drwxr-xr-x.  2 root root 4096 Jul 23 20:38 privileged_subdir

As you can see it has 755 permissions.
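
As a quick cross-check of that reading, here is a sketch that converts the mode column of `ls -l` into octal digits (the function name is hypothetical, and setuid/setgid/sticky bits are ignored). The trailing "." in the listings above is how GNU ls marks a file that carries an SELinux security context; it is not part of the permissions:

```python
import stat


def octal_from_ls(mode_field):
    """Convert an `ls -l` mode field such as 'drwxr-xr-x.' to '755'."""
    perms = mode_field[1:10]  # skip the file-type character
    if len(perms) != 9:
        raise ValueError("unexpected mode field: %r" % mode_field)
    digits = []
    for start in range(0, 9, 3):
        read, write, execute = perms[start:start + 3]
        value = 0
        if read == "r":
            value += 4
        if write == "w":
            value += 2
        if execute in ("x", "s", "t"):  # lowercase s/t imply execute is set
            value += 1
        digits.append(str(value))
    return "".join(digits)


if __name__ == "__main__":
    print(octal_from_ls("drwxr-xr-x."))
    # The standard library can go the other way, from a mode to the string.
    print(stat.filemode(0o040755))
```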

jprosser commented 5 years ago

@brucehoff The changes that Docker makes with SELinux are not persistent, so if the file system is relabeled for whatever reason, you will be back to the original state. Let me know if you'd like that done.

brucehoff commented 5 years ago

@thomasyu888 Please see @jprosser 's offer, above. I would like to restore the host to its original state. Do you agree?

thomasyu888 commented 5 years ago

I agree. Please restore back to original state.

jprosser commented 5 years ago

I stopped dockerd, relabeled the home directories, then started dockerd just now. I didn't touch the root file system (so /var/lib/docker/volumes) in this relabeling process. Let me know if you'd also like to reset /var/lib/docker/volumes which by default is no write (I don't know offhand whether Docker will recover from that, but it certainly should).

brucehoff commented 5 years ago

I still cannot reproduce the original problem:

[bruce.hoffSAGE@con6 ~]$ docker volume rm hoff_test1
hoff_test1
[bruce.hoffSAGE@con6 ~]$ docker volume create hoff_test1
hoff_test1
[bruce.hoffSAGE@con6 ~]$ docker run -it --rm -v hoff_test1:/test  --privileged  ubuntu bash
root@45e6e1b286aa:/# mkdir /test/privileged_subdir
root@45e6e1b286aa:/# exit
exit
[bruce.hoffSAGE@con6 ~]$ docker run -it --rm -v hoff_test1:/test  ubuntu bash
root@7caf15f364bf:/# touch /test/privileged_subdir/somefile.txt
root@7caf15f364bf:/# ls -l /test/privileged_subdir
total 0
-rw-r--r--. 1 root root 0 Jul 23 21:27 somefile.txt
root@7caf15f364bf:/# exit

I didn't touch the root file system (so /var/lib/docker/volumes)

Maybe that's why I see no change.

Let me know if you'd also like to reset /var/lib/docker/volumes which by default is no write

I don't understand. What does "no write" mean in the context of a collection of writeable volumes?

thomasyu888 commented 5 years ago

Now I'm getting errors again:

# No docker volume
[thomas.yuSAGE@con6 ~]$ docker run -ti -v /data/users/thomas.yuSAGE/temp/:/train ubuntu bash
root@966ad861a057:/# ls train/
ls: cannot open directory 'train/': Permission denied
root@966ad861a057:/# exit

# create docker volume
[thomas.yuSAGE@con6 ~]$ docker volume create --name tom_testing -o device=/data/users/thomas.yuSAGE/temp -o o=bind
tom_testing
[thomas.yuSAGE@con6 ~]$ docker volume inspect tom_testing
[
    {
        "Driver": "local",
        "Labels": {},
        "Mountpoint": "/var/lib/docker/volumes/tom_testing/_data",
        "Name": "tom_testing",
        "Options": {
            "device": "/data/users/thomas.yuSAGE/temp",
            "o": "bind"
        },
        "Scope": "local"
    }
]

# run the same mount as above, this time through the named volume
[thomas.yuSAGE@con6 ~]$ docker run -ti -v tom_testing:/train ubuntu bash
root@966ad861a057:/# ls train/
ls: cannot open directory 'train/': Permission denied
root@966ad861a057:/# exit

Looks like the z and Z options really did make a big difference.

EDIT: The volume I created here already existed before, which caused it to not work.
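
Since a stale volume with the same name was the culprit here, it can help to confirm what a volume actually points at before using it. A minimal sketch that reads the JSON printed by `docker volume inspect` and lists the host directory behind each bind-backed volume (the function name is hypothetical; the sample is the output shown above):

```python
import json


def bind_sources(inspect_output):
    """Map volume name -> host directory for bind-backed local volumes."""
    sources = {}
    for volume in json.loads(inspect_output):
        options = volume.get("Options") or {}  # null when no options were given
        mount_flags = options.get("o", "").split(",")
        if "bind" in mount_flags and "device" in options:
            sources[volume["Name"]] = options["device"]
    return sources


SAMPLE = """
[
    {
        "Driver": "local",
        "Labels": {},
        "Mountpoint": "/var/lib/docker/volumes/tom_testing/_data",
        "Name": "tom_testing",
        "Options": {
            "device": "/data/users/thomas.yuSAGE/temp",
            "o": "bind"
        },
        "Scope": "local"
    }
]
"""

if __name__ == "__main__":
    print(bind_sources(SAMPLE))
```

A volume that is missing from the result was created without the bind options, which would be a hint that an older volume of the same name is being reused.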

trberg commented 5 years ago

So I know @jprosser isn't a huge fan of the z option, but if we apply it only to the data folders (as z,ro) and to the /model, /scratch, and /output folders (as z), will that cause a lot of problems? Previously, we had applied z to /var/run and that really messed us up, but we're currently getting around that with privileged. So can we proceed with the z flag very carefully?
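
To make the proposal concrete, here is a sketch of how those mounts could be spelled out as `source:target:mode` strings, the same form `docker run -v` accepts. The function name and the host paths in the demo are placeholders, not the real challenge directories:

```python
def build_binds(read_only_data, writable_dirs):
    """Return `source:target:mode` bind strings.

    Data folders get 'z,ro' (relabel, read-only); working folders get
    'z' (relabel, read-write).
    """
    binds = []
    for source, target in sorted(read_only_data.items()):
        binds.append("{}:{}:z,ro".format(source, target))
    for source, target in sorted(writable_dirs.items()):
        binds.append("{}:{}:z".format(source, target))
    return binds


if __name__ == "__main__":
    # Hypothetical host paths, for illustration only.
    for bind in build_binds(
        {"/data/example/train": "/train"},
        {
            "/data/example/model": "/model",
            "/data/example/scratch": "/scratch",
            "/data/example/output": "/output",
        },
    ):
        print("-v", bind)
```

Keeping the list explicit like this at least makes it easy to review exactly which host directories would be relabeled.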

thomasyu888 commented 5 years ago

My understanding is that it changes not only the directory we mount, but possibly also the directories in which the mounted directory lives. That would explain why I was able to do what I did here: https://github.com/Sage-Bionetworks/SynapseWorkflowHook/issues/45#issuecomment-514395466 without getting an error in the end.

jprosser commented 5 years ago

We're all set, I believe, with volumes, as @thomasyu888 has recently found: they create the right kinds of permissions automatically. However, user root within the container will still create root-owned files in the user's home dir (or shared area), which the user can't readily access.

jprosser commented 5 years ago

@brucehoff to your question on /var/lib/docker/volumes, this has a generic label that prevents container writes. As dockerd doesn't make permanent changes to SELinux policy, a relabeling here would mean dockerd on start would need to fix labels as appropriate (adding container_file_t back) to enable containers to write to these locations (which it probably would do).

That z,Z option is scary since it knocks out what the system is doing, and so if someone wanted to break the host they could just do /var:/var:z and wedge it up. Not entirely different than as root doing something like rm -rf /. Both are a pretty good denial of service.
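
If the z flag does get used, a guard along these lines could refuse to relabel system locations before a mount is ever handed to Docker. This is only a sketch: the function name and the list of protected trees are assumptions and would need to match the actual host layout:

```python
import os

# Hypothetical set of trees that should never be relabeled for containers.
PROTECTED_TREES = ("/var", "/run", "/etc", "/usr", "/boot", "/proc", "/sys", "/dev")


def relabel_is_risky(host_path):
    """Return True if mounting `host_path` with :z or :Z looks dangerous."""
    if not host_path.startswith("/"):
        raise ValueError("expected an absolute host path: %r" % host_path)
    # Collapse '..' segments and repeated slashes before comparing.
    path = "/" + os.path.normpath(host_path).lstrip("/")
    if path == "/":
        return True
    return any(
        path == tree or path.startswith(tree + "/") for tree in PROTECTED_TREES
    )


if __name__ == "__main__":
    for candidate in ("/var", "/var/run/docker.sock", "/data/common/dream"):
        print(candidate, relabel_is_risky(candidate))
```

This does nothing to stop someone running docker by hand with /var:/var:z, but it would keep the pipeline itself from doing so by accident.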

jprosser commented 5 years ago

@thomasyu888 It looks to me like creating a volume applies the labeling one time only, so if we reset the labels as we did, dockerd doesn't come back around and reapply them as it did on creation.

thomasyu888 commented 5 years ago

Right. Thanks @jprosser .

We have resolved the first issue of binding the training data. Workflow here:

# Create temp directory and temp files
[thomas.yuSAGE@con6 ~]$ mkdir temp
[thomas.yuSAGE@con6 ~]$ touch temp/foo temp/roo
# Show volumes
[thomas.yuSAGE@con6 ~]$ docker volume ls
DRIVER              VOLUME NAME
local               hoff_test1
local               workflow_orchestrator_shared
[thomas.yuSAGE@con6 ~]$ ls temp/
foo  roo
# Create new volume mounting device
[thomas.yuSAGE@con6 ~]$  docker volume create --name tom_testing -o device=/data/users/thomas.yuSAGE/temp -o o=bind
tom_testing

[thomas.yuSAGE@con6 ~]$ docker inspect tom_testing
[
    {
        "Driver": "local",
        "Labels": {},
        "Mountpoint": "/var/lib/docker/volumes/tom_testing/_data",
        "Name": "tom_testing",
        "Options": {
            "device": "/data/users/thomas.yuSAGE/temp",
            "o": "bind"
        },
        "Scope": "local"
    }
]
# Use volume name
[thomas.yuSAGE@con6 ~]$ docker run -ti -v tom_testing:/train ubuntu bash
root@ed289c26565e:/# ls train/
foo  roo
root@ed289c26565e:/# exit
exit
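
The volume-creation step above could be scripted roughly as follows. This sketch only builds the argument list for the same command used in the session (the function name is hypothetical), so the exact invocation can be inspected before anything runs:

```python
def volume_create_argv(name, host_dir):
    """Build the `docker volume create` command for a bind-backed volume."""
    return [
        "docker", "volume", "create",
        "--name", name,
        "-o", "device=" + host_dir,
        "-o", "o=bind",
    ]


if __name__ == "__main__":
    argv = volume_create_argv("tom_testing", "/data/users/thomas.yuSAGE/temp")
    print(" ".join(argv))
```

The list can be passed to subprocess.run(argv, check=True) on a host where the Docker CLI is available.
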

trberg commented 5 years ago

So I've tried to replicate this on the challenge server. When I create a volume and mount a data folder that has not previously had the z flag, I still run into permission errors when I later use that volume in the pipeline.

Did I miss a step?

[trberg@con4 dream]$ docker volume create --name uw_train -o device=/data/common/dream/data/UW_OMOP/train -o o=bind
[trberg@con4 dream]$ docker volume inspect uw_train
[
    {
        "Driver": "local",
        "Labels": {},
        "Mountpoint": "/var/lib/docker/volumes/uw_train/_data",
        "Name": "uw_train",
        "Options": {
            "device": "/data/common/dream/data/UW_OMOP/train",
            "o": "bind"
        },
        "Scope": "local"
    }
]

Then later in the run_training_docker.cwl

input_dir = "uw_train"
mounted_volumes = {scratch_dir: '/scratch:z',
                   input_dir: '/train:ro',
                   model_dir: '/model:z'}

The resulting error is:

all files in /train
Traceback (most recent call last):
  File "/app/train.py", line 22, in <module>
    for i in os.listdir("/train"):
PermissionError: [Errno 13] Permission denied: '/train'

However, running the ubuntu test seems to work:

[trberg@con4 dream]$ docker run -it --rm -v uw_train:/data:ro ubuntu bash
root@e26acaeeb1b8:/# ls data
condition_occurrence.csv  death.csv  drug_exposure.csv  person.csv  visit_occurrence.csv
root@e26acaeeb1b8:/# touch data/death.csv 
touch: cannot touch 'data/death.csv': Read-only file system

thomasyu888 commented 5 years ago

Did you update your input_dir to be uw_train? Hmm.... I wouldn't think that Ubuntu has anything to do with it.

trberg commented 5 years ago

Yep, I replaced the absolute path. [image]

thomasyu888 commented 5 years ago

So it's very strange... When I do:

docker run -ti -v tom_testing:/train docker.synapse.org/syn18405992/debug:v1 bash
root@c0ab9356ed2b:/app# bash /app/train.sh 
current working directory: /app
all files in /app
train.py
infer.py
infer.sh
train.sh
/train exists: True
/train/visit_occurrence.csv exists: False

/train and file permission mask: 775
all files in /train
roo
foo
/model exists: False
/scratch exists: False

It might be possible that something is happening when the submission is run with the docker socket that is mounted into the toil container.

trberg commented 5 years ago

Yeah, I get the same

[trberg@con4 dream]$ docker run -ti -v uw_train:/train docker.synapse.org/syn18405992/debug:v1 bash
root@9a04ae8947ce:/app# ls
infer.py  infer.sh  train.py  train.sh
root@9a04ae8947ce:/app# bash train.sh
current working directory: /app
all files in /app
train.py
infer.py
infer.sh
train.sh
/train exists: True
/train/visit_occurrence.csv exists: True

/train and file permission mask: 775
all files in /train
visit_occurrence.csv
person.csv
drug_exposure.csv
condition_occurrence.csv
death.csv
/model exists: False
/scratch exists: False

jprosser commented 5 years ago

@trberg I see you had the following with the :z option:

mounted_volumes = {scratch_dir:'/scratch:z', input_dir:'/train:ro', model_dir:'/model:z'}

But looking at the labeling on those files and directories, I see the label container_var_lib_t, which is what Docker uses for locations not accessed by containers, whereas normally that would be container_file_t for container-accessible files, so it seems the :z prevented container access.

Indeed, after firing up a container just to look at it (-v workflow_orchestrator_shared:/data), the labeling is now container_file_t, so this is a bit of a moving target.
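
Because the label keeps changing, it may be worth logging it right before each run. A minimal sketch, assuming a Linux host where the context is stored in the security.selinux extended attribute (function names are hypothetical; the two type names are the ones mentioned above):

```python
import os

CONTAINER_WRITABLE_TYPE = "container_file_t"


def selinux_type(context):
    """Extract the type field from a 'user:role:type:level' context."""
    fields = context.split(":", 3)  # the level itself may contain colons
    if len(fields) < 3:
        raise ValueError("not an SELinux context: %r" % context)
    return fields[2]


def read_context(path):
    """Return the SELinux context stored on `path`, or None if unavailable."""
    getxattr = getattr(os, "getxattr", None)  # only present on Linux
    if getxattr is None:
        return None
    try:
        raw = getxattr(path, "security.selinux")
    except OSError:
        return None  # no label, no SELinux, or no permission to read it
    return raw.rstrip(b"\x00").decode("utf-8", "replace")


if __name__ == "__main__":
    target = "/var/lib/docker/volumes/workflow_orchestrator_shared/_data"
    context = read_context(target)
    if context is None:
        print("no SELinux context readable for", target)
    else:
        print(context, selinux_type(context) == CONTAINER_WRITABLE_TYPE)
```

Running this before and after a container mounts the volume would show whether the label flips between container_var_lib_t and container_file_t, as seen here.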