Duke-GCB / calrissian

CWL on Kubernetes
https://duke-gcb.github.io/calrissian/
MIT License

CWL 1.0 CT173 Test Docker ENTRYPOINT usage #86

Open dleehr opened 5 years ago

dleehr commented 5 years ago
Got workflow error
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/cwltool/command_line_tool.py", line 733, in collect_output
    raise WorkflowException("Did not find output file with glob pattern: '{}'".format(globpatterns))
cwltool.errors.WorkflowException: Did not find output file with glob pattern: '['cow']'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/cwltool/command_line_tool.py", line 612, in collect_output_ports
    compute_checksum=compute_checksum)
  File "/usr/local/lib/python3.6/site-packages/cwltool/command_line_tool.py", line 733, in collect_output
    raise WorkflowException("Did not find output file with glob pattern: '{}'".format(globpatterns))
  File "/usr/local/lib/python3.6/site-packages/schema_salad/sourceline.py", line 168, in __exit__
    raise self.makeError(six.text_type(exc_value))
cwltool.errors.WorkflowException: v1.0/docker-run-cmd.cwl:14:7: Did not find output file with glob pattern: '['cow']'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/cwltool/executors.py", line 264, in runner
    job.run(runtime_context)
  File "/usr/local/lib/python3.6/site-packages/calrissian/job.py", line 560, in run
    self.finish(completion_result, runtimeContext)
  File "/usr/local/lib/python3.6/site-packages/calrissian/job.py", line 360, in finish
    outputs = self.collect_outputs(self.outdir)
  File "/usr/local/lib/python3.6/site-packages/cwltool/command_line_tool.py", line 612, in collect_output_ports
    compute_checksum=compute_checksum)
  File "/usr/local/lib/python3.6/site-packages/schema_salad/sourceline.py", line 168, in __exit__
    raise self.makeError(six.text_type(exc_value))
cwltool.errors.WorkflowException: Error collecting output for parameter 'cow':
v1.0/docker-run-cmd.cwl:14:7: Did not find output file with glob pattern: '['cow']'
Workflow error, try again with --debug for more information:
Error collecting output for parameter 'cow':
v1.0/docker-run-cmd.cwl:14:7: Did not find output file with glob pattern: '['cow']'
Test 173 failed: /usr/local/bin/calrissian --max-ram 8G --max-cores 4 --default-container debian:stretch-slim --outdir=/output/tmp8yz5b99m --quiet v1.0/docker-run-cmd.cwl v1.0/empty.json
Test Docker ENTRYPOINT usage
Returned non-zero

dleehr commented 5 years ago

Tool: https://github.com/common-workflow-language/common-workflow-language/blob/master/v1.0/v1.0/docker-run-cmd.cwl
Job: https://github.com/common-workflow-language/common-workflow-language/blob/master/v1.0/v1.0/empty.json

dleehr commented 5 years ago

The tool specifies the bash:4.4.12 container, which has an ENTRYPOINT set: https://github.com/tianon/docker-bash/blob/3682e16bca63b20ab51745afa30156e2740fc5c6/4.4/Dockerfile#L126

When calrissian builds up a container to run, it appears to be providing the command in a way that's not compatible with the entrypoint.

YAML from the pod created to run this tool:

apiVersion: v1
kind: Pod
metadata:
  name: docker-run-cmd-cwl-pod-bczhwgdg
  namespace: default
spec:
  containers:
  - args:
    - -c 'echo '"'"'moo'"'"' > cow'
    command:
    - /bin/sh
    - -c
    env:
    - name: HOME
      value: /YgEWAF
    - name: TMPDIR
      value: /tmp
    image: bash:4.4.12
    imagePullPolicy: IfNotPresent
    name: docker-run-cmd-cwl-container
    resources:
      requests:
        cpu: "1"
        memory: 8Mi
    volumeMounts:
    - mountPath: /YgEWAF
      name: conformance-output-data
      subPath: mlq5pjta
    - mountPath: /tmp
      name: tmpdir
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-9wdfw
      readOnly: true
    workingDir: /YgEWAF
  volumes:
  - name: conformance-test-data
    persistentVolumeClaim:
      claimName: conformance-test-data
      readOnly: true
  - name: conformance-output-data
    persistentVolumeClaim:
      claimName: conformance-output-data
  - emptyDir: {}
    name: tmpdir
  - name: default-token-9wdfw

dleehr commented 5 years ago

According to https://kubernetes.io/docs/tasks/inject-data-application/define-command-argument-container/#notes:

Docker Entrypoint = Kubernetes command
Docker Cmd = Kubernetes args

We're building both in KubernetesPodBuilder:

https://github.com/Duke-GCB/calrissian/blob/5cba82c211d3370ffd3c53ba22a5618c56758c67/calrissian/job.py#L205-L206

https://github.com/Duke-GCB/calrissian/blob/5cba82c211d3370ffd3c53ba22a5618c56758c67/calrissian/job.py#L208-L219
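
To make the consequence of setting both fields concrete, here is a rough sketch of the override rules from the Kubernetes notes linked above. The function name `effective_argv` and the `ep`/`cmd` placeholder values are made up for illustration; this is not calrissian code, only a model of which argv the container ends up starting with.

```python
from typing import List, Optional, Sequence


def effective_argv(
    image_entrypoint: Sequence[str],
    image_cmd: Sequence[str],
    pod_command: Optional[Sequence[str]] = None,
    pod_args: Optional[Sequence[str]] = None,
) -> List[str]:
    """Model of how a pod's command/args combine with the image defaults."""
    if pod_command:
        # command replaces the image ENTRYPOINT, and the image CMD is ignored too.
        return list(pod_command) + list(pod_args or [])
    if pod_args:
        # Only args set: the image ENTRYPOINT is kept and args replace CMD.
        return list(image_entrypoint) + list(pod_args)
    # Neither set: the image defaults apply.
    return list(image_entrypoint) + list(image_cmd)


# Placeholder image defaults ("ep" and "cmd" are hypothetical values).
both_set = effective_argv(["ep"], ["cmd"], ["/bin/sh", "-c"], ["tool string"])
args_only = effective_argv(["ep"], ["cmd"], None, ["tool string"])
print(both_set)   # the entrypoint "ep" no longer appears
print(args_only)  # the entrypoint "ep" is preserved
```

With both `command` and `args` set, as in the pod above, the image's entrypoint never runs.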

dleehr commented 5 years ago

At first glance it might seem like we could straighten this out cleanly, but as mentioned in the comments, we need to provide the command as a space-separated string, and it needs to be an argument to a shell. Otherwise we cannot easily redirect output inside the container.

It might work to push ['/bin/sh', '-c'] into the args and leave container_command totally empty.
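
For reference, the single string seen in the pod's args can be reproduced with ordinary shell quoting. This is a minimal sketch using Python's `shlex.quote`; it assumes the tool's command line is `['-c', "echo 'moo' > cow"]` (inferred from the args in the pod YAML, not copied from the tool file), and `to_shell_string` is a hypothetical helper, not the function calrissian uses.

```python
import shlex
from typing import Sequence


def to_shell_string(command_line: Sequence[str]) -> str:
    """Quote each argument and join with spaces so a shell can re-split it."""
    return " ".join(shlex.quote(part) for part in command_line)


# Assumed command line for the tool, inferred from the pod's args.
tool_command_line = ["-c", "echo 'moo' > cow"]
quoted = to_shell_string(tool_command_line)
print(quoted)

# The shell-side split recovers the original arguments.
print(shlex.split(quoted))
```

The quoted string starts with `-c`, so when it is handed to `/bin/sh -c` the shell treats `-c` as the program to run rather than as an option to the image's entrypoint.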

dleehr commented 5 years ago

I tried removing the command from the container spec, and pushing ['/bin/sh', '-c'] to the front of args. This stops us from stomping on the entrypoint, and doesn't fail any additional conformance tests. But 173 still fails because calrissian inserts /bin/sh -c in between the entrypoint and the tool command:

  containers:
  - args:
    - /bin/sh
    - -c
    - -c 'echo '"'"'moo'"'"' > cow'

Seems like there's no clean/simple way out of this. We wrap commands in a shell (sh -c) because that's the only way to set up separate redirection of STDOUT and STDERR in Kubernetes (see b956f7ebf18b24266670d366c366e11836239e2c and 09261991d08894d0dcd57e7a362beeb498c1b213). That approach works in most cases, but not where the entrypoint and the command can't be separated.

One path would be to inspect the image and incorporate its entrypoint, but that's not a terribly robust solution. I think the only clean solution here would be if we had another way to redirect without using a shell. That would let us simplify args down to just the Job's command_line.
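
To show what the first (less robust) path might look like, here is a rough sketch that prepends an already-known entrypoint to the tool command inside the shell-wrapped string, keeping the redirection. Everything here is hypothetical: `build_command_and_args` is an illustrative name, the `docker-entrypoint.sh` value is a placeholder, and looking up the real entrypoint from the image is not shown.

```python
import shlex
from typing import List, Optional, Sequence, Tuple


def build_command_and_args(
    image_entrypoint: Sequence[str],
    command_line: Sequence[str],
    stdout_path: Optional[str] = None,
    stderr_path: Optional[str] = None,
) -> Tuple[List[str], List[str]]:
    """Return (command, args) for a container spec, re-adding the entrypoint by hand."""
    # Quote the entrypoint and the tool arguments so the shell re-splits them intact.
    parts = [shlex.quote(p) for p in list(image_entrypoint) + list(command_line)]
    # Redirections stay unquoted operators, which is why the shell wrapper is needed.
    if stdout_path:
        parts += [">", shlex.quote(stdout_path)]
    if stderr_path:
        parts += ["2>", shlex.quote(stderr_path)]
    return ["/bin/sh", "-c"], [" ".join(parts)]


# Placeholder entrypoint; a real implementation would have to read it from the image.
command, args = build_command_and_args(
    ["docker-entrypoint.sh"],
    ["-c", "echo 'moo' > cow"],
    stdout_path="out.txt",
)
print(command)
print(args[0])
```

This still overrides the image's ENTRYPOINT at the pod level and only works if the inspected value is accurate and the image ships /bin/sh, which is part of why it is not robust.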