DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.
Apache License 2.0

Error creating container while deploying on a Kubernetes cluster #3405

Closed Andreja28 closed 2 years ago

Andreja28 commented 3 years ago

I am trying to deploy a CWL workflow on a Kubernetes cluster. I have followed the officially provided instructions, focusing on Option 2: Running the Leader Outside Kubernetes. The workflow I tried is very straightforward:

cwlVersion: v1.0
class: CommandLineTool
baseCommand: echo
requirements:
  - class: ResourceRequirement
    coresMin: 1
    ramMin: 1024

stdout: output.txt
inputs:
  message:
    type: string
    inputBinding:
      position: 1
outputs:
  output:
    type: stdout

I ran the workflow using the command: toil-cwl-runner --jobStore aws:us-west-2:toil-cwl-example --batchSystem kubernetes --realTimeLogging --logInfo workflow.cwl inputs.yaml
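The inputs.yaml file is not shown in the report; for this workflow it only needs to bind the single `message` string input, so an assumed minimal example would be:

```yaml
# Hypothetical inputs.yaml matching the workflow's single "message" input
message: Hello world
```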

AWS works okay, meaning the job store is created, but since execution crashes, the job store is never deleted.

After the pod is created, the container is never successfully created. At first there seemed to be an error pulling the quay.io/ucsc_cgl/toil image, but even after pulling the image manually, the problem persists.

Here is an error log from toil-cwl-runner:

andra-ThinkPad-E550 2021-01-05 11:48:25,934 MainThread INFO botocore.credentials: Found credentials in shared credentials file: ~/.aws/credentials
andra-ThinkPad-E550 2021-01-05 11:48:26,011 MainThread INFO botocore.credentials: Found credentials in shared credentials file: ~/.aws/credentials
andra-ThinkPad-E550 2021-01-05 11:48:26,034 MainThread INFO botocore.credentials: Found credentials in shared credentials file: ~/.aws/credentials
andra-ThinkPad-E550 2021-01-05 11:48:37,118 MainThread INFO cwltool: Resolved 'workflow.cwl' to 'file:///home/andra/Desktop/echo/workflow.cwl'
andra-ThinkPad-E550 2021-01-05 11:48:38,758 MainThread WARNING toil.batchSystems.singleMachine: Limiting maxMemory to physically available memory (8071512064).
andra-ThinkPad-E550 2021-01-05 11:48:38,758 MainThread WARNING toil.batchSystems.singleMachine: Limiting maxDisk to physically available disk (436593963008).
andra-ThinkPad-E550 2021-01-05 11:48:38,780 MainThread INFO toil: Using default docker registry of quay.io/ucsc_cgl as TOIL_DOCKER_REGISTRY is not set.
andra-ThinkPad-E550 2021-01-05 11:48:38,780 MainThread INFO toil: Using default docker name of toil as TOIL_DOCKER_NAME is not set.
andra-ThinkPad-E550 2021-01-05 11:48:38,780 MainThread INFO toil: Overriding docker appliance of quay.io/ucsc_cgl/toil:3.24.0-de586251cb579bcb80eef435825cb3cedc202f52-py2.7 with quay.io/ucsc_cgl/toil:latest from TOIL_APPLIANCE_SELF.
andra-ThinkPad-E550 2021-01-05 11:48:47,269 MainThread INFO toil: Running Toil version 3.24.0-de586251cb579bcb80eef435825cb3cedc202f52.
andra-ThinkPad-E550 2021-01-05 11:48:47,270 MainThread INFO toil.realtimeLogger: Starting real-time logging.
andra-ThinkPad-E550 2021-01-05 11:48:47,318 MainThread INFO toil.leader: Issued job 'file:///home/andra/Desktop/echo/workflow.cwl' echo 783059e8-1cce-4b9f-a701-3f9b46d236d6 with job batch system ID: 0 and cores: 1, disk: 2.0 G, and memory: 1.0 G
andra-ThinkPad-E550 2021-01-05 11:48:50,841 MainThread INFO toil.realtimeLogger: Stopping real-time logging server.
andra-ThinkPad-E550 2021-01-05 11:48:51,279 MainThread INFO toil.realtimeLogger: Joining real-time logging server thread.
Traceback (most recent call last):
  File "/usr/bin/toil-cwl-runner", line 11, in <module>
    load_entry_point('toil==3.24.0', 'console_scripts', 'toil-cwl-runner')()
  File "/usr/lib/python3/dist-packages/toil/cwl/cwltoil.py", line 1357, in main
    result = toil.start(wf1)
  File "/usr/lib/python3/dist-packages/toil/common.py", line 800, in start
    return self._runMainLoop(rootJobGraph)
  File "/usr/lib/python3/dist-packages/toil/common.py", line 1065, in _runMainLoop
    return Leader(config=self.config,
  File "/usr/lib/python3/dist-packages/toil/leader.py", line 212, in run
    self.innerLoop()
  File "/usr/lib/python3/dist-packages/toil/leader.py", line 541, in innerLoop
    updatedJobTuple = self.batchSystem.getUpdatedBatchJob(maxWait=2)
  File "/usr/lib/python3/dist-packages/toil/batchSystems/kubernetes.py", line 547, in getUpdatedBatchJob
    result = self._getUpdatedBatchJobImmediately()
  File "/usr/lib/python3/dist-packages/toil/batchSystems/kubernetes.py", line 686, in _getUpdatedBatchJobImmediately
    if self._isPodStuckOOM(pod):
  File "/usr/lib/python3/dist-packages/toil/batchSystems/kubernetes.py", line 493, in _isPodStuckOOM
    response = self._api('customObjects').list_namespaced_custom_object('metrics.k8s.io', 'v1beta1',
  File "/home/andra/.local/lib/python3.8/site-packages/kubernetes/client/apis/custom_objects_api.py", line 1442, in list_namespaced_custom_object
    (data) = self.list_namespaced_custom_object_with_http_info(group, version, namespace, plural, **kwargs)
  File "/home/andra/.local/lib/python3.8/site-packages/kubernetes/client/apis/custom_objects_api.py", line 1541, in list_namespaced_custom_object_with_http_info
    return self.api_client.call_api('/apis/{group}/{version}/namespaces/{namespace}/{plural}', 'GET',
  File "/home/andra/.local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 340, in call_api
    return self.__call_api(resource_path, method,
  File "/home/andra/.local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 173, in __call_api
    response_data = self.request(method, url,
  File "/home/andra/.local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 361, in request
    return self.rest_client.GET(url,
  File "/home/andra/.local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 227, in GET
    return self.request("GET", url,
  File "/home/andra/.local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 222, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'text/plain; charset=utf-8', 'X-Content-Type-Options': 'nosniff', 'X-Kubernetes-Pf-Flowschema-Uid': '875982f5-62b0-4778-83cf-5d199c0ded3f', 'X-Kubernetes-Pf-Prioritylevel-Uid': '27c50c6b-fb5e-4eef-afe5-b8bac77e7770', 'Date': 'Tue, 05 Jan 2021 10:48:47 GMT', 'Content-Length': '19'})
HTTP response body: 404 page not found

Also, what I find weird is that multiple successive executions of the workflow produce different events on Kubernetes. First run:

default     7m58s       Normal    SuccessfulCreate         job/andra-toil-a389aa55-f915-431e-925f-5183d838b78e-0         Created pod: andra-toil-a389aa55-f915-431e-925f-5183d838b78e-0-xjh4m
default     7m55s       Normal    Scheduled                pod/andra-toil-a389aa55-f915-431e-925f-5183d838b78e-0-xjh4m   Successfully assigned default/andra-toil-a389aa55-f915-431e-925f-5183d838b78e-0-xjh4m to minikube
default     7m42s       Warning   FailedCreatePodSandBox   pod/andra-toil-a389aa55-f915-431e-925f-5183d838b78e-0-xjh4m   Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "andra-toil-a389aa55-f915-431e-925f-5183d838b78e-0-xjh4m": Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: time="2021-01-05T10:41:26Z" level=fatal msg="no such file or directory": unknown

Second run:

default     22s         Normal    SuccessfulCreate         job/andra-toil-5dc28f48-7aa1-4297-b7c3-ed04b79d3950-0         Created pod: andra-toil-5dc28f48-7aa1-4297-b7c3-ed04b79d3950-0-q9s9q
default     22s         Normal    Scheduled                pod/andra-toil-5dc28f48-7aa1-4297-b7c3-ed04b79d3950-0-q9s9q   Successfully assigned default/andra-toil-5dc28f48-7aa1-4297-b7c3-ed04b79d3950-0-q9s9q to minikube
default     18s         Normal    Pulling                  pod/andra-toil-5dc28f48-7aa1-4297-b7c3-ed04b79d3950-0-q9s9q   Pulling image "quay.io/ucsc_cgl/toil:latest"
default     11s         Normal    Pulled                   pod/andra-toil-5dc28f48-7aa1-4297-b7c3-ed04b79d3950-0-q9s9q   Successfully pulled image "quay.io/ucsc_cgl/toil:latest" in 7.01505215s
default     11s         Warning   Failed                   pod/andra-toil-5dc28f48-7aa1-4297-b7c3-ed04b79d3950-0-q9s9q   Error: cannot find volume "tmp" to mount into container "runner-container"

Issue is synchronized with this Jira Task. Issue Number: TOIL-771

adamnovak commented 3 years ago

It looks like you might have multiple problems happening.

Here's a line from your log:

andra-ThinkPad-E550 2021-01-05 11:48:38,780 MainThread INFO toil: Overriding docker appliance of quay.io/ucsc_cgl/toil:3.24.0-de586251cb579bcb80eef435825cb3cedc202f52-py2.7 with quay.io/ucsc_cgl/toil:latest from TOIL_APPLIANCE_SELF.

That looks like you're running on Python 2.7. Toil 3.24 still shipped a Python 2.7 build, and quay.io/ucsc_cgl/toil:3.24.0-de586251cb579bcb80eef435825cb3cedc202f52-py2.7 ought to exist and even work, but quay.io/ucsc_cgl/toil:latest is going to point to a random container for a random Toil version on a random minor version of Python 3 (depending on the order they upload in), and is almost certainly not going to work with Toil 3.24 on Python 2.7. I just pulled quay.io/ucsc_cgl/toil:3.24.0-de586251cb579bcb80eef435825cb3cedc202f52-py2.7 and it pulled fine, so you should leave TOIL_APPLIANCE_SELF unset or use TOIL_APPLIANCE_SELF=quay.io/ucsc_cgl/toil:3.24.0-de586251cb579bcb80eef435825cb3cedc202f52-py2.7. Or, even better, upgrade to the latest Toil release on Python 3. Toil 3.24's Kubernetes code is almost a year old, and we've fixed some pretty big bugs since then, although none that I can recall that look quite like your issue.
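Concretely, pinning the appliance to the matching build before launching would look like the following sketch (the image tag is copied from the log above; the runner invocation repeats the original report's command):

```shell
# Pin the Toil appliance image to the build matching the installed Toil 3.24.0
# (tag taken from the log above), instead of the mismatched :latest tag.
export TOIL_APPLIANCE_SELF=quay.io/ucsc_cgl/toil:3.24.0-de586251cb579bcb80eef435825cb3cedc202f52-py2.7

# Then re-run the workflow (command from the original report):
# toil-cwl-runner --jobStore aws:us-west-2:toil-cwl-example --batchSystem kubernetes \
#     --realTimeLogging --logInfo workflow.cwl inputs.yaml
```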

What problem did you run into pulling quay.io/ucsc_cgl/toil:3.24.0-de586251cb579bcb80eef435825cb3cedc202f52-py2.7?

However, even with the wrong appliance container, the leader should run through and observe all its jobs failing when they start up and can't understand the commands it sent them. Instead, the jobs aren't even starting, and the leader is crashing.

The stack trace looks like you might have a Kubernetes cluster that doesn't have a working metrics service. Can you run kubectl top nodes and see information about the load on the nodes? Toil relies on the metrics service to work around some conditions we see on our infrastructure where pods get stuck and aren't killed when they run out of memory.
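A quick way to check for the metrics service is sketched below; it assumes kubectl is configured for the cluster, and is guarded so it skips gracefully where kubectl is not installed:

```shell
# Check whether the cluster exposes the metrics API (metrics.k8s.io/v1beta1)
# that Toil's stuck-OOM workaround queries.
if command -v kubectl >/dev/null 2>&1; then
    # Should report Available=True if a metrics server is running:
    kubectl get apiservice v1beta1.metrics.k8s.io || echo "metrics API not registered"
    # Fails with an error if the metrics server is absent:
    kubectl top nodes || echo "metrics server not responding"
    # On minikube, the metrics server ships as an addon:
    # minikube addons enable metrics-server
fi
```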

I have some code in https://github.com/DataBiosphere/toil/pull/3357/files#diff-9703248ad4bc5dd9a5dde18161239f4759a2e0e079094d5cdbab7d99b7c592abR639 to make the metrics service optional, but it hasn't been merged in yet.

The part about not being able to mount the "tmp" volume is weird. We did have a volume called "tmp" in Toil 3.24, but as far as I can tell we attached it properly.

Can you do kubectl get -o yaml pod YOUR_POD and/or kubectl get -o yaml job YOUR_JOB, so we can check over the YAML descriptions of the broken pods/jobs? Then we could see if the problem is that Toil is asking for impossible things, or that your Kubernetes cluster somehow can't deliver what it is asking for.

alexiswl commented 3 years ago

I've also just had this issue running on AWS ParallelCluster with an FSx shared file system. For me it seems to be a sporadic error; I am rerunning the workflow to confirm:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: rootfs_linux.go:59: mounting "/fsx/toil/workdir/node-dc905bd4-6c39-42a2-973c-975386eb3055-9abb8c7c-553a-457a-a5fe-30930c90b9b7/tmp3ctqh349/3f9875bd-eb53-46c8-8d5e-d85b551142c4/tmp27dznulj.tmp" to rootfs at "/var/lib/docker/overlay2/484985ec9ec46423189ba57ad1d0d36cbe12812ab712cdd956fc35717a806c32/merged/var/lib/cwl/stgd1619120-e402-4f9c-8325-af079bfda3de/SBJ00689_cobalt/cobalt.version" caused: open /var/lib/docker/overlay2/484985ec9ec46423189ba57ad1d0d36cbe12812ab712cdd956fc35717a806c32/merged/var/lib/cwl/stgd1619120-e402-4f9c-8325-af079bfda3de/SBJ00689_cobalt/cobalt.version: read-only file system: unknown

Toil attempted to re-run the job but got the same failure.

Here is the docker command that was run:

docker \
            run \
            -i \
            --mount=type=bind,source=/fsx/toil/workdir/node-dc905bd4-6c39-42a2-973c-975386eb3055-9abb8c7c-553a-457a-a5fe-30930c90b9b7/tmp3ctqh349/3f9875bd-eb53-46c8-8d5e-d85b551142c4/t36b81keu/tmp-out3igjqd7i,target=/EWRQKr \
            --mount=type=bind,source=/fsx/toil/workdir/node-dc905bd4-6c39-42a2-973c-975386eb3055-9abb8c7c-553a-457a-a5fe-30930c90b9b7/tmp3ctqh349/3f9875bd-eb53-46c8-8d5e-d85b551142c4/t_nx8tmenguksldxk,target=/tmp \
            --mount=type=bind,source=/fsx/toil/workdir/node-dc905bd4-6c39-42a2-973c-975386eb3055-fe16cbba-4a5e-4095-9612-8b77834d1275/tmpdnh7jbo7/2b752ad1-1434-4d21-9409-fce17e04a218/t2ggbkkoa/out/SBJ00689_amber,target=/var/lib/cwl/stg9ceede0b-9b6a-41d8-b2bc-c62adb8d2263/SBJ00689_amber,readonly \
            --mount=type=bind,source=/fsx/toil/workdir/node-dc905bd4-6c39-42a2-973c-975386eb3055-9abb8c7c-553a-457a-a5fe-30930c90b9b7/tmp3ctqh349/3f9875bd-eb53-46c8-8d5e-d85b551142c4/tmpcqeagdnp.tmp,target=/var/lib/cwl/stg9ceede0b-9b6a-41d8-b2bc-c62adb8d2263/SBJ00689_amber/SBJ00689_MDX210034_L2100109.amber.baf.tsv,readonly \
            --mount=type=bind,source=/fsx/toil/workdir/node-dc905bd4-6c39-42a2-973c-975386eb3055-9abb8c7c-553a-457a-a5fe-30930c90b9b7/tmp3ctqh349/3f9875bd-eb53-46c8-8d5e-d85b551142c4/tmpcq2aqe4m.tmp,target=/var/lib/cwl/stg9ceede0b-9b6a-41d8-b2bc-c62adb8d2263/SBJ00689_amber/SBJ00689_MDX210034_L2100109.amber.contamination.vcf.gz,readonly \
            --mount=type=bind,source=/fsx/toil/workdir/node-dc905bd4-6c39-42a2-973c-975386eb3055-9abb8c7c-553a-457a-a5fe-30930c90b9b7/tmp3ctqh349/3f9875bd-eb53-46c8-8d5e-d85b551142c4/tmpp25wz1nu.tmp,target=/var/lib/cwl/stg9ceede0b-9b6a-41d8-b2bc-c62adb8d2263/SBJ00689_amber/SBJ00689_MDX210034_L2100109.amber.baf.vcf.gz.tbi,readonly \
            --mount=type=bind,source=/fsx/toil/workdir/node-dc905bd4-6c39-42a2-973c-975386eb3055-9abb8c7c-553a-457a-a5fe-30930c90b9b7/tmp3ctqh349/3f9875bd-eb53-46c8-8d5e-d85b551142c4/tmp512wws2u.tmp,target=/var/lib/cwl/stg9ceede0b-9b6a-41d8-b2bc-c62adb8d2263/SBJ00689_amber/SBJ00689_MDX210034_L2100109.amber.baf.vcf.gz,readonly \
            --mount=type=bind,source=/fsx/toil/workdir/node-dc905bd4-6c39-42a2-973c-975386eb3055-9abb8c7c-553a-457a-a5fe-30930c90b9b7/tmp3ctqh349/3f9875bd-eb53-46c8-8d5e-d85b551142c4/tmpd2zwfvyi.tmp,target=/var/lib/cwl/stg9ceede0b-9b6a-41d8-b2bc-c62adb8d2263/SBJ00689_amber/SBJ00689_MDX210034_L2100109.amber.contamination.tsv,readonly \
            --mount=type=bind,source=/fsx/toil/workdir/node-dc905bd4-6c39-42a2-973c-975386eb3055-9abb8c7c-553a-457a-a5fe-30930c90b9b7/tmp3ctqh349/3f9875bd-eb53-46c8-8d5e-d85b551142c4/tmpl1t4zk_1.tmp,target=/var/lib/cwl/stg9ceede0b-9b6a-41d8-b2bc-c62adb8d2263/SBJ00689_amber/SBJ00689_MDX210033_L2100108.amber.snp.vcf.gz,readonly \
            --mount=type=bind,source=/fsx/toil/workdir/node-dc905bd4-6c39-42a2-973c-975386eb3055-9abb8c7c-553a-457a-a5fe-30930c90b9b7/tmp3ctqh349/3f9875bd-eb53-46c8-8d5e-d85b551142c4/tmpv0tm61fc.tmp,target=/var/lib/cwl/stg9ceede0b-9b6a-41d8-b2bc-c62adb8d2263/SBJ00689_amber/SBJ00689_MDX210034_L2100109.amber.qc,readonly \
            --mount=type=bind,source=/fsx/toil/workdir/node-dc905bd4-6c39-42a2-973c-975386eb3055-9abb8c7c-553a-457a-a5fe-30930c90b9b7/tmp3ctqh349/3f9875bd-eb53-46c8-8d5e-d85b551142c4/tmp1_ln026y.tmp,target=/var/lib/cwl/stg9ceede0b-9b6a-41d8-b2bc-c62adb8d2263/SBJ00689_amber/SBJ00689_MDX210034_L2100109.amber.baf.pcf,readonly \
            --mount=type=bind,source=/fsx/toil/workdir/node-dc905bd4-6c39-42a2-973c-975386eb3055-9abb8c7c-553a-457a-a5fe-30930c90b9b7/tmp3ctqh349/3f9875bd-eb53-46c8-8d5e-d85b551142c4/tmpw5cwakgx.tmp,target=/var/lib/cwl/stg9ceede0b-9b6a-41d8-b2bc-c62adb8d2263/SBJ00689_amber/SBJ00689_MDX210034_L2100109.amber.contamination.vcf.gz.tbi,readonly \
            --mount=type=bind,source=/fsx/toil/workdir/node-dc905bd4-6c39-42a2-973c-975386eb3055-9abb8c7c-553a-457a-a5fe-30930c90b9b7/tmp3ctqh349/3f9875bd-eb53-46c8-8d5e-d85b551142c4/tmpwxv8iffk.tmp,target=/var/lib/cwl/stg9ceede0b-9b6a-41d8-b2bc-c62adb8d2263/SBJ00689_amber/amber.version,readonly \
            --mount=type=bind,source=/fsx/toil/workdir/node-dc905bd4-6c39-42a2-973c-975386eb3055-9abb8c7c-553a-457a-a5fe-30930c90b9b7/tmp3ctqh349/3f9875bd-eb53-46c8-8d5e-d85b551142c4/tmp04yaf3c5.tmp,target=/var/lib/cwl/stg9ceede0b-9b6a-41d8-b2bc-c62adb8d2263/SBJ00689_amber/SBJ00689_MDX210033_L2100108.amber.snp.vcf.gz.tbi,readonly \
            --mount=type=bind,source=/fsx/toil/workdir/node-dc905bd4-6c39-42a2-973c-975386eb3055-c31adea0-f156-4be5-b0ad-f56560cfa2f0/tmp95hfcvs1/ea1a0165-aec2-476a-8000-72ec3541db25/tz5zedz2s/out/SBJ00689_cobalt,target=/var/lib/cwl/stgd1619120-e402-4f9c-8325-af079bfda3de/SBJ00689_cobalt,readonly \
            --mount=type=bind,source=/fsx/toil/workdir/node-dc905bd4-6c39-42a2-973c-975386eb3055-9abb8c7c-553a-457a-a5fe-30930c90b9b7/tmp3ctqh349/3f9875bd-eb53-46c8-8d5e-d85b551142c4/tmp5jpimyg3.tmp,target=/var/lib/cwl/stgd1619120-e402-4f9c-8325-af079bfda3de/SBJ00689_cobalt/SBJ00689_MDX210034_L2100109.cobalt.gc.median.tsv,readonly \
            --mount=type=bind,source=/fsx/toil/workdir/node-dc905bd4-6c39-42a2-973c-975386eb3055-9abb8c7c-553a-457a-a5fe-30930c90b9b7/tmp3ctqh349/3f9875bd-eb53-46c8-8d5e-d85b551142c4/tmp90ogyp5c.tmp,target=/var/lib/cwl/stgd1619120-e402-4f9c-8325-af079bfda3de/SBJ00689_cobalt/SBJ00689_MDX210033_L2100108.cobalt.gc.median.tsv,readonly \
            --mount=type=bind,source=/fsx/toil/workdir/node-dc905bd4-6c39-42a2-973c-975386eb3055-9abb8c7c-553a-457a-a5fe-30930c90b9b7/tmp3ctqh349/3f9875bd-eb53-46c8-8d5e-d85b551142c4/tmp27dznulj.tmp,target=/var/lib/cwl/stgd1619120-e402-4f9c-8325-af079bfda3de/SBJ00689_cobalt/cobalt.version,readonly \
            --mount=type=bind,source=/fsx/toil/workdir/node-dc905bd4-6c39-42a2-973c-975386eb3055-9abb8c7c-553a-457a-a5fe-30930c90b9b7/tmp3ctqh349/3f9875bd-eb53-46c8-8d5e-d85b551142c4/tmp5jpimyg3.tmp,target=/var/lib/cwl/stgd1619120-e402-4f9c-8325-af079bfda3de/SBJ00689_cobalt/SBJ00689_MDX210034_L2100109.cobalt.gc.median.tsv,readonly \
            --mount=type=bind,source=/fsx/toil/workdir/node-dc905bd4-6c39-42a2-973c-975386eb3055-9abb8c7c-553a-457a-a5fe-30930c90b9b7/tmp3ctqh349/3f9875bd-eb53-46c8-8d5e-d85b551142c4/tmp90ogyp5c.tmp,target=/var/lib/cwl/stgd1619120-e402-4f9c-8325-af079bfda3de/SBJ00689_cobalt/SBJ00689_MDX210033_L2100108.cobalt.gc.median.tsv,readonly \
            --mount=type=bind,source=/fsx/toil/workdir/node-dc905bd4-6c39-42a2-973c-975386eb3055-9abb8c7c-553a-457a-a5fe-30930c90b9b7/tmp3ctqh349/3f9875bd-eb53-46c8-8d5e-d85b551142c4/tmp27dznulj.tmp,target=/var/lib/cwl/stgd1619120-e402-4f9c-8325-af079bfda3de/SBJ00689_cobalt/cobalt.version,readonly \
            --mount=type=bind,source=/fsx/toil/workdir/node-dc905bd4-6c39-42a2-973c-975386eb3055-9abb8c7c-553a-457a-a5fe-30930c90b9b7/tmp3ctqh349/3f9875bd-eb53-46c8-8d5e-d85b551142c4/tmpvynnxw_l.tmp,target=/var/lib/cwl/stgd1619120-e402-4f9c-8325-af079bfda3de/SBJ00689_cobalt/SBJ00689_MDX210034_L2100109.cobalt.ratio.pcf,readonly \
            --mount=type=bind,source=/fsx/toil/workdir/node-dc905bd4-6c39-42a2-973c-975386eb3055-9abb8c7c-553a-457a-a5fe-30930c90b9b7/tmp3ctqh349/3f9875bd-eb53-46c8-8d5e-d85b551142c4/tmp9k2tepc_.tmp,target=/var/lib/cwl/stgd1619120-e402-4f9c-8325-af079bfda3de/SBJ00689_cobalt/SBJ00689_MDX210033_L2100108.cobalt.ratio.median.tsv,readonly \
            --mount=type=bind,source=/fsx/toil/workdir/node-dc905bd4-6c39-42a2-973c-975386eb3055-9abb8c7c-553a-457a-a5fe-30930c90b9b7/tmp3ctqh349/3f9875bd-eb53-46c8-8d5e-d85b551142c4/tmpmi9w3wy6.tmp,target=/var/lib/cwl/stgd1619120-e402-4f9c-8325-af079bfda3de/SBJ00689_cobalt/SBJ00689_MDX210034_L2100109.chr.len,readonly \
            --mount=type=bind,source=/fsx/toil/workdir/node-dc905bd4-6c39-42a2-973c-975386eb3055-9abb8c7c-553a-457a-a5fe-30930c90b9b7/tmp3ctqh349/3f9875bd-eb53-46c8-8d5e-d85b551142c4/tmpf626s4uc.tmp,target=/var/lib/cwl/stgd1619120-e402-4f9c-8325-af079bfda3de/SBJ00689_cobalt/SBJ00689_MDX210033_L2100108.cobalt.ratio.pcf,readonly \
            --mount=type=bind,source=/fsx/toil/workdir/node-dc905bd4-6c39-42a2-973c-975386eb3055-9abb8c7c-553a-457a-a5fe-30930c90b9b7/tmp3ctqh349/3f9875bd-eb53-46c8-8d5e-d85b551142c4/tmpwjnerpbz.tmp,target=/var/lib/cwl/stgd1619120-e402-4f9c-8325-af079bfda3de/SBJ00689_cobalt/SBJ00689_MDX210034_L2100109.cobalt.ratio.tsv,readonly \
            --mount=type=bind,source=/fsx/reference-data/hartwig-nextcloud/hg38_alt/dbs/gc/GC_profile.1000bp.cnp,target=/var/lib/cwl/stg3469bf1d-06dc-423b-bff3-1f58d7ce041a/GC_profile.1000bp.cnp,readonly \
            --mount=type=bind,source=/fsx/input-data/L2100109/SBJ00689__SBJ00689_MDX210034_L2100109-somatic.vcf.gz,target=/var/lib/cwl/stg5c24f3b3-b296-47ef-972b-61c925525096/SBJ00689__SBJ00689_MDX210034_L2100109-somatic.vcf.gz,readonly \
            --mount=type=bind,source=/fsx/input-data/L2100109/SBJ00689__SBJ00689_MDX210034_L2100109-somatic.vcf.gz.tbi,target=/var/lib/cwl/stg5c24f3b3-b296-47ef-972b-61c925525096/SBJ00689__SBJ00689_MDX210034_L2100109-somatic.vcf.gz.tbi,readonly \
            --mount=type=bind,source=/fsx/toil/workdir/node-dc905bd4-6c39-42a2-973c-975386eb3055-9abb8c7c-553a-457a-a5fe-30930c90b9b7/tmp3ctqh349/3f9875bd-eb53-46c8-8d5e-d85b551142c4/tmpmdg6q2oy.tmp,target=/var/lib/cwl/stg6ee2720f-4c43-4858-afa5-d586c7006cae/SBJ00689.gripss.somatic.filtered.vcf.gz,readonly \
            --mount=type=bind,source=/fsx/toil/workdir/node-dc905bd4-6c39-42a2-973c-975386eb3055-9abb8c7c-553a-457a-a5fe-30930c90b9b7/tmp3ctqh349/3f9875bd-eb53-46c8-8d5e-d85b551142c4/tmpycbq3yfa.tmp,target=/var/lib/cwl/stg6ee2720f-4c43-4858-afa5-d586c7006cae/SBJ00689.gripss.somatic.filtered.vcf.gz.tbi,readonly \
            --mount=type=bind,source=/fsx/toil/workdir/node-dc905bd4-6c39-42a2-973c-975386eb3055-9abb8c7c-553a-457a-a5fe-30930c90b9b7/tmp3ctqh349/3f9875bd-eb53-46c8-8d5e-d85b551142c4/tmp034r1gub.tmp,target=/var/lib/cwl/stg5dc8d902-e653-4e53-be40-b16e9d4d5852/SBJ00689.gripss.somatic.vcf.gz,readonly \
            --mount=type=bind,source=/fsx/toil/workdir/node-dc905bd4-6c39-42a2-973c-975386eb3055-9abb8c7c-553a-457a-a5fe-30930c90b9b7/tmp3ctqh349/3f9875bd-eb53-46c8-8d5e-d85b551142c4/tmphthoce64.tmp,target=/var/lib/cwl/stg5dc8d902-e653-4e53-be40-b16e9d4d5852/SBJ00689.gripss.somatic.vcf.gz.tbi,readonly \
            --mount=type=bind,source=/fsx/reference-data/hartwig-nextcloud/hg38_alt/refgenomes/hg38/hg38.fa,target=/EWRQKr/hg38.fa,readonly \
            --mount=type=bind,source=/fsx/reference-data/hartwig-nextcloud/hg38_alt/refgenomes/hg38/hg38.fa.fai,target=/EWRQKr/hg38.fa.fai,readonly \
            --mount=type=bind,source=/fsx/reference-data/hartwig-nextcloud/hg38_alt/refgenomes/hg38/hg38.dict,target=/EWRQKr/hg38.dict,readonly \
            --workdir=/EWRQKr \
            --read-only=true \
            --user=1000:1000 \
            --rm \
            --env=TMPDIR=/tmp \
            --env=HOME=/EWRQKr \
            --cidfile=/fsx/toil/workdir/node-dc905bd4-6c39-42a2-973c-975386eb3055-9abb8c7c-553a-457a-a5fe-30930c90b9b7/tmp3ctqh349/3f9875bd-eb53-46c8-8d5e-d85b551142c4/t_nx8tmensxbldsxl/20210318055323-203905.cid \
            quay.io/biocontainers/hmftools-purple:2.53--0 \
            PURPLE \
            -Xms2000m \
            -Xmx14000m \
            -amber \
            /var/lib/cwl/stg9ceede0b-9b6a-41d8-b2bc-c62adb8d2263/SBJ00689_amber \
            -cobalt \
            /var/lib/cwl/stgd1619120-e402-4f9c-8325-af079bfda3de/SBJ00689_cobalt \
            -gc_profile \
            /var/lib/cwl/stg3469bf1d-06dc-423b-bff3-1f58d7ce041a/GC_profile.1000bp.cnp \
            -output_dir \
            SBJ00689_purple \
            -ref_genome \
            /EWRQKr/hg38.fa \
            -reference \
            SBJ00689_MDX210033_L2100108 \
            -somatic_vcf \
            /var/lib/cwl/stg5c24f3b3-b296-47ef-972b-61c925525096/SBJ00689__SBJ00689_MDX210034_L2100109-somatic.vcf.gz \
            -structural_vcf \
            /var/lib/cwl/stg6ee2720f-4c43-4858-afa5-d586c7006cae/SBJ00689.gripss.somatic.filtered.vcf.gz \
            -sv_recovery_vcf \
            /var/lib/cwl/stg5dc8d902-e653-4e53-be40-b16e9d4d5852/SBJ00689.gripss.somatic.vcf.gz \
            -threads \
            2 \
            -tumor \
            SBJ00689_MDX210034_L2100109

Having a look at the offending file, it appears to be normal:

$ ls -lsrht /fsx/toil/workdir/node-dc905bd4-6c39-42a2-973c-975386eb3055-9abb8c7c-553a-457a-a5fe-30930c90b9b7/tmp3ctqh349/3f9875bd-eb53-46c8-8d5e-d85b551142c4/tmp27dznulj.tmp
13K -rwxrw-r-- 3 ec2-user ec2-user 40 Mar 18 00:26 /fsx/toil/workdir/node-dc905bd4-6c39-42a2-973c-975386eb3055-9abb8c7c-553a-457a-a5fe-30930c90b9b7/tmp3ctqh349/3f9875bd-eb53-46c8-8d5e-d85b551142c4/tmp27dznulj.tmp
$ cat /fsx/toil/workdir/node-dc905bd4-6c39-42a2-973c-975386eb3055-9abb8c7c-553a-457a-a5fe-30930c90b9b7/tmp3ctqh349/3f9875bd-eb53-46c8-8d5e-d85b551142c4/tmp27dznulj.tmp
version=1.10

adamnovak commented 2 years ago

@alexiswl I think your container is failing to be created for a completely different reason than @Andreja28's. You might want to try moving your Toil --workDir off of the shared filesystem and onto node-local storage, since it doesn't need to be shared between nodes.
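For example, a sketch of such an invocation is below; --workDir is the option referred to above, while the job store path, batch system, and node-local scratch path are assumptions for an AWS ParallelCluster setup (guarded so the sketch is runnable even where Toil is not installed):

```shell
# Sketch: keep Toil's per-node scratch off the shared /fsx filesystem by
# pointing --workDir at node-local storage. Paths and batch system are
# assumptions, not taken from the report above.
if command -v toil-cwl-runner >/dev/null 2>&1; then
    toil-cwl-runner \
        --jobStore /fsx/toil/jobstore \
        --workDir /tmp \
        --batchSystem slurm \
        workflow.cwl inputs.yaml
fi
```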

I'm going to close this issue out; I think the original issue reported here arose from trying to use multiple Toils in the same workflow, some of which didn't actually have working Kubernetes batch system implementations. @Andreja28 Please speak up if you're still seeing this issue in the current Toil release and its default TOIL_APPLIANCE_SELF.