broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition between one off use cases to massive scale production environments
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
996 stars 360 forks source link

Cromwell hangs on WaitingForReturnCode when running heavily scattered tasks #6946

Open aofarrel opened 2 years ago

aofarrel commented 2 years ago

I set up a heavily scattered (~1000x) workflow to run overnight on my local machine (Mac OS Catalina, Intel hardware). The machine is set up to never sleep and was connected to AC power. My Cromwell config is set to only run one task at a time, ie, only one shard of a scattered task runs at a time. The workflow stopped processing on shard 885, which was the 233rd shard to start (shards appear to start in a random order, that's not an issue).

It looks like the Docker container in question is getting created, but not used. The container is not running according to Docker Desktop and the Docker CLI tools (see output below).

Workflow

I've seen this happen with a few workflows, but this time around it's this one (failure is occurring on second task): https://github.com/aofarrel/SRANWRP/blob/bioproject_stuff/workflows/is_this_tuberculosis.wdl

Ruled out

Docker container logs

docker logs cf6f4828adc61eacf06337ce3caf2c110df6cc04937530a90bbfb0843acbb528 gives no output.

Entering the container

docker exec -it cf6f4828adc61eacf06337ce3caf2c110df6cc04937530a90bbfb0843acbb528 /bin/sh returns Error response from daemon: Container cf6f4828adc61eacf06337ce3caf2c110df6cc04937530a90bbfb0843acbb528 is not running

Docker inspect

>docker inspect cf6f4828adc61eacf06337ce3caf2c110df6cc04937530a90bbfb0843acbb528
[
    {
        "Id": "cf6f4828adc61eacf06337ce3caf2c110df6cc04937530a90bbfb0843acbb528",
        "Created": "2022-11-09T05:37:16.872195116Z",
        "Path": "/bin/bash",
        "Args": [
            "/bark-bark/IS_THIS_TUBERCULOSIS/7570b5dc-4714-48a7-96b2-9c62245e3618/call-get_organism_names/shard-885/execution/script"
        ],
        "State": {
            "Status": "created",
            "Running": false,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 0,
            "Error": "",
            "StartedAt": "0001-01-01T00:00:00Z",
            "FinishedAt": "0001-01-01T00:00:00Z"
        },
        "Image": "sha256:1c9072d6415cff6481014f64cf7486482dc61620bf09806bafffb1697bf344b1",
        "ResolvConfPath": "",
        "HostnamePath": "",
        "HostsPath": "",
        "LogPath": "/var/lib/docker/containers/cf6f4828adc61eacf06337ce3caf2c110df6cc04937530a90bbfb0843acbb528/cf6f4828adc61eacf06337ce3caf2c110df6cc04937530a90bbfb0843acbb528-json.log",
        "Name": "/zen_almeida",
        "RestartCount": 0,
        "Driver": "overlay2",
        "Platform": "linux",
        "MountLabel": "",
        "ProcessLabel": "",
        "AppArmorProfile": "",
        "ExecIDs": null,
        "HostConfig": {
            "Binds": [
                "/private/var/folders/vp/327wktbj3wqb65q3v3n8qpxc0000gn/T/1667969838761-0/dockstore-is-cool/IS_THIS_TUBERCULOSIS/7570b5dc-4714-48a7-96b2-9c62245e3618/call-get_organism_names/shard-885:/bark-bark/IS_THIS_TUBERCULOSIS/7570b5dc-4714-48a7-96b2-9c62245e3618/call-get_organism_names/shard-885"
            ],
            "ContainerIDFile": "",
            "LogConfig": {
                "Type": "json-file",
                "Config": {}
            },
            "NetworkMode": "default",
            "PortBindings": {},
            "RestartPolicy": {
                "Name": "no",
                "MaximumRetryCount": 0
            },
            "AutoRemove": true,
            "VolumeDriver": "",
            "VolumesFrom": null,
            "CapAdd": null,
            "CapDrop": null,
            "CgroupnsMode": "private",
            "Dns": [],
            "DnsOptions": [],
            "DnsSearch": [],
            "ExtraHosts": null,
            "GroupAdd": null,
            "IpcMode": "private",
            "Cgroup": "",
            "Links": null,
            "OomScoreAdj": 0,
            "PidMode": "",
            "Privileged": false,
            "PublishAllPorts": false,
            "ReadonlyRootfs": false,
            "SecurityOpt": null,
            "UTSMode": "",
            "UsernsMode": "",
            "ShmSize": 67108864,
            "Runtime": "runc",
            "ConsoleSize": [
                0,
                0
            ],
            "Isolation": "",
            "CpuShares": 0,
            "Memory": 0,
            "NanoCpus": 0,
            "CgroupParent": "",
            "BlkioWeight": 0,
            "BlkioWeightDevice": [],
            "BlkioDeviceReadBps": null,
            "BlkioDeviceWriteBps": null,
            "BlkioDeviceReadIOps": null,
            "BlkioDeviceWriteIOps": null,
            "CpuPeriod": 0,
            "CpuQuota": 0,
            "CpuRealtimePeriod": 0,
            "CpuRealtimeRuntime": 0,
            "CpusetCpus": "",
            "CpusetMems": "",
            "Devices": [],
            "DeviceCgroupRules": null,
            "DeviceRequests": null,
            "KernelMemory": 0,
            "KernelMemoryTCP": 0,
            "MemoryReservation": 0,
            "MemorySwap": 0,
            "MemorySwappiness": null,
            "OomKillDisable": false,
            "PidsLimit": null,
            "Ulimits": null,
            "CpuCount": 0,
            "CpuPercent": 0,
            "IOMaximumIOps": 0,
            "IOMaximumBandwidth": 0,
            "MaskedPaths": [
                "/proc/asound",
                "/proc/acpi",
                "/proc/kcore",
                "/proc/keys",
                "/proc/latency_stats",
                "/proc/timer_list",
                "/proc/timer_stats",
                "/proc/sched_debug",
                "/proc/scsi",
                "/sys/firmware"
            ],
            "ReadonlyPaths": [
                "/proc/bus",
                "/proc/fs",
                "/proc/irq",
                "/proc/sys",
                "/proc/sysrq-trigger"
            ]
        },
        "GraphDriver": {
            "Data": {
                "LowerDir": "/var/lib/docker/overlay2/aa7c784c947752f9736d649c2f7f1d1ff992e94a295a7d8b3281eb18b60192f0-init/diff:/var/lib/docker/overlay2/pi10oyxckwgkkcbtbhtopthgo/diff:/var/lib/docker/overlay2/ijx4ivzmz9j7r2z9sqnxxkfr2/diff:/var/lib/docker/overlay2/jv5021rm0ro1ncxdxk9z7z4z2/diff:/var/lib/docker/overlay2/l7ti7s4rs2dxrvvlh4xry72fn/diff:/var/lib/docker/overlay2/0ecrq5dfyezvcc19ewmxvgohh/diff:/var/lib/docker/overlay2/vpue5u7q3ouhkmqlyrwbg5n75/diff:/var/lib/docker/overlay2/eh958b0fn0cjj2btiaxfhimxg/diff:/var/lib/docker/overlay2/luqo2vxej2gtqb1i70juiilf3/diff:/var/lib/docker/overlay2/jn9pv5erse0hvoixil8mb2h0k/diff:/var/lib/docker/overlay2/trwrj89a2zln5c2wh9ci255nm/diff:/var/lib/docker/overlay2/8a43205cbfc58e029de1e272d9f65a3ce190c2dd009321376f08b51b8784cf2e/diff",
                "MergedDir": "/var/lib/docker/overlay2/aa7c784c947752f9736d649c2f7f1d1ff992e94a295a7d8b3281eb18b60192f0/merged",
                "UpperDir": "/var/lib/docker/overlay2/aa7c784c947752f9736d649c2f7f1d1ff992e94a295a7d8b3281eb18b60192f0/diff",
                "WorkDir": "/var/lib/docker/overlay2/aa7c784c947752f9736d649c2f7f1d1ff992e94a295a7d8b3281eb18b60192f0/work"
            },
            "Name": "overlay2"
        },
        "Mounts": [
            {
                "Type": "bind",
                "Source": "/private/var/folders/vp/327wktbj3wqb65q3v3n8qpxc0000gn/T/1667969838761-0/dockstore-is-cool/IS_THIS_TUBERCULOSIS/7570b5dc-4714-48a7-96b2-9c62245e3618/call-get_organism_names/shard-885",
                "Destination": "/bark-bark/IS_THIS_TUBERCULOSIS/7570b5dc-4714-48a7-96b2-9c62245e3618/call-get_organism_names/shard-885",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            }
        ],
        "Config": {
            "Hostname": "cf6f4828adc6",
            "Domainname": "",
            "User": "",
            "AttachStdin": true,
            "AttachStdout": true,
            "AttachStderr": true,
            "Tty": false,
            "OpenStdin": true,
            "StdinOnce": true,
            "Env": [
                "PATH=/root/miniconda3/bin:/bin:/root/edirect/:/bin/sratoolkit.3.0.0-ubuntu64/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
                "PERL5LIB=/perlstuff:"
            ],
            "Cmd": [
                "/bark-bark/IS_THIS_TUBERCULOSIS/7570b5dc-4714-48a7-96b2-9c62245e3618/call-get_organism_names/shard-885/execution/script"
            ],
            "Image": "ashedpotatoes/sranwrp@sha256:15a8875a024231bd0bf3b5cfc0c4fd4cdb654e9475cd44b2ca251c75dae7fee8",
            "Volumes": null,
            "WorkingDir": "",
            "Entrypoint": [
                "/bin/bash"
            ],
            "OnBuild": null,
            "Labels": {}
        },
        "NetworkSettings": {
            "Bridge": "",
            "SandboxID": "",
            "HairpinMode": false,
            "LinkLocalIPv6Address": "",
            "LinkLocalIPv6PrefixLen": 0,
            "Ports": {},
            "SandboxKey": "",
            "SecondaryIPAddresses": null,
            "SecondaryIPv6Addresses": null,
            "EndpointID": "",
            "Gateway": "",
            "GlobalIPv6Address": "",
            "GlobalIPv6PrefixLen": 0,
            "IPAddress": "",
            "IPPrefixLen": 0,
            "IPv6Gateway": "",
            "MacAddress": "",
            "Networks": {
                "bridge": {
                    "IPAMConfig": null,
                    "Links": null,
                    "Aliases": null,
                    "NetworkID": "",
                    "EndpointID": "",
                    "Gateway": "",
                    "IPAddress": "",
                    "IPPrefixLen": 0,
                    "IPv6Gateway": "",
                    "GlobalIPv6Address": "",
                    "GlobalIPv6PrefixLen": 0,
                    "MacAddress": "",
                    "DriverOpts": null
                }
            }
        }
    }
]

Terminal output (first couple shards trimmed of course)

[2022-11-08 21:36:56,76] [info] BackgroundConfigAsyncJobExecutionActor [7570b5dcIS_THIS_TUBERCULOSIS.get_organism_names:981:1]: executing: docker run \
  --rm -i \
   \
  --entrypoint /bin/bash \
  -v /private/var/folders/vp/327wktbj3wqb65q3v3n8qpxc0000gn/T/1667969838761-0/dockstore-is-cool/IS_THIS_TUBERCULOSIS/7570b5dc-4714-48a7-96b2-9c62245e3618/call-get_organism_names/shard-981:/bark-bark/IS_THIS_TUBERCULOSIS/7570b5dc-4714-48a7-96b2-9c62245e3618/call-get_organism_names/shard-981 \
  ashedpotatoes/sranwrp@sha256:15a8875a024231bd0bf3b5cfc0c4fd4cdb654e9475cd44b2ca251c75dae7fee8 /bark-bark/IS_THIS_TUBERCULOSIS/7570b5dc-4714-48a7-96b2-9c62245e3618/call-get_organism_names/shard-981/execution/script
[2022-11-08 21:37:01,60] [info] BackgroundConfigAsyncJobExecutionActor [7570b5dcIS_THIS_TUBERCULOSIS.get_organism_names:981:1]: job id: 3872
[2022-11-08 21:37:01,60] [info] BackgroundConfigAsyncJobExecutionActor [7570b5dcIS_THIS_TUBERCULOSIS.get_organism_names:981:1]: Status change from - to Done
[2022-11-08 21:37:06,75] [info] Assigned new job execution tokens to the following groups: 7570b5dc: 1
[2022-11-08 21:37:06,76] [warn] BackgroundConfigAsyncJobExecutionActor [7570b5dcIS_THIS_TUBERCULOSIS.get_organism_names:727:1]: Unrecognized runtime attribute keys: preemptible, disks, cpu, memory
[2022-11-08 21:37:06,76] [info] BackgroundConfigAsyncJobExecutionActor [7570b5dcIS_THIS_TUBERCULOSIS.get_organism_names:727:1]: esearch -db sra -query SAMEA3697006 | \
    esummary | \
    xtract -pattern DocumentSummary -element Run@acc,Organism@taxid,Organism@ScientificName >> SAMEA3697006_organisms.txt
[2022-11-08 21:37:06,76] [info] BackgroundConfigAsyncJobExecutionActor [7570b5dcIS_THIS_TUBERCULOSIS.get_organism_names:727:1]: executing: docker run \
  --rm -i \
   \
  --entrypoint /bin/bash \
  -v /private/var/folders/vp/327wktbj3wqb65q3v3n8qpxc0000gn/T/1667969838761-0/dockstore-is-cool/IS_THIS_TUBERCULOSIS/7570b5dc-4714-48a7-96b2-9c62245e3618/call-get_organism_names/shard-727:/bark-bark/IS_THIS_TUBERCULOSIS/7570b5dc-4714-48a7-96b2-9c62245e3618/call-get_organism_names/shard-727 \
  ashedpotatoes/sranwrp@sha256:15a8875a024231bd0bf3b5cfc0c4fd4cdb654e9475cd44b2ca251c75dae7fee8 /bark-bark/IS_THIS_TUBERCULOSIS/7570b5dc-4714-48a7-96b2-9c62245e3618/call-get_organism_names/shard-727/execution/script
[2022-11-08 21:37:11,60] [info] BackgroundConfigAsyncJobExecutionActor [7570b5dcIS_THIS_TUBERCULOSIS.get_organism_names:727:1]: job id: 3886
[2022-11-08 21:37:11,60] [info] BackgroundConfigAsyncJobExecutionActor [7570b5dcIS_THIS_TUBERCULOSIS.get_organism_names:727:1]: Status change from - to Done
[2022-11-08 21:37:16,76] [info] Assigned new job execution tokens to the following groups: 7570b5dc: 1
[2022-11-08 21:37:16,76] [warn] BackgroundConfigAsyncJobExecutionActor [7570b5dcIS_THIS_TUBERCULOSIS.get_organism_names:885:1]: Unrecognized runtime attribute keys: preemptible, disks, cpu, memory
[2022-11-08 21:37:16,76] [info] BackgroundConfigAsyncJobExecutionActor [7570b5dcIS_THIS_TUBERCULOSIS.get_organism_names:885:1]: esearch -db sra -query SAMEA4695109 | \
    esummary | \
    xtract -pattern DocumentSummary -element Run@acc,Organism@taxid,Organism@ScientificName >> SAMEA4695109_organisms.txt
[2022-11-08 21:37:16,76] [info] BackgroundConfigAsyncJobExecutionActor [7570b5dcIS_THIS_TUBERCULOSIS.get_organism_names:885:1]: executing: docker run \
  --rm -i \
   \
  --entrypoint /bin/bash \
  -v /private/var/folders/vp/327wktbj3wqb65q3v3n8qpxc0000gn/T/1667969838761-0/dockstore-is-cool/IS_THIS_TUBERCULOSIS/7570b5dc-4714-48a7-96b2-9c62245e3618/call-get_organism_names/shard-885:/bark-bark/IS_THIS_TUBERCULOSIS/7570b5dc-4714-48a7-96b2-9c62245e3618/call-get_organism_names/shard-885 \
  ashedpotatoes/sranwrp@sha256:15a8875a024231bd0bf3b5cfc0c4fd4cdb654e9475cd44b2ca251c75dae7fee8 /bark-bark/IS_THIS_TUBERCULOSIS/7570b5dc-4714-48a7-96b2-9c62245e3618/call-get_organism_names/shard-885/execution/script
[2022-11-08 21:37:21,60] [info] BackgroundConfigAsyncJobExecutionActor [7570b5dcIS_THIS_TUBERCULOSIS.get_organism_names:885:1]: job id: 3899
[2022-11-08 21:37:21,60] [info] BackgroundConfigAsyncJobExecutionActor [7570b5dcIS_THIS_TUBERCULOSIS.get_organism_names:885:1]: Status change from - to WaitingForReturnCode
aednichols commented 2 years ago

I 1000% admire the hustle here but I worry you are exceeding the reasonably expected capabilities of Docker Desktop and the Cromwell local backend. As a first step, can you tell your Docker is still functional and accepting new work? When the task requested by Cromwell is frozen, does "hello world" run in a separate terminal complete correctly?

aofarrel commented 2 years ago

I already ruled that out -- Docker itself is still functional. I didn't run hello world specifically, but I did was able to docker run --it and run a few commands. This is different from the behavior I see when I do not set the concurrent job limit to 1 on a local backend -- in that scenario I wouldn't be able to run any images at all, and need to forcibly quit + restart Docker to use it.

For comparison, I ran the same WDL with the same inputs in miniwdl to see if it'd also get stuck, but it did not have this issue. miniwdl was able to complete the 1000x scattered task + the final task that gathers the scattered input. So it seems that Docker itself can handle launching a thousand containers one at a time.