argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Running workflow results in wait container "no space left on device" #1128

Closed marcel-dempers closed 5 years ago

marcel-dempers commented 5 years ago

Is this a BUG REPORT or FEATURE REQUEST?: BUG

What happened: When running a workflow, a step fails. The wait container throws an error, but the main container still runs successfully. The wait container's logs show:

kubectl logs -c wait basic-etl-zgd7c-2488084890
time="2018-12-12T22:29:07Z" level=info msg="Creating a docker executor"
time="2018-12-12T22:29:07Z" level=info msg="Executor (version: v2.2.1, build_date: 2018-10-11T16:27:29Z) initialized with template:\narchiveLocation:\n  s3:\n    accessKeySecret:\n      key: minio_access_key\n      name: minio-credential\n    bucket: my-bucket\n    endpoint: minio.argo:9000\n    insecure: true\n    key: basic-etl-zgd7c/basic-etl-zgd7c-2488084890\n    secretKeySecret:\n      key: minio_access_secret\n      name: minio-credential\ncontainer:\n  args:\n  - upserts\n  env:\n  - name: DATE\n    value: \"20181202\"\n  - name: AZURE_STORAGE_ACCOUNT_NAME\n    valueFrom:\n      secretKeyRef:\n        key: account_name\n        name: storage-account\n  - name: AZURE_STORAGE_ACCOUNT_KEY\n    valueFrom:\n      secretKeyRef:\n        key: account_key\n        name: storage-account\n  image: eimacraue01.azurecr.io/poc:download-file-blob\n  imagePullPolicy: Always\n  name: \"\"\n  resources: {}\ninputs:\n  parameters:\n  - name: prefix\n    value: upserts\n  - name: date\n    value: \"20181202\"\nmetadata:\n  labels:\n    workflowName: basic-etl-zgd7c\nname: download-file-blob\noutputs:\n  artifacts:\n  - name: files\n    path: /data/files.tar.gz\n"
time="2018-12-12T22:29:07Z" level=info msg="Waiting on main container"
time="2018-12-12T22:29:08Z" level=info msg="main container started with container ID: 7776d8e74a223b028a58cc6f9f5efab7a34098162e810164abd76be077f2d093"
time="2018-12-12T22:29:08Z" level=info msg="Starting annotations monitor"
time="2018-12-12T22:29:08Z" level=fatal msg="no space left on device"

When I search the Argo source code, the error appears to be raised here: https://github.com/argoproj/argo/blob/master/workflow/executor/executor.go#L744

The node has plenty of disk space, so that is not the cause. The issue appears to be related to fsnotify/inotify on the host (maybe). It seems to correlate with this issue: https://github.com/google/cadvisor/issues/1581

I am no expert in this, just trying to join the dots :)

What you expected to happen: I expect the workflow step to run successfully

How to reproduce it (as minimally and precisely as possible): I have produced a repro with simple alpine containers that sleep. This mirrors my company's workflow; I've just stripped out the sensitive secret mounts and private Docker images. I can reproduce the issue with this YAML file:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: basic-etl-
spec:
  entrypoint: etl-all
  templates:
  - name: etl-all
    dag:
      tasks:
      - name: start-upserts
        template: etl-upserts
        arguments:
          parameters: [{name: prefix, value: "{{item.prefix}}" },{name: date, value: "{{item.date}}"}]
        withItems:
        - { prefix: 'upserts', date: '20181201' } 
        - { prefix: 'upserts', date: '20181202' } 
  - name: etl-upserts
    inputs:
      parameters:
      - name: prefix
      - name: date
    dag:
      tasks:
      - name: get-upserts-file
        template: download-file-blob
        arguments:
          parameters: [{name: prefix, value: "{{inputs.parameters.prefix}}"},{name: date, value: "{{inputs.parameters.date}}"}]
      - name: sanitise-files
        template: sanitise-files
        dependencies: [get-upserts-file]
      - name: upload-google-storage
        template: upload-google-storage
        dependencies: [sanitise-files]
      - name: load-bigquery-data
        template: load-bigquery-data
        dependencies: [upload-google-storage]
  - name: download-file-blob
    inputs:
      parameters:
      - name: prefix
      - name: date
    metadata:
      labels:
        workflowName: "{{workflow.name}}"
    container:
      image: alpine:3.8
      command: [sh, -c]
      args: ["sleep 15s"]     
      env:
      - name: DATE
        value: "{{inputs.parameters.date}}"
  - name: sanitise-files
    metadata:
      labels:
        workflowName: "{{workflow.name}}"
    container:
      image: alpine:3.8
      command: [sh, -c]
      args: ["sleep 15s"]
      resources:
        limits:
          memory: "1Gi"
          cpu: "2000m"
  - name: upload-google-storage
    metadata:
      labels:
        workflowName: "{{workflow.name}}"
    container:
      image: alpine:3.8
      command: [sh, -c]
      args: ["sleep 15s"]
  - name: load-bigquery-data
    metadata:
      labels:
        workflowName: "{{workflow.name}}"
    container:
      env:
      - name: GCLOUD_KEY_FILE
        value: /secret/key.json
      image: alpine:3.8
      command: [sh, -c]
      args: ["sleep 15s"]

Anything else we need to know?:

Argo version:

curl -sSL https://raw.githubusercontent.com/argoproj/argo/release-2.2/manifests/install.yaml -o install.yaml

Environment:

- Kubernetes version: 1.11.5 (Azure AKS)

$ kubectl version -o yaml
clientVersion:
  buildDate: 2018-10-24T06:54:59Z
  compiler: gc
  gitCommit: 17c77c7898218073f14c8d573582e8d2313dc740
  gitTreeState: clean
  gitVersion: v1.12.2
  goVersion: go1.10.4
  major: "1"
  minor: "12"
  platform: linux/amd64
serverVersion:
  buildDate: 2018-11-26T14:31:35Z
  compiler: gc
  gitCommit: 753b2dbc622f5cc417845f0ff8a77f539a4213ea
  gitTreeState: clean
  gitVersion: v1.11.5
  goVersion: go1.10.3
  major: "1"
  minor: "11"
  platform: linux/amd64


Other debugging information (if applicable):
- workflow result:

The error seems misleading:

$ argo get
Name:           basic-etl-wgjc5
Namespace:      marcel
ServiceAccount: default
Status:         Error
Created:        Wed Dec 12 23:31:02 +0000 (11 minutes ago)
Started:        Wed Dec 12 23:31:02 +0000 (11 minutes ago)
Finished:       Wed Dec 12 23:32:00 +0000 (10 minutes ago)
Duration:       58 seconds

STEP                                                PODNAME                     DURATION  MESSAGE
 ⚠ basic-etl-wgjc5
 ├-⚠ start-upserts(0:date:20181201,prefix:upserts)
 | ├-✔ get-upserts-file                             basic-etl-wgjc5-3588034397  21s
 | ├-✔ sanitise-files                               basic-etl-wgjc5-3374190847  18s
 | └-⚠ upload-google-storage                        basic-etl-wgjc5-1212580346  16s       failed to save outputs: verify serviceaccount marcel:default has necessary privileges
 └-⚠ start-upserts(1:date:20181202,prefix:upserts)
   └-⚠ get-upserts-file                             basic-etl-wgjc5-2717664139  21s       failed to save outputs: verify serviceaccount marcel:default has necessary privileges

- executor logs:

$ kubectl logs basic-etl-wgjc5-1212580346 -c init
Error from server (BadRequest): container init is not valid for pod basic-etl-wgjc5-1212580346

$ kubectl logs -c wait
time="2018-12-12T23:31:45Z" level=info msg="Creating a docker executor"
time="2018-12-12T23:31:45Z" level=info msg="Executor (version: v2.2.1, build_date: 2018-10-11T16:27:29Z) initialized with template:\narchiveLocation:\n  s3:\n    accessKeySecret:\n      key: minio_access_key\n      name: minio-credential\n    bucket: my-bucket\n    endpoint: minio.argo:9000\n    insecure: true\n    key: basic-etl-wgjc5/basic-etl-wgjc5-1212580346\n    secretKeySecret:\n      key: minio_access_secret\n      name: minio-credential\ncontainer:\n  args:\n  - sleep 15s\n  command:\n  - sh\n  - -c\n  image: alpine:3.8\n  name: \"\"\n  resources: {}\ninputs: {}\nmetadata:\n  labels:\n    workflowName: basic-etl-wgjc5\nname: upload-google-storage\noutputs: {}\n"
time="2018-12-12T23:31:45Z" level=info msg="Waiting on main container"
time="2018-12-12T23:31:46Z" level=info msg="main container started with container ID: 33f08e63a1f7d1aca812e3db9c2dfd292e786000d61a10fa943f3bf442892cd8"
time="2018-12-12T23:31:46Z" level=info msg="Starting annotations monitor"
time="2018-12-12T23:31:46Z" level=fatal msg="no space left on device"

ObeA commented 5 years ago

I have run into the same issue over here. Would love to see this fixed.

Update: this issue seems to be specific to a particular node. Making the node unschedulable, restarting it, and rerunning the workflow seems to solve the problem. Perhaps it's something weird in Kubernetes?

goatsthatcode commented 5 years ago

Can confirm we have run into this bug as well, running k8s 1.11.x on AWS EKS with Argo v2.2.0: same error message and same symptoms (more or less).

ObeA commented 5 years ago

This issue could be related to https://github.com/moby/moby/issues/29638. Restarting the node fixed it for us, but it isn't really feasible in the long run.

jessesuen commented 5 years ago

Some background about the implementation. The controller needs a way to signal the wait sidecar to stop its pod (for example, when a step is terminated).

The signalling mechanism is that the controller updates an annotation on the workflow pod with an indication that it should stop. The wait sidecar will (eventually) notice this via two different mechanisms:

  1. The annotations are projected into a file using a Downward API volume. argoexec watches this file with fsnotify to notice changes and pick up the stop signal (see the sketch just after this list).
  2. The wait sidecar receives a SIGUSR2 signal from the controller, which tells it to re-query the pod via the K8s API to pick up any changes to the annotation. If this signal is missed, the wait sidecar will eventually notice via mechanism 1.
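For illustration, this is roughly what mechanism 1 looks like (a simplified sketch with made-up names and paths, not the actual executor code). "no space left on device" is ENOSPC, which is what inotify_add_watch returns when the node's fs.inotify.max_user_watches limit is exhausted, so if the fsnotify theory holds, the equivalent of watcher.Add below is where it blows up:

```go
package main

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

func main() {
	// Hypothetical path; in argoexec this is the file the Downward API
	// volume projects the pod annotations into.
	const annotationsPath = "/tmp/annotations"

	watcher, err := fsnotify.NewWatcher() // inotify_init under the hood
	if err != nil {
		log.Fatalf("failed to create watcher: %v", err)
	}
	defer watcher.Close()

	// inotify_add_watch: returns ENOSPC ("no space left on device") when
	// the host has run out of inotify watches.
	if err := watcher.Add(annotationsPath); err != nil {
		log.Fatalf("failed to watch %s: %v", annotationsPath, err)
	}

	for ev := range watcher.Events {
		if ev.Op&fsnotify.Write == fsnotify.Write {
			// Re-read the annotations file and look for the stop signal.
			log.Printf("annotations file changed: %s", ev.Name)
		}
	}
}
```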

In practice, the second mechanism is usually how the wait sidecar is notified to stop, because we have observed that it can take several minutes for the projected annotation file to be updated. fsnotify is therefore not on the critical path for receiving updates, and we can switch to something else, such as polling the file every 20 seconds or so.

That said, at this point, "no space left on device" being caused by fsnotify against the annotation file is just a theory that needs to be tested. Since we're unable to reproduce this ourselves, we can provide an image with a potential fix that uses polling instead of fsnotify.
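For reference, a minimal sketch of what the polling approach could look like (made-up names and paths; not the actual change):

```go
package main

import (
	"bytes"
	"log"
	"os"
	"time"
)

// pollAnnotations re-reads the projected annotations file on an interval
// and invokes onChange whenever its contents differ from the last read.
// No inotify watches are consumed, so the ENOSPC failure mode goes away.
func pollAnnotations(path string, interval time.Duration, stop <-chan struct{}, onChange func([]byte)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	var last []byte
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			data, err := os.ReadFile(path)
			if err != nil {
				log.Printf("failed to read %s: %v", path, err)
				continue
			}
			if !bytes.Equal(data, last) {
				last = data
				onChange(data) // annotation changed; check for the stop/terminate signal
			}
		}
	}
}

func main() {
	stop := make(chan struct{})
	// Hypothetical path and interval, mirroring the ~20s polling mentioned above.
	go pollAnnotations("/tmp/annotations", 20*time.Second, stop, func(b []byte) {
		log.Printf("annotations changed (%d bytes)", len(b))
	})
	select {}
}
```

Polling is slightly less responsive than inotify, but since SIGUSR2 remains the fast path, an interval of around 20 seconds should be acceptable.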

goatsthatcode commented 5 years ago

I don't know if it's of any help, but for context: we have the same workflow running in two different clusters, a dev cluster and a prod cluster. Both clusters are the same in terms of Argo version and Kubernetes version, but the dev cluster runs significantly more workflows than the prod one, and this issue is only present in the dev cluster.

The only other difference of note is that developers have an RBAC role that allows them to submit workflows directly in the dev cluster, but I'm wondering if this is somehow correlated with the number of workflows being run (possibly explaining why @jessesuen is having trouble reproducing). I am also having a hard time reproducing this outside of the one cluster that is affected.

Please let me know if I can be of any more help!

EDIT: a little more context... this is a workflow that spawns 30+ pods to process a bunch of weather files in parallel. Failure of this nature is intermittent and only affects some of the pods. Below is the output from a failed and a successful 'wait' container in different pods of the same workflow:

failed:

time="2019-01-25T22:24:31Z" level=info msg="Creating a docker executor"
time="2019-01-25T22:24:31Z" level=info msg="Executor (version: v2.2.0, build_date: 2018-08-30T08:52:54Z) initialized with template:\narchiveLocation: {}\ninputs:\n  parameters:\n  - name: forecastDirName\n    value: norcal-v0.4/sdwx-nca-op-2019-01-25-thru-2019-01-26\n  - name: gribNcep\n    value: https://nomads.ncep.noaa.gov/pub/data/nccf/com/nam/prod/nam.20190125/nam.t12z.awphys17.tm00.grib2\n  - name: gribSd\n    value: s3://S3_BUCKET_REDACTED/weatherdata/nam/2019/01/25/12/native/nam.t12z.awip32.0p25.f017.2019.01.25\n  - name: gribUn\n    value: s3://S3_BUCKET_REDACTED/nam/2019/01/25/12/ungrib/FILE:2019-01-26_05\nmetadata: {}\nname: where-to-get-file\noutputs: {}\nscript:\n  command:\n  - /bin/sh\n  env:\n  - name: AWS_ACCESS_KEY_ID\n    valueFrom:\n      secretKeyRef:\n        key: access_key\n        name: aws-bucket-creds\n  - name: AWS_SECRET_ACCESS_KEY\n    valueFrom:\n      secretKeyRef:\n        key: secret_key\n        name: aws-bucket-creds\n  - name: AWS_DEFAULT_REGION\n    value: us-west-1\n  image: mesosphere/aws-cli:1.14.5\n  name: \"\"\n  resources: {}\n  source: |\n    # Case this is a ungrib file you cannot create in parallel, we'll deal with this later\n    if [ -z https://nomads.ncep.noaa.gov/pub/data/nccf/com/nam/prod/nam.20190125/nam.t12z.awphys17.tm00.grib2 ]\n    then\n      echo '{\"gribNcepDownload\": \"False\", \"gribSdDownload\": \"False\", \"gribNcep\": \"\", \"gribSd\": \"\", \"gribUn\": \"\"}'\n      exit 0\n    # Case we already have the ungribbed file\n    elif aws s3 ls s3://S3_BUCKET_REDACTED/nam/2019/01/25/12/ungrib/FILE:2019-01-26_05 > /dev/null\n    then\n      echo '{\"gribNcep\": \"\", \"gribSd\": \"\", \"gribUn\": \"\", \"gribNcepDownload\": \"False\", \"gribSdDownload\": \"False\"}'\n      exit 0\n    # Case where the SD grib file exists but has not been ungribbed\n    elif aws s3 ls s3://S3_BUCKET_REDACTED/weatherdata/nam/2019/01/25/12/native/nam.t12z.awip32.0p25.f017.2019.01.25 > /dev/null\n    then\n      echo '{\"gribNcepDownload\": \"False\", \"gribSdDownload\": \"True\", \"gribSd\": \"s3://S3_BUCKET_REDACTED/weatherdata/nam/2019/01/25/12/native/nam.t12z.awip32.0p25.f017.2019.01.25\", \"gribNcep\": \"\", \"gribUn\": \"s3://S3_BUCKET_REDACTED/nam/2019/01/25/12/ungrib/FILE:2019-01-26_05\"}'\n      exit 0\n    else\n      echo '{\"gribNcepDownload\": \"True\", \"gribSdDownload\": \"False\", \"gribSd\": \"s3://S3_BUCKET_REDACTED/weatherdata/nam/2019/01/25/12/native/nam.t12z.awip32.0p25.f017.2019.01.25\", \"gribNcep\": \"https://nomads.ncep.noaa.gov/pub/data/nccf/com/nam/prod/nam.20190125/nam.t12z.awphys17.tm00.grib2\", \"gribUn\": \"s3://S3_BUCKET_REDACTED/nam/2019/01/25/12/ungrib/FILE:2019-01-26_05\"}'\n    fi\n  volumeMounts:\n  - mountPath: /forecast\n    name: wrf-forecast\n"
time="2019-01-25T22:24:31Z" level=info msg="Waiting on main container"
time="2019-01-25T22:24:33Z" level=info msg="main container started with container ID: b69e2e0a8dd2205e9451b59abbad0655b4967fd8414b8722ac23c8e68cfbc37c"
time="2019-01-25T22:24:33Z" level=info msg="Starting annotations monitor"
time="2019-01-25T22:24:33Z" level=fatal msg="no space left on device"

success: (Executor log lines redacted as they are the same)

 time="2019-01-25T22:24:52Z" level=info msg="Waiting on main container"
time="2019-01-25T22:25:50Z" level=info msg="main container started with container ID: cdc31e5028cde621df17c691c5fb844470675b3338b3c54f02dc001504c7353b"
time="2019-01-25T22:25:50Z" level=info msg="Starting annotations monitor"
time="2019-01-25T22:25:50Z" level=info msg="docker wait cdc31e5028cde621df17c691c5fb844470675b3338b3c54f02dc001504c7353b"
time="2019-01-25T22:25:50Z" level=info msg="Starting deadline monitor"
time="2019-01-25T22:25:50Z" level=info msg="Main container completed"
time="2019-01-25T22:25:50Z" level=info msg="No sidecars"
time="2019-01-25T22:25:50Z" level=info msg="No output artifacts"
time="2019-01-25T22:25:50Z" level=info msg="No output parameters"
time="2019-01-25T22:25:50Z" level=info msg="Capturing script output"
time="2019-01-25T22:25:50Z" level=info msg="[docker logs cdc31e5028cde621df17c691c5fb844470675b3338b3c54f02dc001504c7353b]"
time="2019-01-25T22:25:50Z" level=info msg="Annotations monitor stopped"
time="2019-01-25T22:25:50Z" level=info msg="Annotating pod with output"
time="2019-01-25T22:25:50Z" level=info msg="Alloc=3755 TotalAlloc=17166 Sys=13482 NumGC=7 Goroutines=7"
alexmt commented 5 years ago

Hello @lb-saildrone, I've implemented the changes that @jessesuen described in the previous message (PR: https://github.com/argoproj/argo/pull/1194).

Since we are not sure whether this will really help, it would be great if you could help us test it. I've cherry-picked the fix to the v2.2 branch and built the executor Docker image alexmt/argoexec:v2.2.1-50dddbf. It would be great if you could configure the workflow-controller in your dev cluster to use alexmt/argoexec:v2.2.1-50dddbf and see if it solves this issue.

goatsthatcode commented 5 years ago

Yes, absolutely @alexmt! I looked over the PR; it was good to have some context on how that annotation gets written.

I will give it a try this afternoon and report back!

goatsthatcode commented 5 years ago

@alexmt I'm running the workflows now, as I didn't get to it yesterday. So far the first few tests look successful! I'll let this run a little longer and see if any issues come up.

alexmt commented 5 years ago

@lb-saildrone thanks a lot for testing it! Annotations get written to a file using the Kubernetes Downward API. The workflow controller may change an annotation to signal to the executor that the step was terminated.

goatsthatcode commented 5 years ago

@alexmt as of today it is still running without any unexpected behavior. Let me know if I can do anything else to validate this for you.

alexmt commented 5 years ago

Hello @lb-saildrone, thanks for confirming that the fix helped in your case. The PR has been merged into master and will be available in the next release. I think we can close the issue for now and reopen it if "no space left on device" happens again.

Mohjive commented 4 years ago

Just ran into similar issue with "no space left on device", but with logs and found this issue. Looking at the old code, not being an expert Golang dev and being late to the party, is it guaranteed that the watcher handle is always closed? Is this row always called: https://github.com/lippertmarkus/argo/blame/5cae6c6a3ffe45c340cb6fe2341ddaac0ec7d16a/workflow/executor/executor.go#L827 or is there other exit points, like the executor being killed, leading to leaked handles? Shouldn't the close action be defered in the goroutine instead?