Closed: marcel-dempers closed this issue 5 years ago.
I have run into the same issue over here. Would love to see this fixed.
Update: This issue seems to be specific to a node. Making the node unschedulable, restarting the node and rerunning the workflow seems to solve this. Perhaps it's something weird in Kubernetes?
Can confirm we have run into this bug as well, running k8s 1.11.x on AWS EKS with Argo v2.2.0: same error message and same symptoms (more or less).
This issue could be related to https://github.com/moby/moby/issues/29638. Restarting the node fixed it for us, but it isn't really feasible in the long run.
Some background about the implementation. The controller needs a way to signal the wait sidecar to stop its pod. This is used in circumstances such as:
argo terminate WORKFLOW
The mechanism used to signal the pod to stop is that the controller updates an annotation on the workflow pod with an indication that it should stop. The wait sidecar will (eventually) notice this through more than one mechanism, one of which is an fsnotify watch on the annotation file mounted via the downward API.
In practice it is usually the other mechanism, not the fsnotify watch, that notifies the wait sidecar to stop, because from observation it can take several minutes for the pod annotation file to be updated. fsnotify is therefore not on the critical path for receiving updates, and we can switch to something else, such as polling the file every 20 seconds or so.
That said, at this point, "no space left on device" being caused by fsnotify against the annotation file is just a theory which needs to be tested. Since we're unable to reproduce this ourselves, we can provide an image which has a potential fix using polling instead of fsnotify.
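For illustration only, here is a minimal sketch of what "poll the annotations file instead of watching it with fsnotify" could look like. This is not the actual Argo executor code or the change in the PR; the file path, interval, and callback are assumptions.

```go
// Minimal sketch (assumed names and paths): re-read the downward API
// annotations file on a timer instead of holding an inotify/fsnotify watch.
package main

import (
	"context"
	"log"
	"os"
	"strings"
	"time"
)

// pollAnnotations re-reads the annotations file every interval and calls
// onChange whenever its contents differ from the previous read.
func pollAnnotations(ctx context.Context, path string, interval time.Duration, onChange func(content string)) {
	var last string
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			data, err := os.ReadFile(path)
			if err != nil {
				log.Printf("failed to read %s: %v", path, err)
				continue
			}
			if content := strings.TrimSpace(string(data)); content != last {
				last = content
				onChange(content)
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	// Hypothetical mount point for the downward API annotations file.
	pollAnnotations(ctx, "/argo/podmetadata/annotations", 20*time.Second, func(c string) {
		log.Printf("annotations changed:\n%s", c)
	})
}
```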
I don't know if it's of any help, but for context we have the same workflow running in two different clusters: a dev cluster and a prod cluster. Both clusters are the same (in terms of Argo version and Kubernetes version), but the dev cluster runs significantly more workflows than the prod one, and this issue is only present in the dev cluster, not the prod cluster.
The only other difference of note is that developers have an RBAC role that allows them to submit workflows directly in the dev cluster, but I'm wondering if the issue is somehow correlated to the number of workflows being run (which might explain why @jessesuen is having trouble reproducing). I am also having a hard time reproducing this outside of the one cluster that is affected.
Please let me know if I can be of any more help!
EDIT: a little more context... this is a workflow that spawns 30+ pods to process a bunch of weather files in parallel. Failures of this nature are intermittent and only affect some of the pods. Below is the output from a failed and a successful 'wait' container in different pods of the same workflow:
failed:
time="2019-01-25T22:24:31Z" level=info msg="Creating a docker executor"
time="2019-01-25T22:24:31Z" level=info msg="Executor (version: v2.2.0, build_date: 2018-08-30T08:52:54Z) initialized with template:\narchiveLocation: {}\ninputs:\n parameters:\n - name: forecastDirName\n value: norcal-v0.4/sdwx-nca-op-2019-01-25-thru-2019-01-26\n - name: gribNcep\n value: https://nomads.ncep.noaa.gov/pub/data/nccf/com/nam/prod/nam.20190125/nam.t12z.awphys17.tm00.grib2\n - name: gribSd\n value: s3://S3_BUCKET_REDACTED/weatherdata/nam/2019/01/25/12/native/nam.t12z.awip32.0p25.f017.2019.01.25\n - name: gribUn\n value: s3://S3_BUCKET_REDACTED/nam/2019/01/25/12/ungrib/FILE:2019-01-26_05\nmetadata: {}\nname: where-to-get-file\noutputs: {}\nscript:\n command:\n - /bin/sh\n env:\n - name: AWS_ACCESS_KEY_ID\n valueFrom:\n secretKeyRef:\n key: access_key\n name: aws-bucket-creds\n - name: AWS_SECRET_ACCESS_KEY\n valueFrom:\n secretKeyRef:\n key: secret_key\n name: aws-bucket-creds\n - name: AWS_DEFAULT_REGION\n value: us-west-1\n image: mesosphere/aws-cli:1.14.5\n name: \"\"\n resources: {}\n source: |\n # Case this is a ungrib file you cannot create in parallel, we'll deal with this later\n if [ -z https://nomads.ncep.noaa.gov/pub/data/nccf/com/nam/prod/nam.20190125/nam.t12z.awphys17.tm00.grib2 ]\n then\n echo '{\"gribNcepDownload\": \"False\", \"gribSdDownload\": \"False\", \"gribNcep\": \"\", \"gribSd\": \"\", \"gribUn\": \"\"}'\n exit 0\n # Case we already have the ungribbed file\n elif aws s3 ls s3://S3_BUCKET_REDACTED/nam/2019/01/25/12/ungrib/FILE:2019-01-26_05 > /dev/null\n then\n echo '{\"gribNcep\": \"\", \"gribSd\": \"\", \"gribUn\": \"\", \"gribNcepDownload\": \"False\", \"gribSdDownload\": \"False\"}'\n exit 0\n # Case where the SD grib file exists but has not been ungribbed\n elif aws s3 ls s3://S3_BUCKET_REDACTED/weatherdata/nam/2019/01/25/12/native/nam.t12z.awip32.0p25.f017.2019.01.25 > /dev/null\n then\n echo '{\"gribNcepDownload\": \"False\", \"gribSdDownload\": \"True\", \"gribSd\": \"s3://S3_BUCKET_REDACTED/weatherdata/nam/2019/01/25/12/native/nam.t12z.awip32.0p25.f017.2019.01.25\", \"gribNcep\": \"\", \"gribUn\": \"s3://S3_BUCKET_REDACTED/nam/2019/01/25/12/ungrib/FILE:2019-01-26_05\"}'\n exit 0\n else\n echo '{\"gribNcepDownload\": \"True\", \"gribSdDownload\": \"False\", \"gribSd\": \"s3://S3_BUCKET_REDACTED/weatherdata/nam/2019/01/25/12/native/nam.t12z.awip32.0p25.f017.2019.01.25\", \"gribNcep\": \"https://nomads.ncep.noaa.gov/pub/data/nccf/com/nam/prod/nam.20190125/nam.t12z.awphys17.tm00.grib2\", \"gribUn\": \"s3://S3_BUCKET_REDACTED/nam/2019/01/25/12/ungrib/FILE:2019-01-26_05\"}'\n fi\n volumeMounts:\n - mountPath: /forecast\n name: wrf-forecast\n"
time="2019-01-25T22:24:31Z" level=info msg="Waiting on main container"
time="2019-01-25T22:24:33Z" level=info msg="main container started with container ID: b69e2e0a8dd2205e9451b59abbad0655b4967fd8414b8722ac23c8e68cfbc37c"
time="2019-01-25T22:24:33Z" level=info msg="Starting annotations monitor"
time="2019-01-25T22:24:33Z" level=fatal msg="no space left on device"
success: (Executor log lines redacted as they are the same)
time="2019-01-25T22:24:52Z" level=info msg="Waiting on main container"
time="2019-01-25T22:25:50Z" level=info msg="main container started with container ID: cdc31e5028cde621df17c691c5fb844470675b3338b3c54f02dc001504c7353b"
time="2019-01-25T22:25:50Z" level=info msg="Starting annotations monitor"
time="2019-01-25T22:25:50Z" level=info msg="docker wait cdc31e5028cde621df17c691c5fb844470675b3338b3c54f02dc001504c7353b"
time="2019-01-25T22:25:50Z" level=info msg="Starting deadline monitor"
time="2019-01-25T22:25:50Z" level=info msg="Main container completed"
time="2019-01-25T22:25:50Z" level=info msg="No sidecars"
time="2019-01-25T22:25:50Z" level=info msg="No output artifacts"
time="2019-01-25T22:25:50Z" level=info msg="No output parameters"
time="2019-01-25T22:25:50Z" level=info msg="Capturing script output"
time="2019-01-25T22:25:50Z" level=info msg="[docker logs cdc31e5028cde621df17c691c5fb844470675b3338b3c54f02dc001504c7353b]"
time="2019-01-25T22:25:50Z" level=info msg="Annotations monitor stopped"
time="2019-01-25T22:25:50Z" level=info msg="Annotating pod with output"
time="2019-01-25T22:25:50Z" level=info msg="Alloc=3755 TotalAlloc=17166 Sys=13482 NumGC=7 Goroutines=7"
Hello @lb-saildrone, I've implemented the changes which @jessesuen described in the previous message (PR: https://github.com/argoproj/argo/pull/1194).
Since we are not sure if this is really going to help, it would be great if you could help us test it. I've cherry-picked the fix to the v2.2 branch and built the executor docker image alexmt/argoexec:v2.2.1-50dddbf. It would be great if you could configure the workflow-controller in your dev cluster to use alexmt/argoexec:v2.2.1-50dddbf and see if it helps to solve this issue.
Yes, absolutely @alexmt! I looked over the PR; it was good to have some context on how that annotation gets written.
I will give it a try this afternoon and report back!
@alexmt running the workflows now as I didn't get to it yesterday. So far the first few tests look successful! Will let this run a little longer and see if any issues come up!
@lb-saildrone thanks a lot for testing it! Annotations get written using the Kubernetes downward API. The workflow controller might change an annotation to signal the executor that the step was terminated.
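For anyone curious what "the controller changes an annotation to signal the executor" looks like mechanically, here is a minimal sketch using client-go. The namespace, pod name, annotation key, and value are placeholders (not Argo's real ones), and the Patch signature shown assumes a recent client-go that takes a context.

```go
// Minimal sketch with placeholder names: merge-patch an annotation onto the
// workflow pod; the kubelet then rewrites the downward API annotations file
// inside the pod, which is what the wait sidecar reads.
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	// Hypothetical annotation used as a stop signal.
	patch := []byte(fmt.Sprintf(`{"metadata":{"annotations":{%q:%q}}}`,
		"example.argoproj.io/signal", "terminate"))
	if _, err := client.CoreV1().Pods("argo").Patch(context.Background(),
		"my-workflow-pod", types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
		log.Fatal(err)
	}
}
```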
@alexmt as of today it is still running without any unexpected behavior. Let me know if I can do anything else to validate this for you.
Hello @lb-saildrone, thanks for confirming the fix helped in your case. The PR was merged into master and will be available in the next release. I think we can close the issue for now and reopen it if "no space left on device" happens again.
Just ran into a similar issue with "no space left on device", but with logs, and found this issue. Looking at the old code, not being an expert Golang dev and being late to the party: is it guaranteed that the watcher handle is always closed? Is this line always reached: https://github.com/lippertmarkus/argo/blame/5cae6c6a3ffe45c340cb6fe2341ddaac0ec7d16a/workflow/executor/executor.go#L827, or are there other exit points, like the executor being killed, that could lead to leaked handles? Shouldn't the close call be deferred in the goroutine instead?
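For illustration, here is a minimal sketch of the pattern that comment suggests. This is not the actual executor code; the function name, path, and callback are hypothetical. The point is that deferring Close inside the goroutine releases the inotify handle on every exit path, not just the happy one.

```go
// Minimal sketch (hypothetical names): defer the fsnotify watcher's Close
// inside the goroutine so the inotify handle cannot leak.
package main

import (
	"log"
	"time"

	"github.com/fsnotify/fsnotify"
)

func watchPath(path string, stop <-chan struct{}, onChange func()) error {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		return err // e.g. "no space left on device" when inotify instances are exhausted
	}
	if err := watcher.Add(path); err != nil {
		watcher.Close()
		return err
	}
	go func() {
		// Close is deferred here, so it runs no matter which branch exits the loop.
		defer watcher.Close()
		for {
			select {
			case <-stop:
				return
			case ev, ok := <-watcher.Events:
				if !ok {
					return
				}
				if ev.Op&fsnotify.Write == fsnotify.Write {
					onChange()
				}
			case werr, ok := <-watcher.Errors:
				if !ok {
					return
				}
				log.Printf("watch error: %v", werr)
			}
		}
	}()
	return nil
}

func main() {
	stop := make(chan struct{})
	defer close(stop)
	// Watch the current directory just to exercise the sketch.
	if err := watchPath(".", stop, func() { log.Println("file written") }); err != nil {
		log.Fatalf("failed to start watch: %v", err)
	}
	time.Sleep(5 * time.Second)
}
```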
Is this a BUG REPORT or FEATURE REQUEST?: BUG
What happened: When running a workflow, a step fails. The wait container throws an error, but the main container still runs successfully. Looking at the wait container's logs, it shows the fatal "no space left on device" error (see the kubectl logs -c wait output below).
When I search the Argo source code, I see the error is potentially raised here: https://github.com/argoproj/argo/blob/master/workflow/executor/executor.go#L744
The node has plenty of disk, so it is not that. The issue appears to be related to fsnotify on the host (maybe). It seems to correlate with this issue: https://github.com/google/cadvisor/issues/1581
I am no expert in this, just trying to join the dots :)
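To make the connection concrete: inotify returns ENOSPC, which surfaces as "no space left on device", when a per-user kernel limit such as fs.inotify.max_user_instances or max_user_watches is exhausted, regardless of free disk. The following is a small standalone sketch (not Argo code) that reproduces that error by creating fsnotify watchers until the limit is hit.

```go
// Minimal sketch: each fsnotify.NewWatcher() consumes an inotify instance;
// exceeding fs.inotify.max_user_instances yields "no space left on device".
package main

import (
	"fmt"

	"github.com/fsnotify/fsnotify"
)

func main() {
	var watchers []*fsnotify.Watcher
	for i := 0; ; i++ {
		w, err := fsnotify.NewWatcher()
		if err != nil {
			// On a default Linux host this typically fails around the value of
			// /proc/sys/fs/inotify/max_user_instances (often 128).
			fmt.Printf("watcher %d failed: %v\n", i, err)
			break
		}
		watchers = append(watchers, w)
	}
	for _, w := range watchers {
		w.Close()
	}
}
```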
What you expected to happen: I expect the workflow step to run successfully
How to reproduce it (as minimally and precisely as possible): I have produced a repro with simple alpine containers that sleep. This is similar to my company's workflow; I've just stripped out the sensitive secret mounts and private docker images. I can reproduce the issue with this YAML file:
Anything else we need to know?:
Argo version:
Environment:
$ kubectl version -o yaml
clientVersion:
  buildDate: 2018-10-24T06:54:59Z
  compiler: gc
  gitCommit: 17c77c7898218073f14c8d573582e8d2313dc740
  gitTreeState: clean
  gitVersion: v1.12.2
  goVersion: go1.10.4
  major: "1"
  minor: "12"
  platform: linux/amd64
serverVersion:
  buildDate: 2018-11-26T14:31:35Z
  compiler: gc
  gitCommit: 753b2dbc622f5cc417845f0ff8a77f539a4213ea
  gitTreeState: clean
  gitVersion: v1.11.5
  goVersion: go1.10.3
  major: "1"
  minor: "11"
  platform: linux/amd64
$ argo get
Error seems misleading:
Name: basic-etl-wgjc5
Namespace: marcel
ServiceAccount: default
Status: Error
Created: Wed Dec 12 23:31:02 +0000 (11 minutes ago)
Started: Wed Dec 12 23:31:02 +0000 (11 minutes ago)
Finished: Wed Dec 12 23:32:00 +0000 (10 minutes ago)
Duration: 58 seconds
STEP                                               PODNAME                     DURATION  MESSAGE
⚠ basic-etl-wgjc5
├-⚠ start-upserts(0:date:20181201,prefix:upserts)
| ├-✔ get-upserts-file                             basic-etl-wgjc5-3588034397  21s
| ├-✔ sanitise-files                               basic-etl-wgjc5-3374190847  18s
| └-⚠ upload-google-storage                        basic-etl-wgjc5-1212580346  16s       failed to save outputs: verify serviceaccount marcel:default has necessary privileges
└-⚠ start-upserts(1:date:20181202,prefix:upserts)
  └-⚠ get-upserts-file                             basic-etl-wgjc5-2717664139  21s       failed to save outputs: verify serviceaccount marcel:default has necessary privileges
$ kubectl logs -c init
kubectl logs basic-etl-wgjc5-1212580346 -c init
Error from server (BadRequest): container init is not valid for pod basic-etl-wgjc5-1212580346
$ kubectl logs -c wait
time="2018-12-12T23:31:45Z" level=info msg="Creating a docker executor"
time="2018-12-12T23:31:45Z" level=info msg="Executor (version: v2.2.1, build_date: 2018-10-11T16:27:29Z) initialized with template:\narchiveLocation:\n s3:\n accessKeySecret:\n key: minio_access_key\n name: minio-credential\n bucket: my-bucket\n endpoint: minio.argo:9000\n insecure: true\n key: basic-etl-wgjc5/basic-etl-wgjc5-1212580346\n secretKeySecret:\n key: minio_access_secret\n name: minio-credential\ncontainer:\n args:\n - sleep 15s\n command:\n - sh\n - -c\n image: alpine:3.8\n name: \"\"\n resources: {}\ninputs: {}\nmetadata:\n labels:\n workflowName: basic-etl-wgjc5\nname: upload-google-storage\noutputs: {}\n"
time="2018-12-12T23:31:45Z" level=info msg="Waiting on main container"
time="2018-12-12T23:31:46Z" level=info msg="main container started with container ID: 33f08e63a1f7d1aca812e3db9c2dfd292e786000d61a10fa943f3bf442892cd8"
time="2018-12-12T23:31:46Z" level=info msg="Starting annotations monitor"
time="2018-12-12T23:31:46Z" level=fatal msg="no space left on device"