dleehr opened 5 years ago
Logs from running sleep-echo: errors.txt
Some ideas after discussing with @hlapp:
We're certainly eagerly watching logs, and I know we've encountered issues in other projects when expecting streaming connections (e.g. RabbitMQ in lando) to be held open indefinitely. So after the easy/obvious steps of upgrading dependencies and k8s itself, it's probably worth looking at the watch/log connections.
I tried running the sleep-echo workflow after upgrading the Google k8s cluster software to 1.13.7-gke.19. The cluster became unresponsive after I started the sleep-echo workflow job. I noticed there was a message about "Upgrading master" in the Google Cloud console. I let it sit for a couple of minutes and the cluster came back. The sleep-echo workflow job was still there and had a failed status.
I ran the sleep-echo workflow again and this time it failed with a cwltool error that a javascript expression took too long. The cluster remained responsive this time.
I ran the sleep-echo workflow two more times; it worked both times and the cluster remained responsive.
I also ran the sleep-echo workflow against our hardspin OpenShift test cluster and it worked fine.
So there's certainly a possibility that all of the troubles are due to a cluster upgrade happening in the middle of a job - not at all related to the workload. A surprise cluster upgrade certainly explains the behavior we saw.
Great detective work, seems like the auto-upgrade is causing this.
The docs say:
When you create a new cluster using Google Cloud Platform Console, node auto-upgrade is enabled by default.
And when creating a cluster on CLI, I see:
WARNING: In June 2019, node auto-upgrade will be enabled by default for newly created clusters and node pools. To disable it, use the `--no-enable-autoupgrade` flag.
I suggest we disable auto upgrade when creating these clusters, in addition to adding retry logic.
I think there are two different "master upgrades" that display the same way within the Google Cloud console. They both show the following message in https://console.cloud.google.com:

Upgrading cluster master.
The values shown below are going to change soon.
There are commands to perform the software upgrade via the command line or through the Google Cloud console. My assumption is that the second kind of master upgrade is Google migrating the master to a bigger server, since workflows seem to run fine after this upgrade. I haven't been able to find a flag or command to perform this upgrade manually.
I see - so it's not necessarily due to a software version upgrade, it may be responding to the number of pods and migrating to a larger master.
So, disabling the automatic software upgrade wouldn't prevent GKE from migrating to a bigger master.
Do you have a screenshot on where you see that message?
Thanks. Your findings match up with the behavior described in this issue, and it seems to be the designed/intended behavior: https://github.com/terraform-providers/terraform-provider-google/issues/3385.
And that makes sense - the master is dynamic and may be unavailable while it's scaling up to manage more nodes. That can certainly happen on other cloud platforms, so we should handle it better :)
I came across some documentation that seems related to "Upgrading cluster master": https://github.com/GoogleCloudPlatform/gke-stateful-applications-demo#deployment-steps
It refers to the k8s master as the control plane. If you spawn a cluster with 6 nodes it may go into a 'RECONCILING' status as the control plane's instance size may be increased. I found that you can see this status in gcloud container clusters list while the cluster master is being upgraded.
I tried starting a cluster with 6 small nodes and only saw failures due to javascript taking too long. This is more of a work-around than actually addressing the core issue of being able to handle cluster upgrades.
I added the calrissian --eval-timeout 600 flag when running the sleep-echo workflow on a Google Cloud cluster started with 6 small nodes and a larger pool, and it ran to completion with no errors. The larger pool scaled up during the job.
The --eval-timeout flag was added to avoid the Long-running script killed exception. The cwltool logic that checks this timeout just waits a number of seconds:
https://github.com/common-workflow-language/cwltool/blob/a9078408aafdee0e0f7c27954fb80ef1a0c39e9c/cwltool/sandboxjs.py#L176-L186
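The pattern in the linked code can be sketched as a timer-based kill of the javascript subprocess. This is a simplified illustration of the mechanism (the function name and structure here are not cwltool's actual API):

```python
# Sketch of a timeout check around a subprocess, in the spirit of the
# linked sandboxjs.py logic: start a timer, kill the child if it fires,
# and raise the "Long-running script killed" error the logs showed.
import subprocess
import sys
import threading

def run_with_timeout(cmd, timeout_seconds):
    """Run cmd, killing it if it exceeds timeout_seconds (like --eval-timeout)."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    timed_out = []

    def kill_on_timeout():
        timed_out.append(True)
        proc.kill()

    timer = threading.Timer(timeout_seconds, kill_on_timeout)
    timer.start()
    try:
        stdout, _ = proc.communicate()
    finally:
        timer.cancel()
    if timed_out:
        raise RuntimeError("Long-running script killed")
    return stdout
```

Raising --eval-timeout simply lengthens this timer; it doesn't make the per-invocation nodejs overhead any smaller.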
Running the full 24 exome sample workflow failed even with the above tweaks. There were a handful of errors but the first one was:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/cwltool/executors.py", line 264, in runner
    job.run(runtime_context)
  File "/usr/local/lib/python3.6/site-packages/calrissian/job.py", line 593, in run
    completion_result = self.wait_for_kubernetes_pod()
  File "/usr/local/lib/python3.6/site-packages/calrissian/job.py", line 372, in wait_for_kubernetes_pod
    return self.client.wait_for_completion()
  File "/usr/local/lib/python3.6/site-packages/calrissian/k8s.py", line 151, in wait_for_completion
    monitor.remove(pod)
  File "/usr/local/lib/python3.6/site-packages/calrissian/k8s.py", line 268, in remove
    PodMonitor.pod_names.remove(pod.metadata.name)
ValueError: list.remove(x): x not in list
I was able to look at logging around this error via the Google Cloud Console. In Stackdriver Error Reporting I clicked "view logs" for the error, which takes you to Stackdriver Logging. I widened the time limit field to +/- 1 hour and cleared out the error-specific filter. The above ValueError occurred after a successful trim command, when we tried to remove its pod from the list.
Side Thought: I noticed in the Stackdriver Logging each log entry has a good deal of data stored with it. I was wondering if the quantity of logging we do to stdout/stderr might be part of the larger problem.
Attempting to summarize the different types of errors seen in the Google Cloud error logs:
- Connection errors
- Blocking IO errors
- kubernetes.client.rest.ApiException 400, 404 and 500
- ValueError: list.remove(x): x not in list
- Errors within subprocesses
- Errors collecting output
- Javascript timeout errors - by using --eval-timeout 600 these can be avoided

Some of these exceptions seem to be side effects of others. For example, ValueError: list.remove(x): x not in list seems to be a side effect of wait_for_completion exiting early.
https://github.com/Duke-GCB/calrissian/blob/a249143087d3409414c474d92a04f0be8b794f3b/calrissian/k8s.py#L132-L134
Typically w.stream blocks until we run w.stop() inside wait_for_completion. It looks like when we lose the connection to the k8s server, it returns (or throws an exception). This in turn causes an early call to PodMonitor.cleanup() that eventually results in the ValueError above.
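A minimal sketch of a defensive fix for that ValueError: when cleanup() has already removed the pod name, a bare list.remove() raises, so checking membership first keeps a duplicate remove from crashing the workflow. This PodMonitor is a simplified stand-in, not calrissian's actual class:

```python
# Simplified stand-in for calrissian's PodMonitor, showing a guarded
# remove() so a duplicate removal (after an early cleanup()) does not
# raise ValueError: list.remove(x): x not in list.
class PodMonitor:
    pod_names = []

    @classmethod
    def add(cls, pod_name):
        cls.pod_names.append(pod_name)

    @classmethod
    def remove(cls, pod_name):
        if pod_name in cls.pod_names:
            cls.pod_names.remove(pod_name)
        else:
            # Already removed by an earlier cleanup() - log and continue
            print('pod {} not in list, ignoring'.format(pod_name))
```

This only suppresses the symptom, though; the underlying early return from the watch stream would still need its own handling.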
Researched errors and tried to find the source/cause.
Attempting to summarize the different types of errors seen in the Google Cloud error logs.
Connection Errors
- [Errno 111] Connection refused
- urllib3.exceptions.ProtocolError 'Connection broken: IncompleteRead'
Should be addressed in k8s.py - anywhere we make an API call to kubernetes, check for errors and possibly retry
Blocking IO Errors
- BlockingIOError: [Errno 11] write could not complete without blocking at flush (/usr/local/lib/python3.6/logging/init.py:978) at emit (/usr/local/lib/python3.6/logging/init.py:998)
Could be caused by sys.stdout and sys.stderr being reopened as tee subprocesses, one of which could be blocking.
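The mechanism behind this error can be reproduced outside calrissian: if a write lands on a file descriptor that is in non-blocking mode (as can happen when stdout/stderr are redirected through a pipe to a tee process) and the pipe buffer is full, the write raises errno 11. A minimal demonstration with a plain os.pipe, standing in for the redirected stdout/stderr:

```python
import os

# Reproduce "[Errno 11] write could not complete without blocking":
# writes to a full non-blocking pipe raise BlockingIOError.
read_fd, write_fd = os.pipe()
os.set_blocking(write_fd, False)
try:
    while True:
        os.write(write_fd, b"x" * 65536)  # eventually fills the pipe buffer
except BlockingIOError as exc:
    print("write could not complete without blocking, errno", exc.errno)
finally:
    os.close(read_fd)
    os.close(write_fd)
```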
kubernetes.client.rest.ApiException 400, 404 and 500
- Example: read_namespaced_pod_log, but also seen elsewhere
- ValueError: list.remove(x): x not in list
Errors within subprocesses
- java.io.EOFException: Unexpected end of ZLIB input stream at uk.ac.babraham.FastQC.Utilities.MultiMemberGZIPInputStream.read (MultiMemberGZIPInputStream.java:68) at uk.ac.babraham.FastQC.Sequence.FastQFile.readNext (FastQFile.java:138) at uk.ac.babraham.FastQC.Sequence.FastQFile.next (FastQFile.java:125) at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run (AnalysisRunner.java:76) at java.lang.Thread.run (Thread.java:748)
Seems to be downstream of an earlier failure in concat-gz-files. I believe that script needs to set pipefail: https://github.com/bespin-workflows/exomeseq-gatk4/blob/develop/tools/concat-gz-files.cwl
Errors collecting output
- File "", line 1, in : cwltool.errors.WorkflowException: ../input/workflow/sleep-echo.cwl:41:7: Expression evaluation error: at raise_from (/usr/local/lib/python3.6/site-packages/future/utils/init.py:400) at exit (/usr/local/lib/python3.6/site-packages/schema_salad/sourceline.py:298) at collect_output (/usr/local/lib/python3.6/site-packages/cwltool/command_line_tool.py:737) at collect_output_ports (/usr/local/lib/python3.6/site-packages/cwltool/command_line_tool.py:657)
- cwltool.errors.WorkflowException: Error collecting output for parameter 'output_qc_report': at exit (/usr/local/lib/python3.6/site-packages/schema_salad/sourceline.py:168) at collect_output_ports (/usr/local/lib/python3.6/site-packages/cwltool/command_line_tool.py:612) at run (/usr/local/lib/python3.6/site-packages/cwltool/command_line_tool.py:191) at runner (/usr/local/lib/python3.6/site-packages/cwltool/executors.py:264)
Errors collecting output usually indicate an earlier failure. We should make sure early tools are returning failure codes when they fail (concat-gz-output: pipefail) and revisit this.
Javascript Timeout errors
- These are from cwltool logic. By using the --eval-timeout 600 flag these can be avoided.
This is odd - the expressions are simple, but taking over 20 seconds to run? $(inputs.message + '.txt'). We notice in sandboxjs.py that there is a lot of overhead to running nodejs (a separate process for each invocation and multiple threads). For now we can work around it by raising the timeout, but this doesn't seem very efficient.
Investigating this currently and have implemented some retry logic with tenacity. Have temporarily disabled deleting pods in an attempt to sort out the noise. I've found that in the sleep-echo workflow of 175 parallel tasks, the workflow doesn't finish because some of the pods get deleted as part of the node upgrade.
[sleep-echo-5-pod-llwypgls] follow_logs start
[sleep-echo-5-pod-llwypgls] 0: hello
[sleep-echo-5-pod-llwypgls] 1: hello
[sleep-echo-5-pod-llwypgls] 2: hello
[sleep-echo-5-pod-llwypgls] 3: hello
[sleep-echo-5-pod-llwypgls] 4: hello
[sleep-echo-5-pod-llwypgls] 5: hello
[sleep-echo-5-pod-llwypgls] 6: hello
[sleep-echo-5-pod-llwypgls] 7: hello
[sleep-echo-5-pod-llwypgls] 8: hello
[sleep-echo-5-pod-llwypgls] 9: hello
[sleep-echo-5-pod-llwypgls] rpc error: code = Unknown desc = Error: No such container: 7e9bb62be850fe96f94636cbec4c92559e9b65cb31b31bdd697f3746ed40cc48
[sleep-echo-5-pod-llwypgls] follow_logs end
[sleep-echo-5-pod-llwypgls] follow_logs start
Retrying calrissian.k8s.KubernetesClient.follow_logs in 5.0 seconds as it raised (404)
Reason: Not Found
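The retry behavior shown in that log can be sketched roughly as follows. The real code uses the tenacity library; ApiException here is a stand-in for kubernetes.client.rest.ApiException, and the names and defaults are illustrative, not calrissian's exact configuration:

```python
import functools
import time

# Stand-in for kubernetes.client.rest.ApiException
class ApiException(Exception):
    pass

def retry_on_api_error(attempts=3, wait_seconds=5.0):
    """Decorator: retry a k8s API call on ApiException with a fixed wait."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except ApiException as exc:
                    if attempt == attempts:
                        raise  # out of retries, propagate the original error
                    print('Retrying {} in {} seconds as it raised {}'.format(
                        func.__name__, wait_seconds, exc))
                    time.sleep(wait_seconds)
        return wrapper
    return decorator
```

Decorating something like follow_logs this way would produce the "Retrying ... in 5.0 seconds" messages above, but as the log shows, retrying alone can't help when the pod itself is gone.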
Initially this was baffling, but on further thought, kubernetes resources such as deployments, jobs, and replicasets are responsible for re-creating pods if they are deleted out from under them. So calrissian should be no different: if the pod is deleted out from under it, it should be responsible for re-submitting that pod.
I considered looking into pod disruption budgets, restartPolicy, or disabling auto-scaling, but at the end of the day, raw pods are not expected to be resilient. They're the lowest-level object, and we should either act as a manager of pods or use a Job if we want to delegate that responsibility. We decided in #23 to submit pods rather than jobs since it seemed a better fit.
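The "manager of pods" idea can be sketched as a decision function over watch events: keep waiting, finish, or resubmit because the pod was deleted out from under us. This helper and its return values are hypothetical names for illustration, not calrissian's actual API:

```python
# Hypothetical helper: given a kubernetes watch event for our pod, decide
# what the manager should do next.
def next_action(event_type, phase):
    """Return 'resubmit', 'finished' or 'wait' for a pod watch event.

    event_type is the watch event type (ADDED/MODIFIED/DELETED);
    phase is the pod's status.phase (Pending/Running/Succeeded/Failed).
    """
    if event_type == 'DELETED' and phase not in ('Succeeded', 'Failed'):
        # The pod was deleted before completing (e.g. during a node
        # upgrade) - act like a controller and submit a replacement.
        return 'resubmit'
    if phase in ('Succeeded', 'Failed'):
        return 'finished'
    return 'wait'
```

The key design point is the first branch: a DELETED event for an incomplete pod is not a failure of the tool, it's a signal to resubmit.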
When testing a moderately parallelized job on GKE (24 exome samples, 48 files), the cluster API becomes unresponsive and triggers failures in the workflow. It's not failing because of the computations.
This is easily reproducible with a simple but highly parallel workflow that generates 175 parallel containers: https://gist.github.com/dleehr/afdcde15aef9d727fd5226beddef126d
The above sleep-echo workflow doesn't seem to cause problems on a development cluster (Docker Desktop for Mac), so I'm surprised it can overwhelm a GKE cluster.
Some ideas:
- --parallel. While running, docker stats reported over 2000 PIDs for the calrissian container that orchestrates the workflow. This PID count includes processes and kernel threads. It's not clear that this is a problem, but it may be a symptom.