Open jsolbrig opened 2 years ago
@jsolbrig Thanks for the detailed report. In addition to the workaround you listed, the k8s module has a continue_on_error parameter designed for this case. When set to true, a resource action that fails, including waiting on that resource, will not block progress on other resources. I believe that between your workaround and the continue_on_error parameter, this should provide the flexibility to address this issue?
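For reference, a task using that parameter could look roughly like the sketch below; the file name and timeout mirror the reproduction later in this issue, and the exact form is only a sketch, not a tested example.

- name: Apply all resources, continuing past individual failures
  kubernetes.core.k8s:
    state: present
    definition: "{{ lookup('file', 'wait_bug.yaml') }}"
    continue_on_error: yes
    wait: yes
    wait_timeout: 120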
I think you're right that between continue_on_error and my workaround it should be enough to address the issue, but if that is the way it should work, it should probably be documented. It's not too hard to figure out how to work around the problem, but I'd hate for everyone who tries to use the package the same way as me to have to fumble with it long enough to find the workaround.
Something just feels unintuitive about needing to "apply" the resources twice, once to get them to apply and once to watch them. It feels like it should be possible to do that in one task. Maybe a flag that tells the task to use the workaround in the background rather than blocking on each resource would be useful?
Thank you for the suggestion. At the moment, I think we would not consider changing the wait behavior as there are multiple ways to solve this problem. We can keep this open, though, to see if there is more interest in this. We may revisit the idea of different wait behavior in the future.
If there is a way the documentation could be made more clear, we would certainly welcome a PR for that.
I'd like to express my interest in this issue. It appears that this module applies resources one by one and hangs/fails if there are dependencies between them, even when the dependencies are between objects of the same type, e.g. a Pod requiring another Pod to be running. Using continue_on_error seems problematic here: it significantly increases execution time, because the module waits for the timeout on each resource whose dependencies are defined later, and it requires further inspection of the resources to determine whether they were actually deployed correctly and that no other errors occurred.
I would say that a more intuitive behavior would be for the module to apply all the resources first and then wait for each of them to be created properly.
SUMMARY
I am attempting to deploy an application that contains multiple resources using kubernetes.core.k8s, where the definition is provided by a call to the kubernetes.core.kustomize lookup. Once the resources have been submitted to the cluster, I want to wait for the application's Pods to become ready before moving on to the next step.

What I'm noticing is that, when creating multiple resources, the "wait" occurs after each resource is created. This blocks subsequent resources from being created if there is a problem with an earlier resource, and it leads to deployment errors when one resource depends on another resource that hasn't been created yet.
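A minimal sketch of the kind of task described here, assuming a kustomize overlay directory (the path and timeout are placeholders):

- name: Deploy the application and wait for it to become ready
  kubernetes.core.k8s:
    state: present
    definition: "{{ lookup('kubernetes.core.kustomize', dir='overlays/dev') }}"
    wait: yes
    wait_timeout: 120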
To solve this, I'd like to suggest "wait" should not start watching resources until the entire set of resources in the definition has been applied to the cluster.
There is a reasonable workaround for this problem that I added to the end of this post.
ISSUE TYPE
COMPONENT NAME
k8s
ANSIBLE VERSION
COLLECTION VERSION
CONFIGURATION
OS / ENVIRONMENT
AlmaLinux 8
STEPS TO REPRODUCE
I've put together a simple case that encounters this problem and one that works as expected. The only difference is the order of the resources in the k8s resource definition.
Note that I'm replacing the use of kubernetes.core.kustomize here to simplify things. I'm using a file lookup to read the resource definitions from a file that contains multiple resources, to simulate the use of kustomize, which produces a list of resources.

Problematic Example
This example creates a Pod that tries to mount a ConfigMap. The Pod is applied to the cluster before the ConfigMap. This results in the Pod remaining in a ContainerCreating state. The task waits for the Pod to become "Ready" and never creates the ConfigMap.

bad_playbook.yaml containing:

wait_bug.yaml containing:

apiVersion: v1
kind: ConfigMap
metadata:
  name: wait-bug-cm
  namespace: wait-bug
data:
  test.txt: "this is a test"
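For context, a playbook along these lines matches the play and task names in the output below; the exact body of bad_playbook.yaml is an assumption, though the file lookup and the 120-second wait follow the reproduction steps.

- name: Stand up wait bug test
  hosts: localhost
  tasks:
    - name: Create wait bug test application
      kubernetes.core.k8s:
        state: present
        definition: "{{ lookup('file', 'wait_bug.yaml') }}"
        wait: yes
        wait_timeout: 120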
Create the wait-bug namespace, then attempt to create the test application. The task will time out after 120 seconds:
PLAY [Stand up wait bug test] **************************************************

TASK [Gathering Facts] *********************************************************
ok: [localhost]

TASK [Create wait bug test application] ****************************************
fatal: [localhost]: FAILED! => { "changed": true, "duration": 120, "method": "create", "result": {
...

MSG:

Resource creation timed out

PLAY RECAP *********************************************************************
localhost : ok=1 changed=0 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0
Check for the Pod. It will remain in a state of ContainerCreating.

Check for the ConfigMap. It won't exist.
Working Example
This example creates a ConfigMap and then creates a Pod that mounts that ConfigMap. Both resources get created correctly because they are defined in the order in which they are needed, even though order should not matter.
working_playbook.yaml containing:

wait_working.yaml containing:

apiVersion: v1
kind: Pod
metadata:
  name: wait-bug-test
  namespace: wait-bug
spec:
  containers:
    ...
        cat /mnt/test.txt; if [[ $? -ne 0 ]]; then exit 1; fi; sleep 1; done"]
      volumeMounts:
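For completeness, a full wait_working.yaml along these lines reproduces the working case; the ConfigMap and the Pod name match the fragments above, while the container image, the loop wrapper around the command, and the volume name are assumptions.

apiVersion: v1
kind: ConfigMap
metadata:
  name: wait-bug-cm
  namespace: wait-bug
data:
  test.txt: "this is a test"
---
apiVersion: v1
kind: Pod
metadata:
  name: wait-bug-test
  namespace: wait-bug
spec:
  containers:
    - name: wait-bug-test
      # Assumed image: any image providing bash works for the [[ ]] test below
      image: docker.io/library/bash:latest
      # Assumed wrapper: loop forever, exiting if the mounted file can no longer be read
      command: ["bash", "-c", "while true; do cat /mnt/test.txt; if [[ $? -ne 0 ]]; then exit 1; fi; sleep 1; done"]
      volumeMounts:
        - name: test-config        # assumed volume name
          mountPath: /mnt
  volumes:
    - name: test-config
      configMap:
        name: wait-bug-cm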
EXPECTED RESULTS
I would expect the ConfigMap and the Pod to deploy correctly regardless of the order in which they are defined.
ACTUAL RESULTS
Only one of the two examples worked correctly.
Workaround
A workaround that appears to work well is to change the playbook to have two tasks: a first task that applies all of the resources with

wait: no

and a second task that applies the same definition again with

wait: yes

This will work regardless of the order in which the resources are defined.
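A minimal sketch of that two-task workaround, using the same file lookup as the reproduction above (task names are placeholders):

- name: Apply the resources without waiting
  kubernetes.core.k8s:
    state: present
    definition: "{{ lookup('file', 'wait_bug.yaml') }}"
    wait: no

- name: Apply the same resources again, waiting for them to become ready
  kubernetes.core.k8s:
    state: present
    definition: "{{ lookup('file', 'wait_bug.yaml') }}"
    wait: yes
    wait_timeout: 120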