Open gitskamzn opened 5 months ago
Hi @gitskamzn, Thanks for reaching out with this issue.
Hi @vigh-m - This is an ongoing issue that I see on instances running Bottlerocket OS versions 1.19.2 - 1.19.4. One of the instance types I see this on is m6i.4xlarge. These instances are part of EKS clusters that are launched using IaC.
Containerd is enabled for these instances and the host-containers settings are added as explained here --> https://bottlerocket.dev/en/os/1.19.x/api/settings/host-containers/#container_source
I can see the 10-second timeout here: https://github.com/bottlerocket-os/bottlerocket/blob/0d6b452690a1944f1362be7ec90a83a90753a7a3/sources/host-ctr/cmd/host-ctr/main.go#L371 which seems to be reflected in the host container logs as well.
I see the "failed to delete container task. error: failed to delete task: context deadline exceeded: unknown" error appear exactly 10 seconds after the "container task exited" log message.
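For anyone following along, the fixed 10-second gap is what you would expect from a Go context with a timeout wrapped around a call that never returns. Here is a minimal sketch of that pattern. It is not the host-ctr code: deleteTask and deleteWithTimeout are hypothetical stand-ins, and the timeout is scaled down to 100ms so it runs quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// deleteTask is a hypothetical stand-in for a runtime call that removes a
// task. It needs `work` to finish and gives up early if ctx ends first.
func deleteTask(ctx context.Context, work time.Duration) error {
	select {
	case <-time.After(work):
		return nil
	case <-ctx.Done():
		return fmt.Errorf("failed to delete task: %w", ctx.Err())
	}
}

// deleteWithTimeout bounds deleteTask with a deadline. When the delete is
// stuck, the error always surfaces exactly `timeout` after the call starts.
func deleteWithTimeout(timeout, work time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return deleteTask(ctx, work)
}

func main() {
	// Assumption for the demo: 100ms stands in for the 10-second timeout,
	// and the delete would need a full second to complete.
	err := deleteWithTimeout(100*time.Millisecond, time.Second)
	fmt.Println(err)
	fmt.Println("deadline exceeded:", errors.Is(err, context.DeadlineExceeded))
}
```

The point of the sketch is only that the error timing is set by the deadline, not by whatever is keeping the task alive, so the log timestamp alone does not say why the delete is stuck.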
For scenarios like this one, would it be a good idea to add a gracePeriod when the context deadline is exceeded, or to otherwise ensure the task is force-cleaned?
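To make the force-clean idea concrete, here is a rough sketch of the escalation I have in mind: try a normal delete, and only if it hits the deadline, force-kill the task and delete again with a fresh deadline. The task interface, cleanup, and stuckTask are all hypothetical names for illustration. This is not the containerd client API and not how host-ctr behaves today.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// task is a hypothetical, minimal view of a container task. It is not the
// containerd client interface.
type task interface {
	Delete(ctx context.Context) error
	ForceKill(ctx context.Context) error
}

// cleanup tries a normal delete first. If that hits the deadline, it
// force-kills the task and retries the delete once with a fresh deadline.
func cleanup(t task, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	err := t.Delete(ctx)
	cancel()
	if !errors.Is(err, context.DeadlineExceeded) {
		return err // success, or an error that force-killing would not fix
	}

	ctx, cancel = context.WithTimeout(context.Background(), timeout)
	defer cancel()
	if err := t.ForceKill(ctx); err != nil {
		return fmt.Errorf("force kill after delete timeout: %w", err)
	}
	return t.Delete(ctx)
}

// stuckTask simulates a task whose delete blocks until it has been killed.
type stuckTask struct{ killed bool }

func (s *stuckTask) Delete(ctx context.Context) error {
	if s.killed {
		return nil
	}
	<-ctx.Done() // block until the caller's deadline expires
	return ctx.Err()
}

func (s *stuckTask) ForceKill(ctx context.Context) error {
	s.killed = true
	return nil
}

func main() {
	st := &stuckTask{}
	// Assumption for the demo: 50ms stands in for the real timeout.
	fmt.Println("cleanup error:", cleanup(st, 50*time.Millisecond))
}
```

The second deadline is deliberately separate from the first, since reusing an already-expired context would make the force-kill fail immediately.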
Also, have we considered parameterizing the retries and the 45-second delays? It was probably looked into but not implemented. https://github.com/bottlerocket-os/bottlerocket/issues/1430.
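By parameterizing I mean something along these lines: the attempt count and the delay come from configuration instead of being fixed. This is only a sketch under my own assumptions. retryPolicy, MaxAttempts, and Delay are hypothetical names, not existing host-ctr settings, and the delay here is milliseconds so the example runs quickly.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTaskExists is a placeholder error used to simulate a failed start.
var errTaskExists = errors.New("task already exists")

// retryPolicy holds hypothetical, user-tunable knobs. These are not
// existing host-ctr settings.
type retryPolicy struct {
	MaxAttempts int
	Delay       time.Duration
}

// retry runs op until it succeeds or MaxAttempts is reached, sleeping Delay
// between attempts. It returns the number of attempts made and the last error.
func retry(p retryPolicy, op func() error) (int, error) {
	if p.MaxAttempts < 1 {
		p.MaxAttempts = 1 // always try at least once
	}
	var err error
	for attempt := 1; attempt <= p.MaxAttempts; attempt++ {
		if err = op(); err == nil {
			return attempt, nil
		}
		if attempt < p.MaxAttempts {
			time.Sleep(p.Delay)
		}
	}
	return p.MaxAttempts, fmt.Errorf("gave up after %d attempts: %w", p.MaxAttempts, err)
}

func main() {
	calls := 0
	// flaky fails twice, then succeeds.
	flaky := func() error {
		calls++
		if calls < 3 {
			return errTaskExists
		}
		return nil
	}
	attempts, err := retry(retryPolicy{MaxAttempts: 5, Delay: 10 * time.Millisecond}, flaky)
	fmt.Println("attempts:", attempts, "error:", err)
}
```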
Hi. host-ctr was not intended to have complex orchestration strategies and parameters. It's recommended to use an orchestrator to enable those features. Can you share more detailed logs around the container tasks that you are seeing? Also:
Please see below:
What is the size of this container image? - 220 MB
What is the expected time for the task being executed on this container? - It executes 3 apiclient commands: update check, get and set version-lock. It takes about 4 seconds for a successful run. Failures start after it's unable to clean up the task and container in the allotted 10 seconds.
Do you see issues with the admin and control container provided by Bottlerocket? - Host Container Update Method being used: https://bottlerocket.dev/en/os/1.19.x/update/methods/in-place/
Do you loop after you are done with your apiclient commands? Otherwise, the container will exit and systemd (which is what we use to execute host-ctr) will try to restart the process since it exited.
Is this something you want to keep running every time? Why not set all these configurations through user data, or even a bootstrap container with mode = once?
Image I'm using: Bottlerocket K8s 1.29 VERSION_ID: 1.19.4 Build_ID=4f0a078e
What I expected to happen: Expect container task to start every iteration.
What actually happened: Container task fails to start after it encounters failure.
How to reproduce the problem:
I can see the container task start when a new node comes up. During its regular run, it fails to start again and complains that the task already exists. This appears to happen when the previous deletion fails. In the logs, I can see the deletion of the task fail with a context deadline exceeded error:
level=error msg="failed to delete container task" error="failed to delete task: context deadline exceeded: unknown"
level=error msg="failed to cleanup container" error="cannot delete running task taskname: failed precondition"
Subsequent runs are unable to get the task started as it seems to exist already.
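The sequence above can be modeled in a few lines. This is a simulation only: fakeRuntime is a hypothetical in-memory stand-in, not containerd, and startFresh is one possible workaround I am assuming for illustration (remove a leftover task with the same name before creating a new one). In the real case the leftover task is reported as still running, so it would likely need to be killed before it can be deleted.

```go
package main

import (
	"errors"
	"fmt"
)

var errAlreadyExists = errors.New("task already exists")

// fakeRuntime is a hypothetical in-memory stand-in for a container runtime
// that tracks tasks by name.
type fakeRuntime struct{ tasks map[string]bool }

// NewTask registers a task and fails if one with the same name is present.
func (r *fakeRuntime) NewTask(name string) error {
	if r.tasks[name] {
		return fmt.Errorf("%s: %w", name, errAlreadyExists)
	}
	r.tasks[name] = true
	return nil
}

// DeleteTask removes a task by name. Deleting a missing task is a no-op.
func (r *fakeRuntime) DeleteTask(name string) { delete(r.tasks, name) }

// startFresh removes any leftover task with the same name before creating a
// new one, so an earlier failed cleanup does not block this run.
func startFresh(r *fakeRuntime, name string) error {
	err := r.NewTask(name)
	if errors.Is(err, errAlreadyExists) {
		r.DeleteTask(name)
		return r.NewTask(name)
	}
	return err
}

func main() {
	// Simulate a task left behind by a failed deletion.
	r := &fakeRuntime{tasks: map[string]bool{"taskname": true}}
	fmt.Println("plain start:", r.NewTask("taskname"))
	fmt.Println("start with stale cleanup:", startFresh(r, "taskname"))
}
```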