bottlerocket-os / bottlerocket

An operating system designed for hosting containers
https://bottlerocket.dev

Host Container Unable to Create Container Task #3970

Open gitskamzn opened 5 months ago

gitskamzn commented 5 months ago

Image I'm using: Bottlerocket K8s 1.29 VERSION_ID: 1.19.4 Build_ID=4f0a078e

What I expected to happen: The container task starts on every iteration.

What actually happened: The container task fails to start again after it encounters a failure.

How to reproduce the problem:

I can see the container task start when a new node comes up. During its regular run, it fails to start again and complains that the task already exists. This appears to happen when the previous deletion fails. In the logs, I can see the task deletion fail with a "context deadline exceeded" error.

level=error msg="failed to delete container task" error="failed to delete task: context deadline exceeded: unknown"
level=error msg="failed to cleanup container" error="cannot delete running task taskname: failed precondition"

Subsequent runs are unable to start the task because it still exists.

vigh-m commented 5 months ago

Hi @gitskamzn, Thanks for reaching out with this issue.

  1. Was this working on previous versions of Bottlerocket?
  2. Can you share some details of the container and instance type you are launching?
  3. Any data about how you are launching your containers?

gitskamzn commented 5 months ago

Hi @vigh-m - This is an ongoing issue that I see on instances running Bottlerocket OS versions 1.19.2 through 1.19.4. One of the instance types I see this on is m6i.4xlarge. These instances are part of EKS clusters that are launched using IaC. Containerd is enabled for these instances, and the host-containers settings are added as explained here: https://bottlerocket.dev/en/os/1.19.x/api/settings/host-containers/#container_source

I can see the 10-second timeout here: https://github.com/bottlerocket-os/bottlerocket/blob/0d6b452690a1944f1362be7ec90a83a90753a7a3/sources/host-ctr/cmd/host-ctr/main.go#L371 and it is reflected in the host container logs as well. The error "failed to delete container task" with "failed to delete task: context deadline exceeded: unknown" appears exactly 10 seconds after the "container task exited" log message.

gitskamzn commented 5 months ago

For scenarios like this one, would it be a good idea to add a gracePeriod when a context deadline is exceeded, or to otherwise ensure the task is force-cleaned? Also, have you considered parameterizing the retries and the 45-second delays? It was probably looked into but not implemented: https://github.com/bottlerocket-os/bottlerocket/issues/1430.

vigh-m commented 5 months ago

Hi, host-ctr was not intended to have complex orchestration strategies and parameters. We recommend using an orchestrator to get those features. Can you share more detailed logs around the container tasks that you are seeing? Also:

  1. What is the size of this container image?
  2. What is the expected time for the task being executed on this container?
  3. Do you see issues with the admin and control container provided by Bottlerocket?

gitskamzn commented 5 months ago

Please see below:

  1. What is the size of this container image? - 220 MB

  2. What is the expected time for the task being executed on this container? - It executes 3 apiclient commands: update check, get, and set version-lock. A successful run takes about 4 seconds. Failures start after it is unable to clean up the task and container within the allotted 10 seconds.

  3. Do you see issues with the admin and control container provided by Bottlerocket? - Host Container Update Method being used: https://bottlerocket.dev/en/os/1.19.x/update/methods/in-place/

arnaldo2792 commented 5 months ago

Do you loop after you are done with your apiclient commands? Otherwise, the container will exit and systemd (which is what we use to execute host-ctr) will try to restart the process since it exited.

Is this something you want to keep running every time? Why not set all these configurations through user data, or even use a bootstrap container with mode = once?