bottlerocket-os / bottlerocket

An operating system designed for hosting containers
https://bottlerocket.dev

Provide visibility into Bootstrap Container behaviors - exit status, time, etc? #3811

Open diranged opened 7 months ago

diranged commented 7 months ago

What I'd like: We use a few bootstrap containers on startup - some of them label hosts, others handle custom max-pods calculations, etc. Because booting new hosts is more important to us than the occasional host that might boot "incorrectly configured", we choose to mark these as essential=false to ensure that we are never blocked in booting new capacity. (This decision has saved us many outages).

The thing is ... once your host is booted, you have no idea whether or not the Bootstrap scripts worked. You can scroll through the journald logs, but that's it. You don't know how long a host waited to execute a script, how long it took to pull down an image, or what the exit codes were.

We want to keep track of the number of Bootstrap Containers that start up and fail so that we can alert on that, but not block the booting process. In an ideal world, we would also have some method for getting metrics on how long it took these containers to run, which would help us optimize our new-host boot time (but that's really for extra credit).

Preferred Behavior

When I think about how to approach this - I feel like the most natural thing is for each Bootstrap Container to become a "condition" on the node - so that a simple kubectl describe node ... will get you information on it. From there, metrics can be collected about which nodes have which conditions on them, and teams can develop any alerting or behaviors they need.
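To make the idea concrete, here is a minimal sketch of what a per-container condition could look like. It only builds the data structure; it does not talk to the Kubernetes API. The field names follow the Kubernetes NodeCondition schema, but the `BootstrapContainer<Name>Failed` type, the reason strings, and the `bootstrap_condition` helper are hypothetical names chosen for illustration, not anything Bottlerocket provides today.

```python
from datetime import datetime, timezone


def bootstrap_condition(name, exit_code, finished_at=None):
    """Build a NodeCondition-shaped dict for one bootstrap container.

    Assumption: the condition type and reason strings are made up for this
    sketch. Only the keys mirror the Kubernetes NodeCondition fields.
    """
    finished_at = finished_at or datetime.now(timezone.utc)
    # Kubernetes timestamps are RFC 3339 strings in UTC.
    stamp = finished_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    # "max-pods" -> "MaxPods", so the type reads as a single CamelCase word.
    camel = "".join(part.capitalize() for part in name.split("-"))
    failed = exit_code != 0
    return {
        "type": f"BootstrapContainer{camel}Failed",
        "status": "True" if failed else "False",
        "reason": "ContainerExitedNonZero" if failed else "ContainerSucceeded",
        "message": f"bootstrap container {name!r} exited with status {exit_code}",
        "lastHeartbeatTime": stamp,
        "lastTransitionTime": stamp,
    }
```

With one condition per container, a non-zero exit shows up as `status: "True"` in `kubectl describe node ...`, and a metrics pipeline can count nodes by condition type without parsing logs.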

Any alternatives you've considered:

We first went down the path of trying to use the Node Problem Detector with the configuration below, but discovered that it really only tails logs from the moment it starts up, so it cannot react to logs that existed before it came up. Therefore it cannot have visibility into the Bootstrap Containers.

      bottlerocket-bootstrap-containers.json: |
        {
          "plugin": "custom",
          "pluginConfig": {
            "invoke_interval": "5m",
            "timeout": "1m",
            "max_output_length": 80,
            "concurrency": 1
          },
          "source": "bottlerocket-bootstrap-containers",
          "conditions": [
            {
              "type": "BootstrapContainerFail",
              "reason": "NoFailure",
              "message": "Bootstrap Containers started successfully"
            }
          ],
          "rules": [
            {
              "type": "permanent",
              "condition": "BootstrapContainerFail",
              "reason": "ContainerStartFailure",
              "path": "/home/kubernetes/bin/log-counter",
              "timeout": "3m",
              "args": [
                "--journald-source=systemd",
                "--log-path=/var/log/journal",
                "--lookback=20m",
                "--delay=5m",
                "--count=5",
                "--pattern=Failed to start bootstrap container.*"
              ]
            }
          ]
        }
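For comparison, the check we actually need looks at the journal for the whole boot rather than from the moment a monitor starts. A rough sketch, assuming the input is the line-per-entry JSON produced by `journalctl -o json` and reusing the pattern from the config above (the `count_bootstrap_failures` helper and its `since_usec` parameter are hypothetical names, not part of the Node Problem Detector):

```python
import json
import re

# Same pattern as the log-counter rule in the config above.
PATTERN = re.compile(r"Failed to start bootstrap container.*")


def count_bootstrap_failures(journal_lines, since_usec=0):
    """Count journal entries matching PATTERN at or after since_usec.

    Assumption: each line is one JSON object as emitted by
    `journalctl -o json`. since_usec=0 means "scan everything given".
    """
    count = 0
    for line in journal_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        message = entry.get("MESSAGE")
        if not isinstance(message, str):
            # journald serializes non-UTF-8 messages as byte arrays; skip them.
            continue
        # __REALTIME_TIMESTAMP is microseconds since the epoch, as a string.
        if int(entry.get("__REALTIME_TIMESTAMP", 0)) < since_usec:
            continue
        if PATTERN.search(message):
            count += 1
    return count
```

Something like this could be fed from `journalctl -b -o json` after boot, but that is still scraping logs, which is why a first-class signal from the OS would be preferable.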
yeazelm commented 7 months ago

Thanks for cutting this issue @diranged. I think there are some useful features that could be added to bootstrap containers. I haven't looked deeply at conditions, but is the expectation that there would be one for each bootstrap container, regardless of whether it is marked essential? I think metrics about success, time, and logging output all seem like reasonable things as well. We'll take that as a feature request to enhance the observability of bootstrap containers.

diranged commented 7 months ago

Just off the top of my head - I'd like to see a condition per Bootstrap Container. I could be convinced otherwise though ... but that seems cleanest to me.