bottlerocket-os / bottlerocket

An operating system designed for hosting containers
https://bottlerocket.dev

Enable baseline logging and metrics collection #2747

Open stmcginnis opened 1 year ago

stmcginnis commented 1 year ago

Problem Statement

A frequent request from Bottlerocket users has been to have an easy mechanism to ship logs for the Bottlerocket host to some form of off-host log collector. To help address compliance requirements, users need to collect system and application logs off-box for capturing historical data and for auditing. Desired requirements are:

Somewhat related to this, users also need the ability to collect system metrics from Bottlerocket hosts. Usage information such as CPU, memory, and disk activity is often useful when correlated with log data to get a full picture of what has happened. It is also needed for ongoing system monitoring and being able to watch for historical trends and unusual spikes.

Both of these requirements can be met today by building a privileged container to run on the system. This demands a level of technical expertise that some operators may not have. It also adds operational overhead that many users will not be willing to take on, leading to resistance to adopting Bottlerocket as their cluster host OS.

While the two user requirements are not strictly related, it is possible to address both needs in one feature update to Bottlerocket. Therefore they are being evaluated together here.

Requirements

Non-Goals

As mentioned in the problem statement above, the request is to have all privileged events logged. This is not fully in place for Bottlerocket.

The current effort described in this document is to enable the log export part of that requirement. A separate, and likely ongoing, effort will be needed to ensure all privileged events in the Bottlerocket API are logged while also making sure no sensitive details like passwords are included. Part of this is currently in logdog, but would need to be extended.

General

Log Shipping

Metrics Reporting

Metrics should capture basic host system information:

Potential Solutions

Fluent-bit

Fluent-bit is a popular and flexible logging and metrics package that can be configured to provide a wide variety of input and output options.

Its main downside is that with many different plugins come many different configuration options. If Bottlerocket were to abstract logging and metrics processing so that it is not Fluent-bit specific, it would need to find a common, simple set of configuration options that could be translated into Fluent-bit’s plugin configuration.

This abstraction could be driven by the choice of log shipping support. S3 could be used as a target destination as a simple way to get logs off of the box since there are several options besides AWS S3 that provide an S3 interface. Collecting logs in an S3 bucket may be too simplistic to be useful for most users though, and would require additional work to then process those logs from S3 into a logging system they could use. This just shifts the complexity of running a log shipping container to requiring additional off-box processing.

Syslog has been used for a very long time, and has some well established software packages that could be used to collect the logs.

Elasticsearch or OpenSearch are more featureful options. Alternatively, TCP or HTTP endpoints could be exposed to allow scraping information, though this also requires additional work on the user’s end and raises some security concerns.

Fluent-bit has a very robust set of plugins, so the challenge may be to figure out which ones make the most sense for Bottlerocket and would be the most useful for end users.
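To make the abstraction idea above more concrete, here is a minimal sketch of how a small, generic set of options might be translated into Fluent Bit's classic-mode configuration. The settings schema (`destination`, `bucket`, `region`, `host`, `port`, `mode`) and the function names are hypothetical and not part of Bottlerocket. The sketch assumes Fluent Bit's `systemd` input plugin and its `s3` and `syslog` output plugins.

```python
# Hypothetical sketch: translate a small, generic settings dict into a
# Fluent Bit classic-mode configuration. The settings schema is invented
# for illustration and is not an existing Bottlerocket API.


def render_section(name, pairs):
    """Render one classic-mode section such as [INPUT] or [OUTPUT]."""
    lines = [f"[{name}]"]
    lines += [f"    {key} {value}" for key, value in pairs]
    return "\n".join(lines)


def render_fluent_bit_config(settings):
    """Map generic settings onto the systemd input and a single output."""
    sections = [render_section("INPUT", [("Name", "systemd"), ("Tag", "host.*")])]

    destination = settings["destination"]
    if destination == "s3":
        output = [
            ("Name", "s3"),
            ("Match", "*"),
            ("bucket", settings["bucket"]),
            ("region", settings["region"]),
        ]
    elif destination == "syslog":
        output = [
            ("Name", "syslog"),
            ("Match", "*"),
            ("host", settings["host"]),
            ("port", settings["port"]),
            ("mode", settings.get("mode", "udp")),
        ]
    else:
        raise ValueError(f"unsupported destination: {destination}")

    sections.append(render_section("OUTPUT", output))
    return "\n\n".join(sections) + "\n"


# Example values are placeholders.
example = {"destination": "s3", "bucket": "example-log-bucket", "region": "us-west-2"}
print(render_fluent_bit_config(example))
```

The point of the sketch is that each supported destination needs its own mapping, so the number of destinations Bottlerocket chooses to support directly determines how large the abstraction layer becomes.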

Pros

Cons

Prometheus Node Exporter

Prometheus is a very popular open source monitoring and alerting tool. Metrics can be exported to Prometheus from Linux hosts using the Prometheus Node Exporter (https://github.com/prometheus/node_exporter). The tool is written in Go and would be fairly easy to add as a new package to Bottlerocket.

This could be used as a metrics export option in combination with a log shipping package. The downside is it would tie this capability tightly to Prometheus. That may or may not be what Bottlerocket users would want.
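For a sense of what consuming Node Exporter output involves, here is a minimal sketch that parses a few lines in the Prometheus text exposition format, which Node Exporter serves over HTTP (by default on port 9100 at `/metrics`). The sample values are made up, and the parser is deliberately simplified: it does not handle escaped label values, and it skips lines that carry a trailing timestamp.

```python
import re

# Sample lines in the Prometheus text exposition format. Values are made up.
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
node_cpu_seconds_total{cpu="0",mode="user"} 67.8
"""

# metric_name, optional {labels}, whitespace, value
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

Anything other than a Prometheus-compatible scraper would need a translation step like this, which illustrates the coupling concern mentioned above.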

Pros

Cons

AWS CloudWatch Agent

Many hosts running as AWS instances use the AWS CloudWatch agent to send logs and metrics to the AWS CloudWatch service. This is a package that can be installed via most package managers on Amazon Linux, Red Hat, Ubuntu, and other Linux platforms. It would provide a very tight integration, with information that existing AWS users might expect when migrating from Amazon Linux to Bottlerocket.

The downside is that this is a very AWS-centric solution. If a user were running Bottlerocket outside of AWS, they would either need to have an AWS account to use CloudWatch, or they would need to go the existing route of deploying their own custom logging and metrics solution in a host container.

So not ideal, but this would be a valid option for a large number of users.

Pros

Cons

Logstash

Logstash has long been used in the ELK Stack (Elasticsearch, Logstash, and Kibana) for monitoring and visualization of events. It has a lot of history, and its range of features makes it flexible enough to meet a lot of different needs.

The main drawback, and perhaps the non-starter for Bottlerocket inclusion, is that it is no longer open source. Parts of Logstash are licensed under Apache 2.0, but parts are under the Elastic License. It may be possible to use only the parts under the permissive Apache 2.0 license, but given past history there is a risk that the license scope could change.

Pros

Cons

Commercial Alternatives

Assuming open source agents for other commercial eventing products are available, they could potentially be included in Bottlerocket. Unless several options are included, whatever solution is chosen would likely be the preferred choice for only a limited subset of Bottlerocket users.

Pros

Cons

Proposed Solution

As a preface to the proposed solution, two key data points should be kept in mind:

Given these, the proposed solution, for now, would be to add the AWS CloudWatch agent to the Bottlerocket image.

This package will be added to all aws-* variants. It would be possible to add it to other variants, but it will not be included there until user feedback indicates that this is desired.

Even with the addition of an AWS CloudWatch agent, the project should continue to solicit feedback from the community to determine if there is another, more general, solution that could be added to address the needs of those that do not wish to use an AWS service and prefer not to create their own custom host container.

Additionally, though users can deploy their own host container log processing solution today, there is little to no documentation to help guide them through how to do so. A new project website is under development in the bottlerocket-os/project-website repo. An issue should be filed there to add this documentation to make it easier for users to understand what would be required should they choose to go this route.

CloudWatch Agent Settings

The AWS CloudWatch agent will not be enabled by default. A new setting would be exposed so that users who choose to use CloudWatch can turn it on:

[aws.cloudwatch]
enabled = true

The initial implementation will default to the AWS region from the instance metadata (when run in the AWS cloud) or to the configured AWS information (based on the settings.aws.config information). All other agent configuration settings will be left at their defaults.
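The opt-in and region fallback described here can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the settings are represented as an already-parsed dictionary mirroring the `[aws.cloudwatch]` table, and preferring the instance metadata region over the configured one is an assumed ordering rather than a confirmed implementation detail.

```python
# Hypothetical sketch of the opt-in check and region fallback.
# Names and precedence are assumptions, not Bottlerocket's implementation.


def cloudwatch_enabled(settings):
    """Return True only if aws.cloudwatch.enabled is explicitly true."""
    return settings.get("aws", {}).get("cloudwatch", {}).get("enabled", False) is True


def resolve_region(imds_region, configured_region):
    """Prefer the instance metadata region, then the configured one (assumed order)."""
    region = imds_region or configured_region
    if not region:
        raise ValueError("no AWS region available from instance metadata or configuration")
    return region


# Parsed form of:
#   [aws.cloudwatch]
#   enabled = true
settings = {"aws": {"cloudwatch": {"enabled": True}}}

if cloudwatch_enabled(settings):
    # Outside AWS there is no instance metadata, so the configured region is used.
    print(resolve_region(imds_region=None, configured_region="us-west-2"))
```

Keeping the agent disabled unless the setting is explicitly true matches the stated intent that CloudWatch is opt-in.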

Based on user feedback once this functionality is available, additional configuration settings for the agent will be considered.

Action Plan

webern commented 1 year ago

Does it need to be added to the host? Can't it run as a container workload, e.g. Kubernetes daemonset?

stmcginnis commented 1 year ago

> Does it need to be added to the host? Can't it run as a container workload, e.g. Kubernetes daemonset?

Great question @webern, and this is being evaluated right now. I've updated the description to reflect that this is an idea that is under consideration. After exploring the needs a bit more I will update this issue with a full proposal of what use cases we would want to address and how we could go about doing it.

stmcginnis commented 1 year ago

Updated to include the current thought process and plan forward.

Definitely would like feedback from folks on this, especially with regards to next steps in getting a more general solution in place that would be useful.

mballoni commented 1 year ago

We are planning some monitoring of our EKS cluster health, and one scenario we faced recently was new nodes not being able to join the cluster. There are many possible root causes (ours was DNS, a change we were testing), but irrespective of the root cause we would like to be alerted when a node is having trouble communicating with the control plane.

Having said that, having logs/metrics when this kind of trouble appears (kubelet communication issue?) would also help immensely.

Is this within the scope of this issue?

mikn commented 10 months ago

Additional input (if you are still looking for that) is that we are looking at using systemd-journald together with its remote capabilities to collect audit events on the systemd-journald-audit.socket on the host level to build a low-dependency and robust audit logging system. If it was possible to forward audit-class events from the API server to the audit socket in journald, that would be a great feature.