Enable baseline logging and metrics collection

stmcginnis commented 1 year ago

Problem Statement

A frequent request from Bottlerocket users has been to have an easy mechanism to ship logs for the Bottlerocket host to some form of off-host log collector. To help address compliance requirements, users need to collect system and application logs off-box for capturing historical data and for auditing. Desired requirements are:

Make sure the system logs privileged events of interest (such as API settings changes)
Make sure that system logs are sent somewhere - all logs or minimally the privileged events
Make sure customers can apply filters and alerts around log events, so that they can detect unauthorized changes

Somewhat related to this, users also need the ability to collect system metrics from Bottlerocket hosts. Usage information such as CPU, memory, and disk activity is often useful when correlated with log data to get a full picture of what has happened. It is also needed for ongoing system monitoring and being able to watch for historical trends and unusual spikes.

Both of these requirements can be met today by building a privileged container to run on the system. This demands a technical level of operators that some users may not have. It also requires a level of operational overhead that many users will not be willing to take on, leading to resistance to adopting Bottlerocket as their cluster host OS.

While the two user requirements are not strictly related, it is possible to address both needs in one feature update to Bottlerocket. Therefore they are being evaluated together here.

Requirements

Non-Goals

As mentioned in the problem statement above, the request is to have all privileged events logged. This is not fully in place for Bottlerocket.

The current effort described in this document is to enable the log export part of that requirement. A separate, and likely ongoing, effort will be needed to ensure all privileged events in the Bottlerocket API are logged while also making sure no sensitive details like passwords are included. Part of this is currently in logdog, but would need to be extended.

General

Configuration should be allowed via user-data for out of the box enablement of functionality on initial boot.
Settings should be general enough to be applied at large scale across a large fleet of Bottlerocket nodes (support for GitOps style management and deployment).
The goal should be to provide a minimum viable in-the-box option for logging and metrics. Anything beyond the most basic scenarios should be continued to be supported by custom privileged containers.
Any included packages should be under an open source licenses recognized by the OSI.

Log Shipping

Logging must be accessible in a format that could be consumed by other logging and alerting software. This does not mean it needs to be in that native format, but that the format is able to be parsed and processed through some intermediary layer to allow it to be useful.
Logging implementation should not be an AWS-only solution. Long term plans for Bottlerocket are to allow it to grow for on-premises, edge, and non-AWS cloud usage. The choice of logging support must be flexible enough to be adopted in these environments.

Metrics Reporting

Metrics should capture basic host system information:

CPU usage
Memory usage
Disk activity
Network interface activity

Potential Solutions

Fluent-bit

Fluent-bit is a popular and flexible logging and metrics package that can be configured to provide a wide variety of input and output options.

Its main downside is that with many different plugins comes many different configuration options. If Bottlerocket were to abstract logging and metrics processing to be non-Fluent-bit specific, it would need to find a common, simple, set of configuration options that could be translated into fluent-bit’s plugin configuration.

This abstraction could be driven by the choice of log shipping support. S3 could be used as a target destination as a simple way to get logs off of the box since there are several options besides AWS S3 that provide an S3 interface. Collecting logs in an S3 bucket may be too simplistic to be useful for most users though, and would require additional work to then process those logs from S3 into a logging system they could use. This just shifts the complexity of running a log shipping container to requiring additional off-box processing.

Syslog has been used for a very long time, and has some well established software packages that could be used to collect the logs.

ElasticSearch or OpenSearch are more featureful options. Or TCP or HTTP endpoints could be exposed to allow scraping information - though this also requires additional work on the user’s end and opens up some concerns about security.

Fluent-bit has a very robust set of plugins, so the challenge may be to figure out which ones make the most sense for Bottlerocket and would be the most useful for end users.

Pros

Popular open source solution
Flexible log and metrics processing with many available plugins

Cons

Each output plugin requires very different config settings to configure
Metrics reporting limited to CPU and memory (at least for CloudWatch metrics)
- “Note: Right now, only cpu and mem metrics can be sent to CloudWatch.”
- Full metrics are available with node metrics exporter
Lowest common denominator logging target may be too simple to be useful (e.g. blobs in S3 bucket)
Requires additional host packages to run at the Bottlerocket host level (openssl at a minimum)

Prometheus Node Exporter

Prometheus is a very popular open source monitoring and alerting tool. Metrics can be exported to Prometheus from Linux hosts using the Prometheus Node Exporter (https://github.com/prometheus/node_exporter). The tool is written in Go and would be fairly easy to add as a new package to Bottlerocket.

This could be used as a metrics export option in combination with a log shipping package. The downside is it would tie this capability tightly to Prometheus. That may or may not be what Bottlerocket users would want.

Pros

Popular open source platform support

Cons

Only supports exporting metrics, would need separate solution for logs
Prometheus specific solution

AWS CloudWatch Agent

Many hosts running as an AWS instance use the AWS CloudWatch agent to send logs and metrics to the AWS CloudWatch service. This is a package that can be installed via most package managers on Amazon Linux, Red Hat, Ubuntu, other Linux platforms. It would provide a very tight integration, with information that existing AWS users might expect when migrating from Amazon Linux to Bottlerocket.

The downside is: this is a very AWS-centric solution. If a user was running Bottlerocket outside of AWS, they would either need to have an AWS account to use CloudWatch, or they would need to go the existing route of deploying their own custom logging and metrics solution in a host container.

So not ideal, but this would be a valid option for a large number of users.

Pros

Feature parity with Amazon Linux and other general distro hosts
Majority of Bottlerocket users are running on AWS, provides expected integration for these users
Full logs and metrics reporting
Open source package that can cleanly be added as a single package to distro
Can be used with metal and VMware variants, similar to things like IAM Roles Anywhere support

Cons

Not as flexible for non-AWS use cases

Logstash

Logstash has long been used in the ELK Stack (Elasticsearch, Logstash, and Kibana) for monitoring and visualization of events. It has a lot of history and its range of features make it very flexible to meet a lot of different needs.

The main drawback, and perhaps the non-starter for Bottlerocket inclusion, is it is no longer open source. Parts of Logstash are licensed under Apache 2.0, but parts are under the Elastic License. It is possible only the parts under the permissive Apache 2.0 license could be used, but there is the risk of the license scope changing based on past history.

Pros

Well-established log shipping solution for Linux

Cons

Not fully open source

Commercial Alternatives

Assuming open source agents for other commercial eventing products are available, they could potentially be included in Bottlerocket. Unless several options are included, it is likely whatever solution would be chosen would only be a preferred choice for a limited subset of Bottlerocket users.

Pros

Several well-integrated solutions with robust feature sets

Cons

Risk of commercial license infringement
Not a valid choice for users looking for a fully open source solution

Proposed Solution

As a preface to the proposed solution, two key data points should be kept in mind:

Users are able to deploy a host container today with a logging and metrics solution exactly tailored to their needs
The large number of Bottlerocket users are on the AWS cloud

Given these, the proposed solution, for now, would be to add the AWS CloudWatch agent to the Bottlerocket image.

This package will be added to all aws-* variants. It would be possible to add to other variants, but until there is user feedback that this would be desired it won't be included initially.

Even with the addition of an AWS CloudWatch agent, the project should continue to solicit feedback from the community to determine if there is another, more general, solution that could be added to address the needs of those that do not wish to use an AWS service and prefer not to create their own custom host container.

Additionally, though users can deploy their own host container log processing solution today, there is little to no documentation to help guide them through how to do so. A new project website is under development in the bottlerocket-os/project-website repo. An issue should be filed there to add this documentation to make it easier for users to understand what would be required should they choose to go this route.

CloudWatch Agent Settings

The AWS CloudWatch agent will not be enabled by default. If a user chooses to use CloudWatch, we would expose a new setting:

[aws.cloudwatch]
enabled = true

The initial implementation will default to AWS region based on the instance metadata (when run in the AWS cloud) or to the configured AWS information (based on the settings.aws.config information). All other agent configuration settings will be defaults.

Based on user feedback once this functionality is available, additional configuration settings for the agent will be considered.

Action Plan

File GitHub issue in the project-website repo to have clear instructions for deploying a custom host container for Fluent-bit that can be configured however the user requires (https://github.com/bottlerocket-os/project-website/issues/29)
Add settings and package for the AWS CloudWatch agent (#2434)
Continue to collect community feedback and requirements for a common, non-AWS logging and metrics solution
If/when support is added for other cloud providers, look at adding settings and package to enable those as well (e.g. Azure Monitor) for tight integration with the chosen cloud provider platform

webern commented 1 year ago

Does it need to be added to the host? Can't it run as a container workload, e.g. Kubernetes daemonset?

stmcginnis commented 1 year ago

Does it need to be added to the host? Can't it run as a container workload, e.g. Kubernetes daemonset?

Great question @webern, and this is being evaluated right now. I've updated the description to reflect that this is an idea that is under consideration. After exploring the needs a bit more I will update this issue with a full proposal of what use cases we would want to address and how we could go about doing it.

stmcginnis commented 1 year ago

Updated to included the current thought process and plan forward.

Definitely would like feedback from folks on this, especially with regards to next steps in getting a more general solution in place that would be useful.

mballoni commented 1 year ago

We are planning on some monitoring about our EKS health and one scenario that we faced recently was new nodes not being able to join the Cluster. There are many root causes (our was DNS, a change we were testing) however irrespective of the root cause we would like to be alerted when a node is having trouble to communicate with the control plane.

Having said that also having logs/metrics when this kind of trouble appears (kubelet communication issue?) would help immensely.

is it under this issue scope?

mikn commented 11 months ago

Additional input (if you are still looking for that) is that we are looking at using systemd-journald together with its remote capabilities to collect audit events on the systemd-journald-audit.socket on the host level to build a low-dependency and robust audit logging system. If it was possible to forward audit-class events from the API server to the audit socket in journald, that would be a great feature.

bottlerocket-os / bottlerocket