Structured log messages

amirbenun commented 7 months ago

Motivation

Every log Cloudbeat outputs is a JSON object in a .ndjson file. On managed elastic stacks, filebeat reads these log files and indexes them into elasticsearch, this data can be the source of interesting aggregations which will later be used for visualizations and alerts to allow proactive monitoring.

One issue we currently have is that many of the logs are formatted, meaning parameters are part of the log message instead of introducing them in a dedicated field of this json object. This change is effectively done by changing the logging calls from: log.Errorf("Could not get encryption for bucket %s. Error: %v", *bucket.Name, encryptionErr) into: log.Errorw("Could not get encryption for S3 bucket", "bucket", *bucket.Name, "error", encryptionErr)

This change brings another challenge in the area of naming the new fields. With this change "bucket" and "error" are now keys in the json object that indexed to ES and therefore should comply with ECS. There is not an easy way to name those fields, we need to individually assess every log and find the best matching ECS fields that represent its parameters. In the example about, a better representation will be: log.Errorw("Could not get encryption for S3 bucket", "aws.s3.bucket.name", *bucket.Name, "error.message", encryptionErr)

Definition of done

[ ] Refactor cloudbeat to have structured logs instead of formatted logs where possible

Related tasks/epics

-

orestisfl commented 7 months ago

log.Errorw("Could not get encryption for S3 bucket", "bucket", *bucket.Name, "error", encryptionErr)

Just a note, it's easy to make mistakes when using this kind of syntax, e.g. log.Errorw("Could not get encryption for S3 bucket", *bucket.Name, "error", encryptionErr)

Having linters for that can be beneficial even though I am not sure they can catch 100% of the potential mistakes.

oren-zohar commented 7 months ago

Thanks @amirbenun, I think it's crucial to consider that the current log format in beats is designed to align with the Fleet UI's display requirements. Altering this structure might lead to compatibility issues with the Fleet UI.

To achieve our objective of categorizing error messages into different types for effective monitoring - distinguishing critical issues (like high resource usage, restarts, memory problems, etc.) from expected errors (such as permission issues or missing resources) - I recommend incorporating an 'error type' field. This aligns with the ECS error schema, which can provide a standardized approach for classifying and analyzing errors. This addition should allow us to maintain the existing log structure while enhancing our logging capabilities for more targeted and efficient troubleshooting and monitoring.

I think we should also explore how other teams handle this issue? Do other components in elastic track error types as well?

orestisfl commented 7 months ago

To achieve our objective of categorizing error messages into different types for effective monitoring - distinguishing critical issues (like high resource usage, restarts, memory problems, etc.) from expected errors (such as permission issues or missing resources) - I recommend incorporating an 'error type' field. This aligns with the ECS error schema, which can provide a standardized approach for classifying and analyzing errors. This addition should allow us to maintain the existing log structure while enhancing our logging capabilities for more targeted and efficient troubleshooting and monitoring.

An error type field is probably not enough, for example if we want to filter logs by aws.s3.bucket.name.

How about we include both the "free-form" strings and the structured fields in the log message? So a combination of

 log.Errorf("Could not get encryption for bucket %s. Error: %v", *bucket.Name, encryptionErr)

and

 log.Errorw("Could not get encryption for S3 bucket", "aws.s3.bucket.name", *bucket.Name, "error.message", encryptionErr)

Here's an easy way to do it using Amir's example in the code (code will need additional changes for having an error field as well):

diff --git a/resources/providers/awslib/s3/provider.go b/resources/providers/awslib/s3/provider.go
index add35694..f54d5b97 100644
--- a/resources/providers/awslib/s3/provider.go
+++ b/resources/providers/awslib/s3/provider.go
@@ -79,26 +79,28 @@ func (p Provider) DescribeBuckets(ctx context.Context) ([]awslib.AwsResource, er
    bucketsRegionsMapping := p.getBucketsRegionMapping(ctx, clientBuckets.Buckets)
    for region, buckets := range bucketsRegionsMapping {
        for _, bucket := range buckets {
+           logger := p.log.With("aws.s3.bucket.name", *bucket.Name, "cloud.region", region)
+
            // Getting the bucket encryption, policy, versioning  and public access block is not critical for the rest
            //  of the flow, so we should keep describing the bucket even if getting these objects fails.
            sseAlgorithm, encryptionErr := p.getBucketEncryptionAlgorithm(ctx, bucket.Name, region)
            if encryptionErr != nil {
-               p.log.Errorf("Could not get encryption for bucket %s. Error: %v", *bucket.Name, encryptionErr)
+               logger.Errorf("Could not get encryption for bucket %s. Error: %v", *bucket.Name, encryptionErr)
            }

            bucketPolicy, policyErr := p.GetBucketPolicy(ctx, bucket.Name, region)
            if policyErr != nil {
-               p.log.Errorf("Could not get bucket policy for bucket %s. Error: %v", *bucket.Name, policyErr)
+               logger.Errorf("Could not get bucket policy for bucket %s. Error: %v", *bucket.Name, policyErr)
            }

            bucketVersioning, versioningErr := p.getBucketVersioning(ctx, bucket.Name, region)
            if versioningErr != nil {
-               p.log.Errorf("Could not get bucket versioning for bucket %s. Err: %v", *bucket.Name, versioningErr)
+               logger.Errorf("Could not get bucket versioning for bucket %s. Err: %v", *bucket.Name, versioningErr)
            }

            publicAccessBlockConfiguration, publicAccessBlockErr := p.getPublicAccessBlock(ctx, bucket.Name, region)
            if publicAccessBlockErr != nil {
-               p.log.Errorf("Could not get public access block configuration for bucket %s. Err: %v", *bucket.Name, publicAccessBlockErr)
+               logger.Errorf("Could not get public access block configuration for bucket %s. Err: %v", *bucket.Name, publicAccessBlockErr)
            }

            result = append(result, BucketDescription{

For reference:

// With creates a child logger and adds structured context to it. Fields added
// to the child don't affect the parent, and vice versa.
func (l *Logger) With(args ...interface{}) *Logger {

orestisfl commented 7 months ago

Or perhaps even:

diff --git a/resources/providers/awslib/s3/provider.go b/resources/providers/awslib/s3/provider.go
index add35694..021078bd 100644
--- a/resources/providers/awslib/s3/provider.go
+++ b/resources/providers/awslib/s3/provider.go
@@ -79,26 +79,32 @@ func (p Provider) DescribeBuckets(ctx context.Context) ([]awslib.AwsResource, er
    bucketsRegionsMapping := p.getBucketsRegionMapping(ctx, clientBuckets.Buckets)
    for region, buckets := range bucketsRegionsMapping {
        for _, bucket := range buckets {
+           eLog := func(format string, err error) {
+               p.log.
+                   With("aws.s3.bucket.name", *bucket.Name, "cloud.region", region, "error.message", err).
+                   Errorf(format, *bucket.Name, err)
+           }
+
            // Getting the bucket encryption, policy, versioning  and public access block is not critical for the rest
            //  of the flow, so we should keep describing the bucket even if getting these objects fails.
            sseAlgorithm, encryptionErr := p.getBucketEncryptionAlgorithm(ctx, bucket.Name, region)
            if encryptionErr != nil {
-               p.log.Errorf("Could not get encryption for bucket %s. Error: %v", *bucket.Name, encryptionErr)
+               eLog("Could not get encryption for bucket %s. Error: %v", encryptionErr)
            }

            bucketPolicy, policyErr := p.GetBucketPolicy(ctx, bucket.Name, region)
            if policyErr != nil {
-               p.log.Errorf("Could not get bucket policy for bucket %s. Error: %v", *bucket.Name, policyErr)
+               eLog("Could not get bucket policy for bucket %s. Error: %v", policyErr)
            }

            bucketVersioning, versioningErr := p.getBucketVersioning(ctx, bucket.Name, region)
            if versioningErr != nil {
-               p.log.Errorf("Could not get bucket versioning for bucket %s. Err: %v", *bucket.Name, versioningErr)
+               eLog("Could not get bucket versioning for bucket %s. Err: %v", versioningErr)
            }

            publicAccessBlockConfiguration, publicAccessBlockErr := p.getPublicAccessBlock(ctx, bucket.Name, region)
            if publicAccessBlockErr != nil {
-               p.log.Errorf("Could not get public access block configuration for bucket %s. Err: %v", *bucket.Name, publicAccessBlockErr)
+               eLog("Could not get public access block configuration for bucket %s. Err: %v", publicAccessBlockErr)
            }

            result = append(result, BucketDescription{

amirbenun commented 7 months ago

I like both of the ideas, one disadvantage of changing the original error message is that the fleet UI doesn't show any parameters besides the message. Agreeing on that, any relevant ECS compliant context that will be added on log messages can serve us.

elastic / cloudbeat