LeonAdato / xk6-output-statsd

k6 extension to output real-time test metrics using StatsD.
Apache License 2.0

Metrics not delivering to CloudWatch: The parameter ... must be shorter than 1025 characters. #8

Closed by AnrichVS 9 months ago

AnrichVS commented 9 months ago

Hi,

I'm using K6 with this extension to deliver metrics to AWS CloudWatch Metrics via the CloudWatch agent.

I noticed that many of the metrics aren't being delivered to CloudWatch (this is obvious when looking at the data in CloudWatch). I tested locally using Graphite, and there it is clear that all metrics are being delivered.

I inspected the logs on the CloudWatch agent, and found that many requests are failing with HTTP 400 Bad Request errors. The specific errors look as follows:

<ErrorResponse xmlns="http://monitoring.amazonaws.com/doc/2010-08-01/">
  <Error>
    <Type>Sender</Type>
    <Code>InvalidParameterValue</Code>
    <Message>The parameter MetricData.member.13.Dimensions.member.4.Value must be shorter than 1025 characters.</Message>
  </Error>
  <RequestId>12345</RequestId>
</ErrorResponse>

I realize that this is very specific to CloudWatch, but this extension is the officially recommended approach to deliver metrics to CloudWatch, and thus it feels most appropriate to report the issue here.

By setting the "aws_sdk_log_level" to "LogDebugWithHTTPBody" in the CloudWatch agent config, I am able to see the GZIP-compressed POST body content for the failing requests, but I haven't been able to successfully decompress any of these, so I can't see the offending metric value.

Any suggestions regarding how I can further debug this, or potentially solve the issue?

AnrichVS commented 9 months ago

After much frustration, I've figured out what the issue is. This problem is not related to this extension and thus I'm closing this issue. I will however describe the issue and solution below in case someone is struggling with the same thing.

Problem

The dimension value that exceeds the limit (MetricData.member.13.Dimensions.member.4.Value above) belongs to the name dimension. If the K6_STATSD_ENABLE_TAGS environment variable is set to "true", K6 adds a dimension for each system tag (and for all the tags you specified yourself).

The name tag contains the full URL of the request, including all GET parameters. In the application I'm load testing, the URL can be very long, and in this case it exceeded the limit given in the error message (the value must be shorter than 1025 characters). This causes relatively silent metric delivery failures to CloudWatch Metrics, resulting in gaps in the metrics.

Solution

You have a few options:

  1. Use the systemTags option to disable the name tag. I don't want to do this as I still want the URL as a dimension, but I'm not interested in the GET parameters.
  2. Override the name tag with the URL excluding the GET parameters. This is what I'm going to do.
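
Option 2 could look roughly like the sketch below. The helper name nameTagFor is hypothetical (it is not part of K6 or this extension), and the example URL is made up. The helper only uses plain string operations, so it should work inside a K6 script as well as in Node.js.

```javascript
// Sketch only: `nameTagFor` is a hypothetical helper, not part of K6 or this extension.
// It drops the query string and fragment so the `name` tag stays short.
function nameTagFor(url) {
  const cut = url.search(/[?#]/); // index of the first '?' or '#', or -1 if neither occurs
  return cut === -1 ? url : url.slice(0, cut);
}

// In a K6 script, the result would be passed as the `name` tag of the request:
//   import http from 'k6/http';
//   const url = 'https://example.com/search?q=shoes&page=2';
//   http.get(url, { tags: { name: nameTagFor(url) } });

console.log(nameTagFor('https://example.com/search?q=shoes&page=2#top'));
// prints "https://example.com/search"
```

Requests that differ only in their GET parameters then share one name value, which also keeps the number of distinct dimension values down.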

Investigation

Figuring this out was quite complex, which is unfortunate since a bit of extra log verbosity on the CloudWatch agent would've prevented hours and hours of wasted time.

Because of that, I'm going to describe how I found the cause, as the approach can be generalised to other similar issues.

Debug logs

The first step was to enable debug mode in the CloudWatch agent, as well as the most verbose log output possible. This was done with the following agent config:

  "agent": {
    "debug": true,
    "aws_sdk_log_level": "LogDebugWithHTTPBody|LogDebugWithRequestErrors"
  }

In addition to this, I also ran K6 with the --http-debug flag.

The HTTP 400 response from AWS CloudWatch Metrics was spotted in the CloudWatch agent logs. The issue is that the logged POST body is GZIP compressed. I tried to decompress it (using Python and ZLIB), but I couldn't get this working.
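
For anyone attempting the same thing, here is a minimal sketch of GZIP decompression, assuming Node.js and assuming the compressed body is available as raw bytes (for example, saved to a file) rather than as text copied out of a log, where binary data is often mangled. The function names and the sample body are made up for illustration. One possible cause of a failed attempt in Python is that zlib.decompress expects a zlib header by default and only accepts GZIP data when wbits is set accordingly (for example zlib.MAX_WBITS | 16); whether that was the problem here is not known.

```javascript
// Sketch, assuming Node.js and that the compressed body is available as raw bytes.
const zlib = require('zlib');

// GZIP streams start with the magic bytes 0x1f 0x8b.
function looksLikeGzip(buf) {
  return buf.length >= 2 && buf[0] === 0x1f && buf[1] === 0x8b;
}

// Returns the decompressed body as a UTF-8 string.
function gunzipBody(buf) {
  if (!looksLikeGzip(buf)) {
    throw new Error('not a GZIP stream (missing 0x1f 0x8b header)');
  }
  return zlib.gunzipSync(buf).toString('utf8');
}

// Round trip with a made-up body, standing in for a captured request.
const sample = zlib.gzipSync(Buffer.from('Action=PutMetricData&Version=2010-08-01'));
console.log(gunzipBody(sample));
```

The header check makes it obvious when the input was never valid GZIP to begin with, which is the likely outcome if the bytes were copied from a text log.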

Investigating the network traffic

Since the logs didn't contain enough usable information to find the issue, my next idea was to intercept the network traffic from the CloudWatch agent to inspect the API requests that are made, and specifically the PutMetricData request body.

Unfortunately I wasted quite some time on tools that weren't meant for this. I essentially used Docker to proxy the traffic originating from the CloudWatch agent container through an NGINX container, only to realize you can't log a request's POST body by default in NGINX. This led me down a whole rabbit hole that I won't detail here.

Luckily I discovered HTTP Toolkit. This was the right tool for the job. I used its Docker network interception functionality to inspect traffic on the CloudWatch agent container, and this is where I finally saw that the name dimension contained a very long URL, resulting in an HTTP 400 response from CloudWatch Metrics.
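
Once a request body has been captured, the search for the offending value can be automated. The sketch below assumes Node.js and assumes the decoded body is form-encoded (key=value pairs joined by '&'), which matches the parameter naming in the error message but should be checked against the actual capture. The function name and the sample body are made up for illustration.

```javascript
// Sketch, assuming the decompressed request body is form-encoded
// (key=value pairs joined by '&'); adjust if the captured body differs.
// The limit comes from the error message: values must be shorter than 1025 characters.
const MAX_VALUE_LENGTH = 1024;

// Returns [{ key, length }] for every parameter whose value is too long.
function findOversizedParams(body, limit = MAX_VALUE_LENGTH) {
  const oversized = [];
  for (const [key, value] of new URLSearchParams(body)) {
    if (value.length > limit) {
      oversized.push({ key, length: value.length });
    }
  }
  return oversized;
}

// Made-up body with one short and one over-long dimension value.
const longUrl = 'https://example.com/?q=' + 'x'.repeat(1100);
const body = new URLSearchParams({
  'MetricData.member.1.Dimensions.member.1.Value': 'GET',
  'MetricData.member.1.Dimensions.member.2.Value': longUrl,
}).toString();

console.log(findOversizedParams(body));
```

The reported key can then be matched against the parameter named in the CloudWatch error to confirm which dimension is responsible.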