contribsys / faktory

Language-agnostic persistent background job server
https://contribsys.com/faktory/

StatsD metrics parsing errors after upgrade to v1.7.0 #434

Closed: seedifferently closed this issue 1 year ago

seedifferently commented 1 year ago

We just upgraded to Faktory Enterprise v1.7.0 (from the contribsys docker image), and our CloudWatch StatsD collection agent started logging a constant stream of parsing errors.

Our statsd.toml looks like:

[statsd]
  location = "localhost:8125"
  namespace = "faktory"

Sample of logs from statsd agent:

2023-05-18T03:52:01Z E! Error: splitting '|', Unable to parse metric: faktory.throttle.lock:9|c|c:f6ac8d4e923a45bbbbcce94b7937f170-2157570088
2023-05-18T03:52:01Z E! Error: splitting '|', Unable to parse metric: faktory.failures:18585|g|c:f6ac8d4e923a45bbbbcce94b7937f170-2157570088
2023-05-18T03:52:01Z E! Error: splitting '|', Unable to parse metric: faktory.dead:0|g|c:f6ac8d4e923a45bbbbcce94b7937f170-2157570088
2023-05-18T03:52:01Z E! Error: splitting '|', Unable to parse metric: faktory.ops.memory:22002704|g|c:f6ac8d4e923a45bbbbcce94b7937f170-2157570088
2023-05-18T03:52:01Z E! Error: splitting '|', Unable to parse metric: faktory.enqueued.default:0|g|c:f6ac8d4e923a45bbbbcce94b7937f170-2157570088

No errors before we upgraded from v1.6.2 to v1.7.0.

Does v1.7.0 need additional settings in our statsd.toml to maintain the v1.6.2 behavior?

mperham commented 1 year ago

1.7 upgraded the dogstatsd client. Is it possible you need to upgrade the agent? Exactly what software and version is the agent?

seedifferently commented 1 year ago

Our logs say we're running AmazonCloudWatchAgent 1.247359.0 which seems to be latest(ish): https://github.com/aws/amazon-cloudwatch-agent/releases

We do a docker pull public.ecr.aws/cloudwatch-agent/cloudwatch-agent:latest with each redeploy to stay current with their releases.

mperham commented 1 year ago

I guess I don't understand the error. Is that line malformed? If so, how?

seedifferently commented 1 year ago

This is getting beyond my areas of expertise, but the cloudwatch agent statsd docs say:

CloudWatch supports the following StatsD format: MetricName:value|type|@sample_rate|#tag1:value...

However, I'm now also seeing that the DogStatsD protocol v1.2 added a new container ID field.

Looking at another example from our logs, my guess is the CW Agent's lack of support for this field is what's tripping it up:

2023-05-18T18:13:07Z E! Error: parsing sample rate, , it must be in format like: @0.1, @0.5, etc. Ignoring sample rate for line: faktory.throttle.lock:10|c|c:96eba0d873fa4c2ea751083e66d99e77-2157570088

Assuming that's a solid guess, I'll see if there's anything that I can do on my end to get the CW Agent to handle that better -- but might there be a way to disable utilization of that field on Faktory's side?
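
For illustration, here is a minimal Python sketch of a strict parser for the format quoted from the CloudWatch docs above (`MetricName:value|type|@sample_rate|#tag1:value...`). It is not the CloudWatch agent's actual code, only a guess at the kind of logic that would reject the trailing `c:` field. The names `parse_metric` and `strip_container_id` are hypothetical, and the assumption that the container ID always appears as a `|c:<id>` field after the type comes from the sample log lines in this thread.

```python
def parse_metric(line):
    """Parse 'name:value|type|@rate|#k:v,...' and reject any other field."""
    name, _, rest = line.partition(":")
    parts = rest.split("|")
    if len(parts) < 2:
        raise ValueError(f"unable to parse metric: {line}")
    metric = {
        "name": name,
        "value": parts[0],
        "type": parts[1],
        "sample_rate": 1.0,
        "tags": {},
    }
    for field in parts[2:]:
        if field.startswith("@"):
            # Sample rate, e.g. "@0.5"
            metric["sample_rate"] = float(field[1:])
        elif field.startswith("#"):
            # Comma-separated tags, e.g. "#env:prod,region:us"
            for tag in field[1:].split(","):
                key, _, val = tag.partition(":")
                metric["tags"][key] = val
        else:
            # A DogStatsD v1.2 container ID field ("c:<id>") lands here.
            raise ValueError(f"unexpected field {field!r} in: {line}")
    return metric


def strip_container_id(line):
    """Drop any 'c:<id>' field that follows the name:value and type fields."""
    parts = line.split("|")
    kept = parts[:2] + [p for p in parts[2:] if not p.startswith("c:")]
    return "|".join(kept)
```

With the first sample line from the logs above, `parse_metric` raises `ValueError`, while `parse_metric(strip_container_id(line))` succeeds. A parser that parsed and discarded unknown fields instead of raising would accept both forms.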

mperham commented 1 year ago

I would open an issue with the Cloudwatch agent repo/team and ask what their advice is. It's possible a lot of people are having this same problem, not just with Faktory. Will they support proprietary protocol extensions like this, or at least have their stream reader parse and discard the extra data?

mperham commented 1 year ago

You can try starting Faktory with DD_ORIGIN_DETECTION_ENABLED=false and see if that disables reporting the containerId.

seedifferently commented 1 year ago

It looks like DD_ORIGIN_DETECTION_ENABLED=false does the trick! Thank you for your assistance.
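
For anyone who wants to double-check the setting on their own deployment, one option is to look at the raw datagrams Faktory emits. The following is a minimal Python sketch, assuming metrics are sent over UDP to the `localhost:8125` location from the statsd.toml above and that the real agent is stopped so the port is free to bind. `capture_datagram` and `lines_with_container_id` are hypothetical helper names, not part of Faktory or the CloudWatch agent.

```python
import socket


def capture_datagram(host="127.0.0.1", port=8125, timeout=5.0):
    """Receive a single UDP datagram and return it as text.

    Assumes nothing else is bound to host:port while this runs.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.bind((host, port))
        data, _ = sock.recvfrom(65535)
    return data.decode("utf-8", errors="replace")


def lines_with_container_id(payload):
    """Return the metric lines that carry a 'c:<id>' field after the type."""
    flagged = []
    for line in payload.splitlines():
        fields = line.split("|")
        # fields[0] is name:value and fields[1] is the type; skip both so a
        # counter type "c" or a metric named "c" is not mistaken for the field.
        if any(f.startswith("c:") for f in fields[2:]):
            flagged.append(line)
    return flagged
```

Calling `lines_with_container_id(capture_datagram())` should return an empty list once Faktory is started with `DD_ORIGIN_DETECTION_ENABLED=false`, and a non-empty list if the container ID is still being sent.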