TritonDataCenter / containerpilot

A service for autodiscovery and configuration of applications running in containers
Mozilla Public License 2.0

Duplicate metric events in Telemetry endpoint #536

Closed · eliasbrange closed this issue 6 years ago

eliasbrange commented 6 years ago

We use containerpilot -putmetric to gather certain fluctuating metrics such as memory usage and free disk space. We noticed that after running a container in production for some time, the /metrics endpoint returns over 100k lines of the form

containerpilot_events{code="Metric",source="metric|value"} 1

Since the metrics fluctuate, this means we get a few new events every 15 seconds or so. My guess is that this will continue indefinitely until the container is restarted. For counters, which can only ever go up, this also means a new event line every time you do a putmetric.

My question is: is this expected behaviour? Filtering on source in e.g. InfluxDB is now impossible, and the endpoint is getting slower and slower due to the massive number of events returned.

Edit: I have experimented with versions 3.4.2, 3.5.0, and 3.6.0, and they all seem to behave the same.

jwreagor commented 6 years ago

Thanks for reporting this @eliasbrange.

Initially, the behavior you're describing doesn't sound like what I'd expect from a Prometheus-formatted endpoint. I would expect a single metric counter to increase monotonically, not to output multiple lines or counters with a single event value.

Could you paste a snippet of everything else being logged around when this starts (I don't need all the lines, obviously)?

Is it just a single metric named metric|value? Are any other custom metrics performing this way, possibly named in another manner?

A test configuration/setup is always helpful as well and helps me debug quicker, but I'm quite sure I can reproduce this. It sounds like I only need to push metrics into ContainerPilot.

eliasbrange commented 6 years ago

I don't think the config actually matters here, nor does defining a metric in it, but here is a small config anyway.

{
  consul: "localhost:8500",
  logging: {
    level: "DEBUG",
    format: "default",
    output: "stdout"
  },
  telemetry: {
    port: 9090,
    interfaces: ["static:127.0.0.1"],
    metrics: [
      {
        namespace: "telemetry",
        subsystem: "test",
        name: "counter",
        help: "test",
        type: "counter"
      }
    ]
  }
}

Then I run the following commands and get the following output:

$ curl --silent localhost:9090/metrics | grep telemetry_test_counter
> # HELP telemetry_test_counter test                                          
> # TYPE telemetry_test_counter counter                                       
> telemetry_test_counter 0                                                    

$ for i in {1..10}; do containerpilot -putmetric telemetry_test_counter=$i; done

bash-4.3# curl --silent localhost:9090/metrics | grep telemetry_test_counter
> containerpilot_events{code="Metric",source="telemetry_test_counter|1"} 1    
> containerpilot_events{code="Metric",source="telemetry_test_counter|10"} 1   
> containerpilot_events{code="Metric",source="telemetry_test_counter|2"} 1    
> containerpilot_events{code="Metric",source="telemetry_test_counter|3"} 1    
> containerpilot_events{code="Metric",source="telemetry_test_counter|4"} 1    
> containerpilot_events{code="Metric",source="telemetry_test_counter|5"} 1    
> containerpilot_events{code="Metric",source="telemetry_test_counter|6"} 1    
> containerpilot_events{code="Metric",source="telemetry_test_counter|7"} 1    
> containerpilot_events{code="Metric",source="telemetry_test_counter|8"} 1    
> containerpilot_events{code="Metric",source="telemetry_test_counter|9"} 1    
> # HELP telemetry_test_counter test                                          
> # TYPE telemetry_test_counter counter                                       
> telemetry_test_counter 55                                                   

As you can see, each call to -putmetric with a unique value results in a new line on the /metrics endpoint.

jwreagor commented 6 years ago

Thank you, your config definitely helped; I wasn't seeing the issue until I tried your setup. So yes, you've found a bug, and possibly a race condition: hammering the endpoint with a for loop creates multiple events, while sending them more slowly increments an existing event.

Notice below that telemetry_test_counter|1 equals 2. I wouldn't have noticed that without the for loop!

$ curl --silent localhost:9090/metrics | grep telemetry_test_counter
containerpilot_events{code="Metric",source="telemetry_test_counter|1"} 2
containerpilot_events{code="Metric",source="telemetry_test_counter|10"} 1
containerpilot_events{code="Metric",source="telemetry_test_counter|2"} 1
containerpilot_events{code="Metric",source="telemetry_test_counter|3"} 1
containerpilot_events{code="Metric",source="telemetry_test_counter|4"} 1
containerpilot_events{code="Metric",source="telemetry_test_counter|5"} 1
containerpilot_events{code="Metric",source="telemetry_test_counter|6"} 1
containerpilot_events{code="Metric",source="telemetry_test_counter|7"} 1
containerpilot_events{code="Metric",source="telemetry_test_counter|8"} 1
containerpilot_events{code="Metric",source="telemetry_test_counter|9"} 1
# HELP telemetry_test_counter test
# TYPE telemetry_test_counter counter
telemetry_test_counter 56
jwreagor commented 6 years ago

Looks like everything is working fine except logging: we're logging each unique counter event by value. This is especially prevalent when counting in varying increments, which I didn't notice before.

$ for i in {1..10}; do containerpilot -config ./containerpilot.json5 -putmetric "telemetry_test_counter=1"; done
$ curl -s 172.16.20.7:4141/metrics | grep something_test
# HELP telemetry_test_counter total number of somethings
# TYPE telemetry_test_counter counter
telemetry_test_counter 10
containerpilot_events{code="Metric",source="telemetry_test_counter|1"} 10
jwreagor commented 6 years ago

We're reporting events through our event bus Prometheus collector. It looks like I can remove the value portion of the event name when collecting Metric events, but I'm not sure of the repercussions. I believe the value is only used for reporting, so that's one option.

Another option would be to stop recording Metric events altogether, since end users already have their Prometheus counter values. Why do we need duplicate information through the endpoint? Do we really need to know about each Metric event if the counter or gauge is what we're ultimately deriving value from?

Any thoughts? Going to think about this.

jwreagor commented 6 years ago

After adding a single if statement around Metric collection, I'm feeling good about option 2: removing Metric event collection altogether. In hindsight it feels a bit overzealous. Also, I don't believe we had many consumers of the telemetry endpoint in that regard, so it's probably safe to remove those events.

eliasbrange commented 6 years ago

Thanks for the quick iterations on this one @cheapRoc. Much appreciated.

jwreagor commented 6 years ago

Just released in version 3.6.1, thanks again for the report!