Stackdriver / stackdriver-prometheus

Prometheus support for Stackdriver
https://cloud.google.com/monitoring/kubernetes-engine/prometheus
Apache License 2.0
19 stars 12 forks source link

Fixing nil pointer dereference #8

Closed m1k3 closed 6 years ago

m1k3 commented 6 years ago

which can happen in the _sum metric is missing.

Some prometheus exporters are not exposing the _sum metric of a summary which will cause the parser to crash. I believe it is the root cause of #5.

I added a test based on TestEmptySum that exposes the issue. It is just a quick copy&paste of TestPointExtractor to expose the issue and I'd welcome some feedback to make it less noisy 😉 .

I also added a check to make sure that the metric.Summary.SampleSum is not nil when we try to decrement the resetValue from it. I'm not sure it is the best way to solve the issue, but I ran the code in my cluster and it seems to solve the problem.

juliusv commented 6 years ago

Drive-by comment from someone without context (happen to be subscribed to this repo because I think someone added me): wouldn't you also need to add a similar check for _count and potentially other fields? Missing either _count or _sum is not a valid Prometheus summary metric, so I'm not sure I understand why one would have to be checked but not the other.

m1k3 commented 6 years ago

You're probably right about other fields. For now I would prefer to validate whether or not my approach is valid within the context of what is happening while parsing. Once we determine that we can extend it to other fields.

I ran into this, because I was trying to expose Dropwizard metrics in the prometheus format using this example and the maintainer was very clear that he does not want to expose the _sum metric because Dropwizard does not provide data for it and that it is a completely valid thing to do with the Prometheus format to omit the _sum metric.

juliusv commented 6 years ago

@m1k3 Hmmm, as far as I can see that example only exposes a normal counter metric (with _total suffix), not a summary metric. Am I missing something? Pointers at related discussions welcome...

m1k3 commented 6 years ago

The example is not exposing a summary, but in principle it is possible to expose Dropwizard histograms as Prometheus summaries but if you look at the code the _sum is missing. This is exactly what I added to the example and sure enough there was no _sum. Have a look at my PR for the discussion.

juliusv commented 6 years ago

Interesting, I had never heard that a summary without _sum is considered valid in Prometheus, but ok. You can still encounter /metrics endpoints that have other faults like missing the _count series for a summary or histogram, and you probably don't want to crash on them either (as you mentioned, you plan on checking more fields later).

jkohen commented 6 years ago

Thanks for the PR! Can you address @juliusv 's feedback?

@fabxc FYI

m1k3 commented 6 years ago

@jkohen I'm not sure which part of the feedback are you referring to. Do you mean adding the check for SampleCount?

jkohen commented 6 years ago

Yes, exactly.

On Thu, Jul 12, 2018, 13:33 Michal Olah notifications@github.com wrote:

@jkohen https://github.com/jkohen I'm not sure which part of the feedback are you referring to. Do you mean adding the check for SampleCount?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Stackdriver/stackdriver-prometheus/pull/8#issuecomment-404590246, or mute the thread https://github.com/notifications/unsubscribe-auth/AG0TaH1_MDqiBIzwaEAqsxjTm33KKLz7ks5uF4iEgaJpZM4VCt4G .

m1k3 commented 6 years ago

@jkohen I added the check for the missing _count and updated the test to expose when such a situation would occur.

m1k3 commented 6 years ago

Awesome, thx. I thought we'd cleanup the test I added to point out the problem more clearly (there some extra variables in there and a more-or-less duplicated helper function) before we merge, but too late now I guess.