census-instrumentation / opencensus-service

OpenCensus service allows OpenCensus libraries to export to an exporter service rather than having to link vendor-specific exports.
Apache License 2.0

Fix issues with initial start time and restarts in the Prometheus receiver. #598

Closed dinooliva closed 5 years ago

dinooliva commented 5 years ago

Code is complete - ptal. Currently working on updating/adding unit tests.

fivesheep commented 5 years ago

Oops, looks like this PR addresses the same issue as https://github.com/census-instrumentation/opencensus-service/pull/597, which I submitted earlier. I had actually tried a similar approach at the very beginning, caching the first generated TimeSeries; however, it doesn't work properly in some cases, including:

1) As @rghetia also mentioned, to detect whether a remote server has restarted, one needs to compare the current value with the last observed value (a minimal sketch follows this list).

2) The program also needs to clean up the cache, as timeseries can come and go. For example, in our use case we mainly use ocagent to scrape cadvisor metrics, which include a lot of container spec metrics whose lifecycles are tied to the containers, not to cadvisor.

3) You might observe new labels in subsequent runs. Take a look at the following example from scrape run 1:

# HELP container_memory_failures_total Cumulative count of memory allocation failures.
# TYPE container_memory_failures_total counter
container_memory_failures_total{failure_type="pgfault",id="/",image="",name="",scope="container"} 760009

From an Appender's perspective, only the labels with non-empty values can be observed, since labels with empty values are removed before they reach an appender's Add/AddFast methods.

Then in the 2nd run, you get new samples in which some previously empty labels are filled:

# HELP container_memory_failures_total Cumulative count of memory allocation failures.
# TYPE container_memory_failures_total counter
container_memory_failures_total{failure_type="pgfault",id="/",image="",name="",scope="container"} 760010
container_memory_failures_total{failure_type="pgfault",id="/docker/6a26e59b145922c813444ca59b46cc3186e6e64064744f973b59f81dbd913fa8",image="somedocker-image",name="some-name",scope="container"} 1.877625e+06

Now the timeseries carries the full set of labels in the 2nd run, and it becomes challenging to link it back to the series from the 1st run. Not only the label values matter; the label order is also important.

4) For any failed scrape (say, the remote endpoint returning some temporary errors), the Prometheus scrape library will still feed the appender data with the metric labels from its cache and a value of NaN. The appender also needs to filter out such data.
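
To make issue 1 (and the NaN filtering in issue 4) concrete, here is a minimal sketch; the timeseriesState type and adjust method are hypothetical names for illustration, not the actual code in either PR:

package internal // hypothetical placement

import (
	"math"
	"time"
)

// timeseriesState caches per-timeseries information between scrapes.
type timeseriesState struct {
	startTime time.Time // start time to report downstream
	lastValue float64   // last observed cumulative value
}

// adjust returns the start time to report for a new observation and updates
// the cached state. A cumulative value that goes backwards is treated as a
// restart of the scraped target.
func (s *timeseriesState) adjust(now time.Time, value float64) time.Time {
	if math.IsNaN(value) {
		// Failed scrapes surface as NaN samples (issue 4); ignore them.
		return s.startTime
	}
	if value < s.lastValue {
		// The counter reset, i.e. the remote server restarted (issue 1):
		// the current scrape becomes the new start time.
		s.startTime = now
	}
	s.lastValue = value
	return s.startTime
}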

You might want to take a look at the PR I have submitted and take the work from there. I have also added quite a lot of tests trying to capture all these corner cases, as well as a redesigned EndToEnd test which produces more stable results and can simulate edge cases such as the remote server not responding properly.

rghetia commented 5 years ago

#597 covers more cases than #598. However, in #598 the MetricAdjuster is independent of scraping, which is useful for another Envoy Prometheus receiver that converts from protobuf (instead of text) to opencensus-proto.

dinooliva commented 5 years ago

@fivesheep - thank you for the detailed response and my apologies for the delay in responding. As @rghetia mentioned, I took this approach because we'd like to reuse the logic across both the text and proto versions of the Prometheus receiver. In response to your particular comments:

  1. You're right that looking at the previous value is probably a better approach to detecting a reset than looking at the initial value - that should be straightforward to add to this approach.

  2. There's not really a perfect solution to this problem, but adding a TTL field (or something similar) should be straightforward.

  3. That's a very interesting point. I need to understand it better before I can say whether or not it's feasible to add to this approach - I'll take a look over your PR shortly.

  4. That's a great point too - that should also be straightforward to add.

fivesheep commented 5 years ago

@dinooliva it does make sense to have a generic solution that can adjust incoming metric values for the different Prometheus protocols. I am not sure if it's possible to move it one more layer up, into a more generic solution for any metrics receiver that needs to adjust values, with a simple interface like:

type MetricsAdjuster interface {
    AdjustMetric(*data.MetricsData) *data.MetricsData
}

and make it an option for the metrics receiver factory.
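
A no-op implementation plus a functional option is enough to illustrate the idea. This is a hypothetical sketch only; the ReceiverOption, receiverConfig, and WithMetricsAdjuster names, and the data import path, are assumptions rather than the project's actual factory API:

package receiver // hypothetical placement

import "github.com/census-instrumentation/opencensus-service/data"

// MetricsAdjuster is the interface proposed above.
type MetricsAdjuster interface {
	AdjustMetric(*data.MetricsData) *data.MetricsData
}

// noopAdjuster passes metrics through unchanged.
type noopAdjuster struct{}

func (noopAdjuster) AdjustMetric(md *data.MetricsData) *data.MetricsData { return md }

// receiverConfig and ReceiverOption stand in for whatever option mechanism
// the metrics receiver factory exposes.
type receiverConfig struct {
	adjuster MetricsAdjuster
}

type ReceiverOption func(*receiverConfig)

// WithMetricsAdjuster lets a caller plug in any MetricsAdjuster implementation.
func WithMetricsAdjuster(a MetricsAdjuster) ReceiverOption {
	return func(c *receiverConfig) { c.adjuster = a }
}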

Other than that, in https://github.com/census-instrumentation/opencensus-service/pull/597 I have also refactored the metricsbuilder and added a layer called metric family, which makes the code cleaner and easier to understand - you might also want to take a look at it.

dinooliva commented 5 years ago

@fivesheep - I spent some time looking over your solution and your notes. I think it may be best to go with a combined solution. My module is strictly for adjusting start times and values based on discovered initial timeseries and resets.

Dealing with newly discovered labels and empty histograms/summaries is not really something the metrics adjuster should handle (I just have to make sure that the signature calculation works appropriately for empty label values), but those issues do need to be dealt with in the metricsbuilder, as you have done.

I plan to add a couple of commits that deal with issues 1 and 2. Issue 3, I think, should be handled by the metricsbuilder, and issue 4 I think can also be dealt with using your solution.
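
For what "signature calculation" could mean here, a minimal sketch (the timeseriesSignature helper is hypothetical, not code from this PR) that is independent of label order and skips empty label values so a series stays linked across scrapes:

package internal // hypothetical placement

import (
	"sort"
	"strings"
)

// timeseriesSignature builds a cache key for a timeseries from its metric
// name and labels.
func timeseriesSignature(metricName string, labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // sort so that label order does not affect the signature

	var sb strings.Builder
	sb.WriteString(metricName)
	for _, k := range keys {
		if labels[k] == "" {
			// Skip empty values so a label that is absent in one scrape and
			// empty in another does not produce a different signature.
			continue
		}
		sb.WriteString("," + k + "=" + labels[k])
	}
	return sb.String()
}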

codecov[bot] commented 5 years ago

Codecov Report

Merging #598 into master will increase coverage by 0.28%. The diff coverage is 85.13%.


@@            Coverage Diff             @@
##           master     #598      +/-   ##
==========================================
+ Coverage   68.61%   68.89%   +0.28%     
==========================================
  Files          91       92       +1     
  Lines        5939     6064     +125     
==========================================
+ Hits         4075     4178     +103     
- Misses       1650     1668      +18     
- Partials      214      218       +4
Impacted Files | Coverage Δ
receiver/prometheusreceiver/internal/ocastore.go | 72.41% <100%> (+0.98%) ↑
...iver/prometheusreceiver/internal/metricsbuilder.go | 100% <100%> (ø)
receiver/prometheusreceiver/metrics_receiver.go | 72.13% <100%> (+1.95%) ↑
...eceiver/prometheusreceiver/internal/transaction.go | 89.06% <76.19%> (-7.31%) ↓
...er/prometheusreceiver/internal/metrics_adjuster.go | 85.21% <85.21%> (ø)

Continue to review full report at Codecov.


dinooliva commented 5 years ago

PTAL

This PR now addresses the issue with restarts and adds more unit tests. Once this PR has been merged, I will submit another PR to deal with cleaning up the cache (I don't want this one to get too complicated).

dinooliva commented 5 years ago

@songy23 - ptal

dinooliva commented 5 years ago

@fivesheep

We've merged this code, which resolves issue 1 of the 4 issues that you originally noted. Of the other 3 issues:

Issue 2 needs to be handled by the metrics adjuster and I'm working on a solution. I had been planning to clean up at the job level rather than at the timeseries level, but that doesn't seem to work for your use case (cadvisor), so I will look into handling individual timeseries.
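
A minimal sketch of what per-timeseries cleanup could look like (the adjusterCache type, ttl field, and gc method are hypothetical illustrations, not code from this or any follow-up PR):

package internal // hypothetical placement

import "time"

// cacheEntry records when a timeseries was last seen and the start time
// reported for it.
type cacheEntry struct {
	startTime time.Time
	lastSeen  time.Time
}

// adjusterCache holds per-timeseries state, keyed by timeseries signature.
type adjusterCache struct {
	ttl     time.Duration
	entries map[string]*cacheEntry
}

// gc drops entries that have not been seen within ttl, so timeseries that
// disappear (e.g. containers going away under cadvisor) do not accumulate.
func (c *adjusterCache) gc(now time.Time) {
	for sig, e := range c.entries {
		if now.Sub(e.lastSeen) > c.ttl {
			delete(c.entries, sig)
		}
	}
}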

Issue 3 is separate from the metrics adjuster and your code already deals with it.

Issue 4 is also separate from the metrics adjuster and, again, your code deals with it.

Would you be able to update your pr to remove the issue 1 and issue 2 related code while keeping the rest?

fivesheep commented 5 years ago

@dinooliva sure, will do. Are we still going to have a release with these two PRs, or will releases only happen in the new project?

dinooliva commented 5 years ago

@fivesheep I think it would make sense to make a new release with these changes but I'll have to consult with the project owners to see what their current process is.

@songy23 - any thoughts?

songy23 commented 5 years ago

We can still make patch releases for small improvements, bug fixes, etc. @flands Could you help?

pjanotti commented 5 years ago

There are a few pending PRs that started before we announced that we were in "maintenance mode". Assuming this can wait a bit, let's close those before cutting a new release.