Oops, looks like this PR is addressing the same issue as https://github.com/census-instrumentation/opencensus-service/pull/597, which I submitted earlier. I had actually tried a similar approach at the very beginning by caching the first generated TimeSeries; however, it does not work properly in some cases, including the following:
1) As @rghetia also mentioned, to detect whether a remote server has restarted, one needs to compare the current value with the last observed value.
2) The program also needs to clean up the cache, as timeseries can come and go. For example, in our use case we mainly use ocagent to scrape cadvisor metrics, which include a lot of container spec metrics whose lifecycles are tied to the containers rather than to cadvisor.
3) You might observe new labels in subsequent runs. Take a look at the following example.
Scrape run 1:
# HELP container_memory_failures_total Cumulative count of memory allocation failures.
# TYPE container_memory_failures_total counter
container_memory_failures_total{failure_type="pgfault",id="/",image="",name="",scope="container"} 760009
From an Appender's perspective, only the labels with non-empty values (here failure_type, id and scope) can be observed, as labels with empty values are removed before they hit the Add/AddFast methods of an appender.
Then in the 2nd run, you get a new timeseries in which some of the previously empty labels are filled in:
# HELP container_memory_failures_total Cumulative count of memory allocation failures.
# TYPE container_memory_failures_total counter
container_memory_failures_total{failure_type="pgfault",id="/",image="",name="",scope="container"} 760010
container_memory_failures_total{failure_type="pgfault",id="/docker/6a26e59b145922c813444ca59b46cc3186e6e64064744f973b59f81dbd913fa8",image="somedocker-image",name="some-name",scope="container"} 1.877625e+06
Now a timeseries in the 2nd run carries all of the labels, and it will be challenging to link it back to the one from the 1st run. Not only the label values but also the label order matters.
4) For any failed scrape (say the remote endpoint returns a temporary error), the Prometheus scrape library will still feed the appender data points with the metric labels from its cache and a value of NaN; the appender needs to filter out such data as well (a sketch of this check follows the list).
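A minimal sketch of that filtering step, assuming an appender with a Prometheus storage.Appender-style Add method; the txn type here is illustrative, not the actual transaction.go code:

package internal

import (
	"math"

	"github.com/prometheus/prometheus/pkg/labels"
)

type txn struct {
	// ... other fields elided ...
}

// Add is called by the Prometheus scrape loop for every cached series even
// when the scrape failed; in that case the value is NaN (a stale marker),
// so such samples are dropped before they reach the metrics builder.
func (t *txn) Add(ls labels.Labels, ts int64, v float64) (uint64, error) {
	if math.IsNaN(v) {
		return 0, nil // ignore stale/failed-scrape samples
	}
	// ... convert the sample and hand it to the metrics builder ...
	return 0, nil
}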
You might want to take a look at the PR I have submitted and build on the work there. I have also added quite a lot of tests that try to capture all of these corner cases, as well as a redesigned end-to-end test which produces more stable results and can simulate edge cases such as the remote server not responding properly.
@fivesheep - thank you for the detailed response and my apologies for the delay in responding. As @rghetia mentioned, I took this approach because we'd like to reuse the logic across both the text and proto versions of the prometheus receiver. In response to your particular comments:
You're right that looking at the previous value is a better approach to detecting a reset than looking at the initial value - that should be straightforward to add to this approach (see the sketch after these points).
There's not really a perfect solution to this problem, but adding a ttl field (or something similar) should be straightforward to add - the sketch after these points records a last-seen timestamp for that purpose.
That's a very interesting point that I need to understand better, so I can't yet say whether or not it's feasible to add to this approach - I'll take a look over your PR shortly.
That's a great point too - that should also be straightforward to add.
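To make points 1 and 2 concrete, here is a rough sketch (not the actual metrics_adjuster.go code; all names are illustrative) of a per-timeseries cache entry that detects resets against the previously observed value and records a last-seen timestamp so that stale entries can be garbage collected later:

package internal

import "time"

type timeseriesState struct {
	startTime time.Time // adjusted start time reported downstream
	previous  float64   // last value observed for this timeseries
	lastSeen  time.Time // used by a periodic cleanup pass
}

// adjust returns the start time to report for a cumulative point and updates
// the cached state. A value lower than the previous one indicates the remote
// process restarted, so the start time is reset to the current scrape time.
func (s *timeseriesState) adjust(value float64, now time.Time) time.Time {
	if value < s.previous {
		s.startTime = now
	}
	s.previous = value
	s.lastSeen = now
	return s.startTime
}

// gc drops entries that have not been seen within ttl, so timeseries that
// come and go (e.g. per-container cadvisor metrics) do not accumulate.
func gc(cache map[string]*timeseriesState, ttl time.Duration, now time.Time) {
	for sig, ts := range cache {
		if now.Sub(ts.lastSeen) > ttl {
			delete(cache, sig)
		}
	}
}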
@dinooliva it does make sense to have a generic solution which is able to adjust incoming metric values for the different Prometheus protocols. I am not sure if it's possible to move it one layer up and make it an even more generic solution for any metrics receiver that needs to adjust values, with a simple interface like:
type MetricsAdjuster interface {
    AdjustMetric(*data.MetricsData) *data.MetricsData
}
and make it an option for the metrics receiver factory.
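As an illustration only (the WithMetricsAdjuster option and factory shape are hypothetical, not part of the existing codebase; only data.MetricsData comes from the repository), such an adjuster could be plugged into a receiver via a functional option along these lines:

package receiver

import "github.com/census-instrumentation/opencensus-service/data"

// MetricsAdjuster is the generic interface proposed above.
type MetricsAdjuster interface {
	AdjustMetric(*data.MetricsData) *data.MetricsData
}

type receiverOptions struct {
	adjuster MetricsAdjuster
}

// Option configures a metrics receiver created by the factory.
type Option func(*receiverOptions)

// WithMetricsAdjuster installs an adjuster that is applied to each batch of
// metrics before it is forwarded to the next consumer.
func WithMetricsAdjuster(a MetricsAdjuster) Option {
	return func(o *receiverOptions) { o.adjuster = a }
}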
Other than that, in https://github.com/census-instrumentation/opencensus-service/pull/597 I have also refactored the metricsbuilder and added a layer called metric family, which makes the code cleaner and easier to understand - you might also want to take a look at that.
@fivesheep - I spent some time looking over your solution and your notes. I think it may be best to go with a combined solution. My module is strictly for adjusting start times and values based on discovered initial timeseries and resets.
Dealing with newly discovered labels and empty histograms/summaries is not really an issue that the metrics adjuster should deal with (I just have to make sure that the signature calculation works appropriately for empty label values), but those issues do need to be dealt with in the metricsbuilder, as you have done.
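For illustration only (the function name is hypothetical and labels are assumed to arrive as a map), a signature along these lines skips empty label values and sorts keys, so it does not depend on label order or on empty labels appearing in later scrapes:

package internal

import (
	"sort"
	"strings"
)

// timeseriesSignature builds a stable cache key for a timeseries from its
// metric name and non-empty labels.
func timeseriesSignature(metric string, lbls map[string]string) string {
	keys := make([]string, 0, len(lbls))
	for k, v := range lbls {
		if v == "" {
			continue // an empty value is treated the same as an absent label
		}
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var b strings.Builder
	b.WriteString(metric)
	for _, k := range keys {
		b.WriteString("," + k + "=" + lbls[k])
	}
	return b.String()
}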
I plan to add a couple of commits that deal with issues 1 and 2. Issue 3, I think, should be handled by the metricsbuilder, and issue 4 can also be dealt with using your solution.
Merging #598 into master will increase coverage by 0.28%. The diff coverage is 85.13%.
@@ Coverage Diff @@
## master #598 +/- ##
==========================================
+ Coverage 68.61% 68.89% +0.28%
==========================================
Files 91 92 +1
Lines 5939 6064 +125
==========================================
+ Hits 4075 4178 +103
- Misses 1650 1668 +18
- Partials 214 218 +4
Impacted Files | Coverage Δ |
---|---|
receiver/prometheusreceiver/internal/ocastore.go | 72.41% <100%> (+0.98%) ↑ |
...iver/prometheusreceiver/internal/metricsbuilder.go | 100% <100%> (ø) ↑ |
receiver/prometheusreceiver/metrics_receiver.go | 72.13% <100%> (+1.95%) ↑ |
...eceiver/prometheusreceiver/internal/transaction.go | 89.06% <76.19%> (-7.31%) ↓ |
...er/prometheusreceiver/internal/metrics_adjuster.go | 85.21% <85.21%> (ø) |
Continue to review the full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 35b1e1c...84db0f8.
PTAL
This PR now addresses the issue with restarts and adds more unit tests. Once this PR has been merged, I will submit another PR to deal with cleaning up the cache (I don't want this PR to get too complicated).
@songy23 - ptal
@fivesheep
We've merged this code, which resolves issue 1 of the 4 issues that you originally noted. Of the other 3 issues:
Issue 2 needs to be handled by the metrics adjuster and I'm working on a solution to it. I had been planning to clean up at the job level rather than at the timeseries level, but that doesn't seem to work for your use case (cadvisor), so I will look into handling individual timeseries.
Issue 3 is separate from the metrics adjuster and your code already deals with it.
Issue 4 is also separate from the metrics adjuster and, again, your code deals with it.
Would you be able to update your pr to remove the issue 1 and issue 2 related code while keeping the rest?
@dinooliva sure, will do. Are we still going to have a release with these two PRs, or will releases only happen in the new project?
@fivesheep I think it would make sense to make a new release with these changes but I'll have to consult with the project owners to see what their current process is.
@songy23 - any thoughts?
We can still make patch releases for small improvements, bug fixes, etc. @flands could you help?
There are a few pending PRs that were started before we made the announcement that we were in "maintenance mode". Assuming this can wait a bit, let's close those out before cutting a new release.
Code is complete - ptal. Currently working on updating/adding unit tests.