RobustPerception / azure_metrics_exporter

Azure metrics exporter for Prometheus
Apache License 2.0
133 stars 69 forks source link

Added delay option to retrieve the metrics. #75

Closed vincentcreusot closed 11 months ago

vincentcreusot commented 4 years ago

It is used to avoid collecting data that has not fully converged. For target, group and tag configuration, we can set the delay_minutes parameter to specify more than the 3 default minutes.

brian-brazil commented 4 years ago

I wasn't aware that Azure had the convergence issues that CloudWatch has. Can you point me at any docs around this?

vincentcreusot commented 4 years ago

I wasn't aware that Azure had the convergence issues that CloudWatch has. Can you point me at any docs around this?

Sadly I didn't find any documentation, neither bug around that but I'm talking after experiencing it. A request for that : /subscriptions/${subscription_id}/resourceGroups/${RG}/providers/Microsoft.Web/serverFarms/${resource_name}/providers/microsoft.insights/metrics?aggregation=Total%2CAverage%2CMinimum%2CMaximum&api-version=2018-01-01&metricnames=MemoryPercentage

gives

{ "timeStamp": "2019-12-03T20:06:00Z", "total": 249.0, "average": 62.25, "minimum": 62.0, "maximum": 63.0 }, { "timeStamp": "2019-12-03T20:07:00Z", "total": 0.0, "average": 0.0, "minimum": 0.0, "maximum": 0.0 }, { "timeStamp": "2019-12-03T20:08:00Z", "total": 0.0, "average": 0.0, "minimum": 0.0, "maximum": 0.0 }, { "timeStamp": "2019-12-03T20:09:00Z", "total": 0.0, "average": 0.0, "minimum": 0.0, "maximum": 0.0 } So some zero values in the end. If you do the request later, the value at the timestamp that had a 0 has a value. Some metrics put zeros, some do the right thing and do not return values. I'm only talking about Azure in this case. We discovered this because we had alerts raising for CPU used regularly and we tried to reproduce this problem. I hope it answers your question.

brian-brazil commented 4 years ago

That sounds like it might actually be a bug rather than an expected behaviour, considering that noone else has reported it (whereas we heard of it quite a lot for Cloudwatch). Would you mind filing an issue with Azure to see if this is intentional?

louisfelix commented 4 years ago

Hi @brian-brazil ! This particular case with app service plan may be an Azure bug, it's true, but having the possibility to override the exporter hardcoded delay of 3 minutes is a life-saver here, and could also be in other situations. In the current case, using a delay of 5 minutes instead of 3 fixes the issue. I think it doesn't harm to have this flexibility.

brian-brazil commented 4 years ago

If Azure really works this way, then this needs to be changed for everyone - not as something users find out about the hard way. Thus we need to know if this is intentional.

vincentcreusot commented 4 years ago

I created a feedback demand through the Azure portal to see if someone will give me an explanation. And also a support ticket.

vincentcreusot commented 4 years ago

Helllo @brian-brazil , I poked them again on the support request which was planned to be answered in 8h, but still nothing. Continuing to poke them, but I don't know if I'll have an explanation. I totally understand your concern about adding code to maintain just to fix a bug.

brian-brazil commented 4 years ago

It's more that there's wide ranging consequences if this is how things work, and there'd be more to it than your PR. I've already dealt with this over in the cloudwatch exporter, and we could avoid that complexity here that'd be good.

louisfelix commented 4 years ago

Hi @brian-brazil , I follow-up the work of my colleague (@vincentcreusot). We got a few exchange with Azure support on the case we opened with them.

The case is now closed so here is a summary of what I can grab out from Azure support comments:

  • it looks like it may be due to the clustering aspect of the web app backend, I’ve opened a collaboration to that team, but unfortunately the fullest answer we may get is that this is an artifact of the backend processes and infrastructure.
  • when data values are not present yet azure monitor puts in the default value for the variable type. So if the variable is written as a string, the default string is just “” an empty string, but INTs cannot be “” they have to be a valid number, so its just 0.
  • unfortunately this seems to be part of the design of the system itself.

So on one hand, it looks like they do not consider this as a bug for AppServicePlan (Microsoft.Web/serverFarms); so it would be useful if the exporter adapt to that "feature" by supporting the possibility to override the currently hard-coded 3 minute delay.

On the other hand I understand it has only been observed on this specific resource type so far, and that the "conclusion" is not very solid with the information we were able to get.

Let me know what you think.

brian-brazil commented 4 years ago

Silently providing data that you know to be incorrect sounds like a bug to me. Is there a way to request that Azure Monitor not do that?

On the other hand I understand it has only been observed on this specific resource type so far, and that the "conclusion" is not very solid with the information we were able to get.

It sounds related to what we see with CloudWatch. Odd that it's only on one resource type.

louisfelix commented 4 years ago

Silently providing data that you know to be incorrect sounds like a bug to me. Is there a way to request that Azure Monitor not do that?

Request can be made on the azure feedback page. I made a search and found "our problem"!

I "voted" it up, hopefully others will do, but I have the feeling it can be long to see a fix (or even an answer).

PeterIttner commented 4 years ago

If we report Azure metrics from the past, wouldn't it be more correct to also include a prometheus timestamp with those metrics? There is the NewMetricWithTimestamp function for this in the prometheus library. Currently all Azure Metrics are reported as "NOW" but actually they are from a past point in time.

louisfelix commented 4 years ago

If we report Azure metrics from the past, wouldn't it be more correct to also include a prometheus timestamp with those metrics? There is the NewMetricWithTimestamp function for this in the prometheus library. Currently all Azure Metrics are reported as "NOW" but actually they are from a past point in time.

I agree with that: whether or not a new delay option is added to the exporter, I think NewMetricWithTimestamp should be used to reflect the currently hard-coded delay.

ChristianGottinger commented 4 years ago

I added a pullrequest #81 with metrices adapted to expose also timestamp.