Azure / azure-sdk-for-go

This repository is for active development of the Azure SDK for Go. For consumers of the SDK we recommend visiting our public developer docs at:
https://docs.microsoft.com/azure/developer/go/
MIT License

[azmetrics] Inconsistent Metric Values from azmetrics.QueryResources() with Batches of 4+ Resources #22757

Open zmoog opened 4 months ago

zmoog commented 4 months ago

Bug Report

Context

I am collecting metrics for 10 Microsoft.KeyVault/vaults.

What happened?

If I call azmetrics.QueryResources() with a batch of 1-3 resources, I get the same data points I see on Azure Portal.

However, I don't get the same values I see in Azure Portal if I try to get metrics values for the same resource in a batch request with 4+ resources.

For example, with a batch of 1-3 resources, I always get two time series values for each resource. Starting with batches of 4+ resources, the number of time series values in the response varies at each request (0-2).

In the following example, I collect the metrics values in two ways:

$ go run main.go

Ready to go!
----------------------------------------------------
Querying SINGLE resources
----------------------------------------------------
.../providers/Microsoft.KeyVault/vaults/kv1-oeefrt7tlykau
timeseries 2
.../providers/Microsoft.KeyVault/vaults/kv2-oeefrt7tlykau
timeseries 2
.../providers/Microsoft.KeyVault/vaults/kv3-oeefrt7tlykau
timeseries 2
.../providers/Microsoft.KeyVault/vaults/kv4-oeefrt7tlykau
timeseries 2
.../providers/Microsoft.KeyVault/vaults/kv5-oeefrt7tlykau
timeseries 2
.../providers/Microsoft.KeyVault/vaults/kv6-oeefrt7tlykau
timeseries 2
.../providers/Microsoft.KeyVault/vaults/kv7-oeefrt7tlykau
timeseries 2
.../providers/Microsoft.KeyVault/vaults/kv8-oeefrt7tlykau
timeseries 2
.../providers/Microsoft.KeyVault/vaults/kv9-oeefrt7tlykau
timeseries 2
.../providers/Microsoft.KeyVault/vaults/kv10-oeefrt7tlykau
timeseries 2
----------------------------------------------------
Querying resources as a GROUP
----------------------------------------------------
.../providers/Microsoft.KeyVault/vaults/kv1-oeefrt7tlykau
timeseries 1
.../providers/Microsoft.KeyVault/vaults/kv2-oeefrt7tlykau
timeseries 1
.../providers/Microsoft.KeyVault/vaults/kv3-oeefrt7tlykau
timeseries 2
.../providers/Microsoft.KeyVault/vaults/kv4-oeefrt7tlykau
timeseries 2
.../providers/Microsoft.KeyVault/vaults/kv5-oeefrt7tlykau
timeseries 0
.../providers/Microsoft.KeyVault/vaults/kv6-oeefrt7tlykau
timeseries 0
.../providers/Microsoft.KeyVault/vaults/kv7-oeefrt7tlykau
timeseries 0
.../providers/Microsoft.KeyVault/vaults/kv8-oeefrt7tlykau
timeseries 1
.../providers/Microsoft.KeyVault/vaults/kv9-oeefrt7tlykau
timeseries 1
.../providers/Microsoft.KeyVault/vaults/kv10-oeefrt7tlykau
timeseries 2

I get a different number of time series depending on whether the same resource is in a batch of 1-3 or 4+ resources.

I repeated this test multiple times. The values for the "SINGLE" case never changed, while the values for the "GROUP" case changed on every call.
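
To make the difference between the two runs explicit, here is a minimal sketch that compares the time series counts from the SINGLE and GROUP output above. The function names (`observedCounts`, `mismatches`) and the shortened resource names are illustrative only; they are not part of the gist or the SDK.

```go
package main

import (
	"fmt"
	"sort"
)

// observedCounts returns the time series counts copied from the SINGLE and
// GROUP runs shown above, keyed by a shortened vault name (hypothetical keys).
func observedCounts() (single, group map[string]int) {
	names := []string{"kv1", "kv2", "kv3", "kv4", "kv5", "kv6", "kv7", "kv8", "kv9", "kv10"}
	groupCounts := []int{1, 1, 2, 2, 0, 0, 0, 1, 1, 2}
	single = make(map[string]int, len(names))
	group = make(map[string]int, len(names))
	for i, name := range names {
		single[name] = 2 // every single-resource query returned 2 time series
		group[name] = groupCounts[i]
	}
	return single, group
}

// mismatches returns the names whose count differs between the two runs,
// sorted lexicographically.
func mismatches(single, group map[string]int) []string {
	var out []string
	for name, want := range single {
		if got, ok := group[name]; !ok || got != want {
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	single, group := observedCounts()
	diff := mismatches(single, group)
	fmt.Printf("%d of %d resources differ: %v\n", len(diff), len(single), diff)
}
```

For the run shown above, seven of the ten vaults return fewer time series in the batch request than in the single-resource request.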

What did you expect or want to happen?

For the same resource, I expect azmetrics.QueryResources() to always return the same values, whether it's the only resource ID in the batch or one of the 50 supported resource IDs.

How can we reproduce it?

I created the gist https://gist.github.com/zmoog/fcede6fcbe5ba11f9275c40a58eea38d with:

Anything we should know about your environment.

Additional info

I see the same behavior when calling the API endpoint using cURL. See https://github.com/zmoog/public-notes/issues/81 for more details.

Questions:

github-actions[bot] commented 4 months ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @jlichwa @RandalliLama @schaabs.

gracewilcox commented 4 months ago

Hi @zmoog! I think your issue will be fixed by setting the QueryResourcesOptions.Top field to a higher number. If a filter is specified, the service defaults to 10 records to retrieve per resource ID in the request. This is probably the cause of your throttling issues.

I repro'ed your code locally, and when I set Top to a higher number, the TimeSeries count was consistent between the individual and group queries.

options := azmetrics.QueryResourcesOptions{
    Aggregation: ptr("Count"),
    StartTime:   ptr("2024-04-16T07:18:13.001Z"),
    EndTime:     ptr("2024-04-16T07:19:13.001Z"),
    Filter:      ptr("ActivityType eq '*' AND ActivityName eq '*' AND StatusCode eq '*' AND StatusCodeClass eq '*'"),
    Interval:    ptr("PT1M"),
    Top:         to.Ptr(int32(50)),
}
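
The snippet above calls a `ptr` helper whose definition is not shown in this thread. A plausible definition, assuming it is a small generic helper equivalent to `to.Ptr` from the SDK's `azcore/to` package, is sketched below; the actual helper in the gist may differ.

```go
package main

import "fmt"

// ptr returns a pointer to a copy of v. This is an assumed definition of the
// helper used in the snippet above, mirroring what to.Ptr does.
func ptr[T any](v T) *T {
	return &v
}

func main() {
	// The option fields are pointers, so literals have to be wrapped.
	aggregation := ptr("Count")
	top := ptr(int32(50))
	fmt.Println(*aggregation, *top)
}
```

With a helper like this, `ptr(int32(50))` and `to.Ptr(int32(50))` can be used interchangeably for the Top field.
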
github-actions[bot] commented 4 months ago

Hi @zmoog. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text "/unresolve" to remove the "issue-addressed" label and continue the conversation.

axw commented 4 months ago

@gracewilcox apologies if I'm missing something obvious, but:

  1. the docs state that the default for Top is 10, and that this applies per resource ID
  2. in the example there are fewer than 10 time series being returned per resource ID

So why should we need to increase Top if there are fewer than 10 time series per resource? And given that Top should apply per resource, why does it matter if they're queried in bulk vs. one at a time?

zmoog commented 4 months ago

@gracewilcox, thank you for replying!

The QueryResourcesOptions.Top plays a big role in the number of records in the QueryResources() response.

As you said, the Top option documentation reports:

The maximum number of records to retrieve per resource ID in the request. Valid only if the filter is specified. Defaults to 10.

However, it seems more like an option that applies to the whole batch and not per resource ID.

In my test case, I have:

All these key vaults are unused (I create them for testing), so I only get two records per resource because we only have two unique combinations of dimension values:

[Screenshot: CleanShot 2024-04-22 at 10 11 44]

If I set Top = 2, I only get two records for the whole batch:

----------------------------------------------------
Querying resources as a GROUP
----------------------------------------------------
/subscriptions/12cabcb4-86e8-404f-a3d2-1dc9982f45ca/resourceGroups/mbranca-azmetrics-test/providers/Microsoft.KeyVault/vaults/kv1-oeefrt7tlykau
timeseries 0
/subscriptions/12cabcb4-86e8-404f-a3d2-1dc9982f45ca/resourceGroups/mbranca-azmetrics-test/providers/Microsoft.KeyVault/vaults/kv2-oeefrt7tlykau
timeseries 0
/subscriptions/12cabcb4-86e8-404f-a3d2-1dc9982f45ca/resourceGroups/mbranca-azmetrics-test/providers/Microsoft.KeyVault/vaults/kv3-oeefrt7tlykau
timeseries 0
/subscriptions/12cabcb4-86e8-404f-a3d2-1dc9982f45ca/resourceGroups/mbranca-azmetrics-test/providers/Microsoft.KeyVault/vaults/kv4-oeefrt7tlykau
timeseries 0
/subscriptions/12cabcb4-86e8-404f-a3d2-1dc9982f45ca/resourceGroups/mbranca-azmetrics-test/providers/Microsoft.KeyVault/vaults/kv5-oeefrt7tlykau
timeseries 0
/subscriptions/12cabcb4-86e8-404f-a3d2-1dc9982f45ca/resourceGroups/mbranca-azmetrics-test/providers/Microsoft.KeyVault/vaults/kv6-oeefrt7tlykau
timeseries 0
/subscriptions/12cabcb4-86e8-404f-a3d2-1dc9982f45ca/resourceGroups/mbranca-azmetrics-test/providers/Microsoft.KeyVault/vaults/kv7-oeefrt7tlykau
timeseries 1
/subscriptions/12cabcb4-86e8-404f-a3d2-1dc9982f45ca/resourceGroups/mbranca-azmetrics-test/providers/Microsoft.KeyVault/vaults/kv8-oeefrt7tlykau
timeseries 1
/subscriptions/12cabcb4-86e8-404f-a3d2-1dc9982f45ca/resourceGroups/mbranca-azmetrics-test/providers/Microsoft.KeyVault/vaults/kv9-oeefrt7tlykau
timeseries 0
/subscriptions/12cabcb4-86e8-404f-a3d2-1dc9982f45ca/resourceGroups/mbranca-azmetrics-test/providers/Microsoft.KeyVault/vaults/kv10-oeefrt7tlykau
timeseries 0

So I need at least Top = 20 to get all the records.
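
The arithmetic behind that number can be checked against the GROUP outputs in this thread. The sketch below sums the per-resource time series counts for the first GROUP run (default Top) and for the Top = 2 run, then computes the smallest Top that would cover the batch if Top is shared across all resource IDs. That batch-wide behavior is inferred from the observations here, not from the documentation, and the function names are illustrative.

```go
package main

import "fmt"

// totalSeries sums the time series counts returned for a whole batch.
func totalSeries(counts []int) int {
	total := 0
	for _, c := range counts {
		total += c
	}
	return total
}

// minBatchTop returns the smallest Top that covers a batch, assuming Top is
// applied to the whole batch and every resource returns seriesPerResource
// time series.
func minBatchTop(resources, seriesPerResource int) int {
	return resources * seriesPerResource
}

func main() {
	// Counts copied from the GROUP runs shown in this thread.
	defaultTopRun := []int{1, 1, 2, 2, 0, 0, 0, 1, 1, 2} // Top left at its default of 10
	topTwoRun := []int{0, 0, 0, 0, 0, 0, 1, 1, 0, 0}     // Top = 2

	fmt.Println("default Top run total:", totalSeries(defaultTopRun))
	fmt.Println("Top = 2 run total:   ", totalSeries(topTwoRun))
	fmt.Println("Top needed for 10 resources x 2 series:", minBatchTop(10, 2))
}
```

In both runs the batch total equals the Top value in effect (10 and 2), which is consistent with the limit being applied to the batch rather than to each resource ID.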

However, when the resources are used, the number of unique combinations of dimension values varies greatly.

For example, if I start using one of the key vaults, I get 4-5 records instead of 2 for each resource ID:

[Screenshot: CleanShot 2024-04-22 at 10 44 52]

----------------------------------------------------
Querying resources as a GROUP
----------------------------------------------------
/subscriptions/12cabcb4-86e8-404f-a3d2-1dc9982f45ca/resourceGroups/mbranca-azmetrics-test/providers/Microsoft.KeyVault/vaults/kv1-oeefrt7tlykau
timeseries 5
/subscriptions/12cabcb4-86e8-404f-a3d2-1dc9982f45ca/resourceGroups/mbranca-azmetrics-test/providers/Microsoft.KeyVault/vaults/kv2-oeefrt7tlykau
timeseries 4
/subscriptions/12cabcb4-86e8-404f-a3d2-1dc9982f45ca/resourceGroups/mbranca-azmetrics-test/providers/Microsoft.KeyVault/vaults/kv3-oeefrt7tlykau
timeseries 4
/subscriptions/12cabcb4-86e8-404f-a3d2-1dc9982f45ca/resourceGroups/mbranca-azmetrics-test/providers/Microsoft.KeyVault/vaults/kv4-oeefrt7tlykau
timeseries 4
/subscriptions/12cabcb4-86e8-404f-a3d2-1dc9982f45ca/resourceGroups/mbranca-azmetrics-test/providers/Microsoft.KeyVault/vaults/kv5-oeefrt7tlykau
timeseries 4
/subscriptions/12cabcb4-86e8-404f-a3d2-1dc9982f45ca/resourceGroups/mbranca-azmetrics-test/providers/Microsoft.KeyVault/vaults/kv6-oeefrt7tlykau
timeseries 4
/subscriptions/12cabcb4-86e8-404f-a3d2-1dc9982f45ca/resourceGroups/mbranca-azmetrics-test/providers/Microsoft.KeyVault/vaults/kv7-oeefrt7tlykau
timeseries 4
/subscriptions/12cabcb4-86e8-404f-a3d2-1dc9982f45ca/resourceGroups/mbranca-azmetrics-test/providers/Microsoft.KeyVault/vaults/kv8-oeefrt7tlykau
timeseries 4
/subscriptions/12cabcb4-86e8-404f-a3d2-1dc9982f45ca/resourceGroups/mbranca-azmetrics-test/providers/Microsoft.KeyVault/vaults/kv9-oeefrt7tlykau
timeseries 4
/subscriptions/12cabcb4-86e8-404f-a3d2-1dc9982f45ca/resourceGroups/mbranca-azmetrics-test/providers/Microsoft.KeyVault/vaults/kv10-oeefrt7tlykau
timeseries 4

I guess it's impossible to calculate the exact number of records because it depends on the unique combinations of the dimensions, which vary depending on the Azure service.
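
One possible workaround while Top behaves as a batch-wide limit is to split the resource IDs into smaller batches, sized from an assumed upper bound on the number of time series per resource. The sketch below is not confirmed by the maintainers; `batchSize`, `chunk`, and the upper bound of 5 series per resource (taken from the run above) are all assumptions for illustration.

```go
package main

import "fmt"

// batchSize returns how many resource IDs fit in one request if Top is shared
// by the whole batch and each resource may return up to maxSeriesPerResource
// time series. It never returns less than 1.
func batchSize(top, maxSeriesPerResource int) int {
	if maxSeriesPerResource < 1 {
		return 1
	}
	if n := top / maxSeriesPerResource; n >= 1 {
		return n
	}
	return 1
}

// chunk splits ids into consecutive slices of at most size elements.
func chunk(ids []string, size int) [][]string {
	if size < 1 {
		size = 1
	}
	var out [][]string
	for start := 0; start < len(ids); start += size {
		end := start + size
		if end > len(ids) {
			end = len(ids)
		}
		out = append(out, ids[start:end])
	}
	return out
}

func main() {
	// Shortened, hypothetical resource names standing in for full resource IDs.
	ids := make([]string, 10)
	for i := range ids {
		ids[i] = fmt.Sprintf("kv%d", i+1)
	}

	// Default Top of 10 and an assumed maximum of 5 series per resource.
	size := batchSize(10, 5)
	for _, batch := range chunk(ids, size) {
		fmt.Println(batch) // each batch would go into its own QueryResources call
	}
}
```

The trade-off is more requests per collection cycle, and the bound on series per resource still has to be estimated per Azure service.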

@gracewilcox, what strategy do you recommend for setting the Top value, and what's the maximum number allowed?

gracewilcox commented 3 months ago

Hi @axw and @zmoog! Thank you for the detailed replies. The Top value is supposed to set the maximum number of records per resource ID, but as you discovered, it currently does not.

The service team is aware of the issue and is currently deploying a fix. Will let you know as soon as the bug is fixed. Thank you for your patience!

github-actions[bot] commented 3 months ago

Hi @zmoog, since you haven’t asked that we /unresolve the issue, we’ll close this out. If you believe further discussion is needed, please add a comment /unresolve to reopen the issue.

axw commented 3 months ago

/unresolve

github-actions[bot] commented 3 months ago

Hi @axw, only the original author of the issue can ask that it be unresolved. Please open a new issue with your scenario and details if you would like to discuss this topic with the team.

github-actions[bot] commented 3 months ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @AzmonActionG @AzmonAlerts @AzMonEssential @AzmonLogA @dadunl @SameergMS.

zmoog commented 3 months ago

If I run the tests at https://gist.github.com/zmoog/fcede6fcbe5ba11f9275c40a58eea38d I still get the same result.

@gracewilcox, was the service updated with the fix? Does the fix require a new api-version?

gracewilcox commented 3 months ago

@ToddKingMSFT, do you have guidance for this scenario?