Kubernetes Service metrics values not matching with description in metrics documentation

sushant-jaiswal commented 6 years ago

The actual values returned by REST API calls don't match the descriptions in the Kubernetes Service metrics documentation.

For example, the metric "kube_node_status_condition" should return "Statuses for various node conditions", but the REST API call returns just a number (e.g. 18). This number does not make any sense for a status metric.

Another example: the metric "kube_pod_status_phase" is supposed to return "Number of pods by phase", but it just returns the total number of pods without any phase, which makes the value the same as what we get from the metric "kube_pod_status_ready". When we check the Health Preview for the cluster, we can see the number of pods in different phases (Running, Pending and Unknown).

We used this link to verify the values from REST API calls for various metrics.

Example for GET call - GET https://management.azure.com/%2Fsubscriptions%2F1248268d-051f-432a-8e63-e83a9d36e776%2FresourceGroups%2Faks-int-resource-group%2Fproviders%2FMicrosoft.ContainerService%2FmanagedClusters%2Faks-int-cluster%2F/providers/microsoft.insights/metrics?api-version=2018-01-01&metricnames=kube_node_status_condition

Any help is appreciated. Thanks.

milismsft commented 6 years ago

@zqingqing1 @mboersma

sushant-jaiswal commented 6 years ago

@zqingqing1 @mboersma There are just 5 metrics exposed for Azure Kubernetes Service and a few of them seem to have issues. Without a couple of the others I am not able to proceed with my work. Could you please help me get answers for these metrics?

sgoings commented 6 years ago

@sushant-jaiswal - metrics exposed by Kubernetes can be confusing, and the Azure metrics API on top of it makes it even more confusing!

Since the metrics [kube_pod_status_ready, kube_node_status_condition, kube_pod_status_phase] have multiple dimensions, you need to expose/inspect those dimensions to get the view of those metrics. By default on REST queries to the Azure metrics API you'll just get "Total" for the metrics devoid of dimensions, which isn't really the part you want to inspect.

Here's how you can get all the dimensions for kube_node_status_condition (as specified in the far-right column of the exposed AKS metrics list):

metric_names="kube_node_status_condition"
metric_filter="status eq '*' and node eq '*' and condition eq '*'"
curl -G -H "Authorization: Bearer ${TOKEN}" -H "Content-Type: application/json" \
  "https://management.azure.com/${RESOURCE_URI}/providers/microsoft.insights/metrics?api-version=2018-01-01" \
  --data-urlencode "metricnames=${metric_names}" \
  --data-urlencode "\$filter=${metric_filter}"
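For anyone calling the API from code rather than curl, here is a minimal Python sketch of how the same query string could be assembled. It only builds the URL and sends nothing. The function name build_metrics_url and the placeholder resource URI are illustrative assumptions, not part of any Azure SDK.

```python
from urllib.parse import quote, urlencode

API_VERSION = "2018-01-01"  # same api-version as the curl call above


def build_metrics_url(resource_uri, metric_names, metric_filter):
    """Return the metrics URL with the dimension filter percent-encoded.

    Hypothetical helper for illustration; it mirrors the curl flags above.
    """
    query = urlencode(
        {
            "api-version": API_VERSION,
            "metricnames": metric_names,
            "$filter": metric_filter,
        },
        quote_via=quote,  # encode spaces as %20 rather than '+'
    )
    base = "https://management.azure.com/" + resource_uri.strip("/")
    return base + "/providers/microsoft.insights/metrics?" + query


# Placeholder resource URI; substitute your own subscription, group and cluster.
url = build_metrics_url(
    "subscriptions/SUB_ID/resourceGroups/RG/providers/"
    "Microsoft.ContainerService/managedClusters/CLUSTER",
    "kube_node_status_condition",
    "status eq '*' and node eq '*' and condition eq '*'",
)
print(url)
```

The wildcard filter asks for one time series per (status, node, condition) combination instead of a single aggregated total.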

However, once you venture down that road, more confusion will probably ensue. (Because it definitely was confusing while I wrote this up!) The kube_node_status_condition metric is exposed via a behind-the-scenes scrape of kube-state-metrics for the cluster in question. This metric is similar to what you'd see in the conditions section of the output of kubectl get no <node> -o json:

"conditions": [
            {
                "lastHeartbeatTime": "2018-08-21T08:13:45Z",
                "lastTransitionTime": "2018-08-21T08:13:45Z",
                "message": "RouteController created a route",
                "reason": "RouteCreated",
                "status": "False",
                "type": "NetworkUnavailable"
            },
            {
                "lastHeartbeatTime": "2018-09-05T22:49:50Z",
                "lastTransitionTime": "2018-08-21T08:13:14Z",
                "message": "kubelet has sufficient disk space available",
                "reason": "KubeletHasSufficientDisk",
                "status": "False",
                "type": "OutOfDisk"
            },
            {
                "lastHeartbeatTime": "2018-09-05T22:49:50Z",
                "lastTransitionTime": "2018-08-21T08:13:14Z",
                "message": "kubelet has sufficient memory available",
                "reason": "KubeletHasSufficientMemory",
                "status": "False",
                "type": "MemoryPressure"
            },
            {
                "lastHeartbeatTime": "2018-09-05T22:49:50Z",
                "lastTransitionTime": "2018-08-21T08:13:14Z",
                "message": "kubelet has no disk pressure",
                "reason": "KubeletHasNoDiskPressure",
                "status": "False",
                "type": "DiskPressure"
            },
            {
                "lastHeartbeatTime": "2018-09-05T22:49:50Z",
                "lastTransitionTime": "2018-08-21T08:13:44Z",
                "message": "kubelet is posting ready status. AppArmor enabled",
                "reason": "KubeletReady",
                "status": "True",
                "type": "Ready"
            }
]

In my opinion, that's already confusing. The fun only continues...

On the metric there's a node dimension (which node is reporting) and a condition dimension with potential values [Ready, DiskPressure, MemoryPressure, OutOfDisk, NetworkUnavailable]. The condition is further refined by a status dimension, which can have values [true, false] (kube-state-metrics also defines unknown). Finally, there's a binary metric value [0,1] which corresponds to: "the last statement is either false[0] or true[1]." So you can have, in Prometheus parlance:

kube_node_status_condition{condition="Ready", status="true", node="aks-nodepool1-39410182-0"} 1

which means that the node: aks-nodepool1-39410182-0's "Ready" condition with "true" status is reporting "1" as the actual metric's value, which in turn means the node is ready.

The converse situation is:

kube_node_status_condition{condition="Ready", status="true", node="aks-nodepool1-39410182-0"} 0

which means that the node: aks-nodepool1-39410182-0's "Ready" condition with "true" status is reporting "0" as the actual metric's value, which in turn means the node is not ready.
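To make that reading concrete, here is a small Python sketch of the rule "the status label whose sample value is 1 is the one that currently holds". The sample list is made up for illustration, and current_status is a hypothetical helper, not something kube-state-metrics or Azure provide.

```python
NODE = "aks-nodepool1-39410182-0"

# Hypothetical samples: (labels, value) pairs in the shape described above.
samples = [
    ({"node": NODE, "condition": "Ready", "status": "true"}, 1),
    ({"node": NODE, "condition": "Ready", "status": "false"}, 0),
    ({"node": NODE, "condition": "OutOfDisk", "status": "true"}, 0),
    ({"node": NODE, "condition": "OutOfDisk", "status": "false"}, 1),
]


def current_status(samples, node, condition):
    """Return the status label whose sample value is 1, or None if none is."""
    for labels, value in samples:
        if (
            labels["node"] == node
            and labels["condition"] == condition
            and value == 1
        ):
            return labels["status"]
    return None


print(current_status(samples, NODE, "Ready"))      # true  -> node is ready
print(current_status(samples, NODE, "OutOfDisk"))  # false -> not out of disk
```

In other words, a value of 1 on the status="false" series of OutOfDisk is good news, while a value of 0 on the status="true" series of Ready is not.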

To get you a little accelerated in the direction I think you're heading... here's a query that would give you data to see if nodes are in a NotReady state over time:

metric_filter="status eq 'true' and node eq '${NODE}' and condition eq 'Ready'"
metric_names="kube_node_status_condition"

curl -G -H "Authorization: Bearer ${TOKEN}" -H "Content-Type: application/json" \
  "https://management.azure.com/${RESOURCE_URI}/providers/microsoft.insights/metrics?api-version=2018-01-01" \
  --data-urlencode "metricnames=${metric_names}" \
  --data-urlencode "\$filter=${metric_filter}"
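With that filter, the minutes in which the node was not ready are the data points whose value is 0. Here is a minimal Python sketch of that check, assuming each data point is a dict with timeStamp and total keys (the shape the metrics API returns for the Total aggregation). The sample points and the name not_ready_timestamps are made up for illustration.

```python
def not_ready_timestamps(data_points):
    """Return timestamps where the Ready/true series reported 0.

    Points with no 'total' (no data for that minute) are skipped rather
    than being treated as not ready.
    """
    return [p["timeStamp"] for p in data_points if p.get("total") == 0]


# Hypothetical data points for one node's condition=Ready, status=true series.
points = [
    {"timeStamp": "2018-09-07T03:58:00Z", "total": 1.0},
    {"timeStamp": "2018-09-07T03:59:00Z", "total": 0.0},
    {"timeStamp": "2018-09-07T04:00:00Z"},
    {"timeStamp": "2018-09-07T04:01:00Z", "total": 1.0},
]

print(not_ready_timestamps(points))
```

Whether a missing point should count as an outage is a judgment call; this sketch leaves those minutes out.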

sushant-jaiswal commented 6 years ago

@sgoings Thanks for taking the effort to reply with a detailed explanation. I was able to get the metrics and dimensions you mentioned. The curl command gives me JSON like the below:

{
  "cost": 0,
  "timespan": "2018-09-07T03:58:04Z/2018-09-07T04:58:04Z",
  "interval": "PT1M",
  "value": [
    {
      "id": "/providers/Microsoft.Insights/metrics/kube_node_status_condition",
      "type": "Microsoft.Insights/metrics",
      "name": { "value": "kube_node_status_condition", "localizedValue": "Statuses for various node conditions" },
      "unit": "Count",
      "timeseries": [
        {
          "metadatavalues": [
            { "name": { "value": "condition", "localizedValue": "condition" }, "value": "NetworkUnavailable" },
            { "name": { "value": "node", "localizedValue": "node" }, "value": "aks-agentpool-10913562-1" },
            { "name": { "value": "status", "localizedValue": "status" }, "value": "false" }
          ],
          "data": [
            { "timeStamp": "2018-09-07T03:58:00Z", "total": 1.0 },
            { "timeStamp": "2018-09-07T03:59:00Z", "total": 1.0 }
          ]
        },
        {
          "metadatavalues": [
            { "name": { "value": "condition", "localizedValue": "condition" }, "value": "OutOfDisk" },
            { "name": { "value": "node", "localizedValue": "node" }, "value": "aks-agentpool-10913562-2" },
            { "name": { "value": "status", "localizedValue": "status" }, "value": "false" }
          ],
          "data": [
            { "timeStamp": "2018-09-07T03:58:00Z", "total": 1.0 },
            { "timeStamp": "2018-09-07T03:59:00Z", "total": 1.0 }
          ]
        }
      ]
    }
  ],
  "namespace": "Microsoft.ContainerService/managedClusters",
  "resourceregion": "westeurope"
}

I just wanted to check whether the binary metric value you mentioned is the one shown as "total": 1.0 in my JSON, or whether the binary metric value is missing from the JSON.

So, with my JSON above, am I right to say that node aks-agentpool-10913562-1's "NetworkUnavailable" condition with "false" status is reporting "1.0" as the actual metric's value, which in turn means the node's network is available? And that node aks-agentpool-10913562-2's "OutOfDisk" condition with "false" status is reporting "1.0" as the actual metric's value, which in turn means the node is not out of disk?
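For reference, a response shaped like the JSON above can be flattened into one row per (node, condition, status, timestamp) with a few lines of Python. This is only a sketch: it assumes that "total" carries the 0/1 sample value, which is the reading being asked about here rather than something confirmed in this thread. The name flatten_timeseries is illustrative, and the embedded response is a trimmed copy of the one above.

```python
import json

# Trimmed copy of the response above (one series, one data point).
RESPONSE = json.loads("""
{
  "value": [{
    "name": {"value": "kube_node_status_condition"},
    "timeseries": [{
      "metadatavalues": [
        {"name": {"value": "condition"}, "value": "NetworkUnavailable"},
        {"name": {"value": "node"}, "value": "aks-agentpool-10913562-1"},
        {"name": {"value": "status"}, "value": "false"}
      ],
      "data": [{"timeStamp": "2018-09-07T03:58:00Z", "total": 1.0}]
    }]
  }]
}
""")


def flatten_timeseries(response):
    """Return (node, condition, status, timestamp, total) rows."""
    rows = []
    for metric in response.get("value", []):
        for series in metric.get("timeseries", []):
            # Turn the metadatavalues list into a {dimension: value} dict.
            dims = {
                m["name"]["value"]: m["value"]
                for m in series.get("metadatavalues", [])
            }
            for point in series.get("data", []):
                rows.append((
                    dims.get("node"),
                    dims.get("condition"),
                    dims.get("status"),
                    point["timeStamp"],
                    point.get("total"),  # assumed to be the 0/1 sample value
                ))
    return rows


for row in flatten_timeseries(RESPONSE):
    print(row)
```

Under that assumption, a row with condition NetworkUnavailable, status false and total 1.0 reads as "NetworkUnavailable is false", i.e. the network is available.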

sushant-jaiswal commented 5 years ago

Thank you @sgoings. With some changes on our end, we were able to proceed.

yaohaizh commented 5 years ago

Closing this issue, as it concerns documentation on the service side. @sushant-jaiswal feel free to reopen it if you have more questions.

Azure / azure-libraries-for-java

Kubernetes Service metrics values not matching with description in metrics documentation #563