Closed sushant-jaiswal closed 5 years ago
@zqingqing1 @mboersma There are just 5 metrics exposed for Azure Kubernetes, and a few of them seem to have issues. With a couple of the others, I am not able to proceed with my work. Could you please help me get answers for these metrics?
@sushant-jaiswal - metrics exposed by Kubernetes can be confusing, and the Azure metrics API on top of it makes it even more confusing!
Since the metrics [kube_pod_status_ready, kube_node_status_condition, kube_pod_status_phase] have multiple dimensions, you need to expose and inspect those dimensions to get a useful view of them. By default, REST queries to the Azure metrics API return only a "Total" for each metric, with no dimensions, which isn't really the part you want to inspect.
Here's how you can get all the dimensions for kube_node_status_condition (as specified in the far-right column of the exposed AKS metrics list):
metric_names="kube_node_status_condition"
metric_filter="status eq '*' and node eq '*' and condition eq '*'"
curl -G -H "Authorization: Bearer ${TOKEN}" -H "Content-Type: application/json" \
"https://management.azure.com/${RESOURCE_URI}/providers/microsoft.insights/metrics?api-version=2018-01-01" \
--data-urlencode "metricnames=${metric_names}" \
--data-urlencode "\$filter=${metric_filter}"
However, once you venture down that road, more confusion will probably ensue. (It was definitely confusing while I wrote this up!) The kube_node_status_condition metric is exposed via a behind-the-scenes scrape of kube-state-metrics for the cluster in question. It is similar to what you'd see in the conditions section of kubectl get no <node> -o json:
"conditions": [
  {
    "lastHeartbeatTime": "2018-08-21T08:13:45Z",
    "lastTransitionTime": "2018-08-21T08:13:45Z",
    "message": "RouteController created a route",
    "reason": "RouteCreated",
    "status": "False",
    "type": "NetworkUnavailable"
  },
  {
    "lastHeartbeatTime": "2018-09-05T22:49:50Z",
    "lastTransitionTime": "2018-08-21T08:13:14Z",
    "message": "kubelet has sufficient disk space available",
    "reason": "KubeletHasSufficientDisk",
    "status": "False",
    "type": "OutOfDisk"
  },
  {
    "lastHeartbeatTime": "2018-09-05T22:49:50Z",
    "lastTransitionTime": "2018-08-21T08:13:14Z",
    "message": "kubelet has sufficient memory available",
    "reason": "KubeletHasSufficientMemory",
    "status": "False",
    "type": "MemoryPressure"
  },
  {
    "lastHeartbeatTime": "2018-09-05T22:49:50Z",
    "lastTransitionTime": "2018-08-21T08:13:14Z",
    "message": "kubelet has no disk pressure",
    "reason": "KubeletHasNoDiskPressure",
    "status": "False",
    "type": "DiskPressure"
  },
  {
    "lastHeartbeatTime": "2018-09-05T22:49:50Z",
    "lastTransitionTime": "2018-08-21T08:13:44Z",
    "message": "kubelet is posting ready status. AppArmor enabled",
    "reason": "KubeletReady",
    "status": "True",
    "type": "Ready"
  }
]
In my opinion, that's already confusing. The fun only continues...
On the metric there's a node dimension (which node is reporting) and a condition dimension with potential values [Ready, DiskPressure, MemoryPressure, OutOfDisk, NetworkUnavailable]. The condition is further refined by a status dimension, which can have the values [true, false] (kube-state-metrics also defines unknown). Finally, there is a binary metric value [0, 1], which corresponds to: "the last statement is either false [0] or true [1]." So you can have, in Prometheus parlance:
kube_node_status_condition{condition="Ready",status="true",node="aks-nodepool1-39410182-0"} 1
which means that the "Ready" condition of node aks-nodepool1-39410182-0, with status "true", is reporting "1" as the actual metric value, which in turn means the node is ready.
The converse situation is:
kube_node_status_condition{condition="Ready",status="true",node="aks-nodepool1-39410182-0"} 0
which means that the same condition with "true" status is reporting "0" as the actual metric value, which in turn means the node is not ready.
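To make that reading rule concrete, here is a minimal sketch as a shell function. The name interpret_condition is hypothetical (it is not part of any Azure or Kubernetes tooling), and it assumes the value arrives as 0 or 1, possibly with a trailing ".0":

```shell
# Hypothetical helper: read one kube_node_status_condition sample.
# Arguments: condition label, status label, metric value (0 or 1).
interpret_condition() {
  # Strip a trailing ".0" so a float such as "1.0" is treated like "1".
  local condition="$1" status="$2" value="${3%.0}"
  case "$value" in
    1) echo "${condition} is ${status}" ;;      # the labelled statement holds
    0) echo "${condition} is not ${status}" ;;  # the labelled statement does not hold
    *) echo "unexpected value: ${3}" >&2; return 1 ;;
  esac
}

interpret_condition Ready true 1   # prints: Ready is true
interpret_condition Ready true 0   # prints: Ready is not true
```

Note that a 0 only says the labelled status does not apply; it does not by itself say which other status does.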
To get you a little accelerated in the direction I think you're heading... here's a query that would give you data to see if nodes are in a NotReady state over time:
metric_filter="status eq 'true' and node eq '${NODE}' and condition eq 'Ready'"
metric_names="kube_node_status_condition"
curl -G -H "Authorization: Bearer ${TOKEN}" -H "Content-Type: application/json" \
"https://management.azure.com/${RESOURCE_URI}/providers/microsoft.insights/metrics?api-version=2018-01-01" \
--data-urlencode "metricnames=${metric_names}" \
--data-urlencode "\$filter=${metric_filter}"
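The two queries above differ only in their $filter string. As a rough sketch, a small helper can assemble that string from dimension=value pairs. The name build_metric_filter is my own invention (not an Azure CLI or API feature), and it assumes every clause is a simple eq comparison joined with and, as in the filters shown here:

```shell
# Hypothetical helper: join "dimension=value" pairs into a $filter string.
build_metric_filter() {
  local filter="" pair name value
  for pair in "$@"; do
    name="${pair%%=*}"    # text before the first "="
    value="${pair#*=}"    # text after the first "="
    # Add " and " between clauses, but not before the first one.
    filter="${filter:+${filter} and }${name} eq '${value}'"
  done
  printf '%s\n' "$filter"
}

# Example node name taken from the discussion above.
NODE="aks-nodepool1-39410182-0"
metric_filter="$(build_metric_filter status=true "node=${NODE}" condition=Ready)"
echo "$metric_filter"
```

Quote any argument that contains a wildcard, for example build_metric_filter 'status=*' 'node=*' 'condition=*', so the shell does not try to expand it as a file pattern.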
@sgoings Thanks for taking the effort to reply with a detailed explanation. I was able to get the metrics and dimensions you mentioned. The curl command gives me JSON like the following:
{
"cost": 0,
"timespan": "2018-09-07T03:58:04Z/2018-09-07T04:58:04Z",
"interval": "PT1M",
"value": [{
"id": "
I just wanted to check: is the binary metric value you mentioned the one shown as "total": 1.0 in my JSON, or is the binary metric value missing from the JSON?
So, with my JSON above, am I right to say that node aks-agentpool-10913562-1's "NetworkUnavailable" condition with "false" status is reporting "1.0" as the actual metric value, which in turn means the node's network is available? And that node aks-agentpool-10913562-2's "OutOfDisk" condition with "false" status is reporting "1.0" as the actual metric value, which in turn means the node is not out of disk?
Thank you @sgoings. With some changes at our end, we were able to proceed.
Closing this issue, as it is a documentation matter on the service side. @sushant-jaiswal, feel free to reopen it if you have more questions here.
The actual values returned by REST API calls don't match the descriptions in the Kubernetes Service metrics documentation.
For example, the metric "kube_node_status_condition" should return "Statuses for various node conditions", but the REST API call returns just a number (e.g. 18). This number does not make sense for a status metric.
As another example, the metric "kube_pod_status_phase" is supposed to return "Number of pods by phase", but it returns just the total number of pods without status, which makes the value the same as what we get from the metric "kube_pod_status_ready". When we check the Health Preview for the cluster, we can see the number of pods in different phases (Running, Pending and Unknown).
We used this link to verify the values from REST API calls for various metrics.
Example for GET call - GET https://management.azure.com/%2Fsubscriptions%2F1248268d-051f-432a-8e63-e83a9d36e776%2FresourceGroups%2Faks-int-resource-group%2Fproviders%2FMicrosoft.ContainerService%2FmanagedClusters%2Faks-int-cluster%2F/providers/microsoft.insights/metrics?api-version=2018-01-01&metricnames=kube_node_status_condition
Any help is appreciated. Thanks.