@Blackbaud-ChrisBlythe - Thank you for your query, our team will look into it and get back to you at the earliest.
@Blackbaud-ChrisBlythe - What scraping mechanism are you using? Is it node level or cluster level? Also, is it through pod annotations or k8s services?
@vishiy Thanks for the response... We have cluster and node level Prometheus data collection enabled with monitor_kubernetes_pods also enabled to scrape metrics from our app/service containers. Here is a condensed sample (with comments removed) of our azure-monitor-configuration.yaml for reference.
kind: ConfigMap
apiVersion: v1
metadata:
  name: container-azm-ms-agentconfig
  namespace: kube-system
data:
  schema-version: v1
  config-version: ver1
  log-data-collection-settings: |-
    [log_collection_settings]
      [log_collection_settings.stdout]
        enabled = true
      [log_collection_settings.stderr]
        enabled = true
      [log_collection_settings.env_var]
        enabled = true
  prometheus-data-collection-settings: |-
    [prometheus_data_collection_settings.cluster]
      interval = "1m"
      monitor_kubernetes_pods = true
    [prometheus_data_collection_settings.node]
      interval = "1m"
Thanks. Can you share the status of the agent's replica pod?
@vishiy Apologies for the delay... Upon further review, it looks like the replica pod is restarting constantly due to OOMKiller. The current memory request/limit is 250Mi/750Mi.
Name:                 omsagent-rs-7b778d75cc-rmb7h
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 aks-nodepool1-42391497-vmss00000l/10.100.3.90
Start Time:           Fri, 23 Oct 2020 13:03:22 -0400
Labels:               kubernetes.azure.com/managedby=aks
                      pod-template-hash=7b778d75cc
                      rsName=omsagent-rs
Annotations:          WSID: ZjYzNzc5ODktOWNmMC00Mzg0LTg4NWQtZGU0ZDQzNGMzMjA0
                      agentVersion: 1.10.0.1
                      dockerProviderVersion: 10.1.0-0
                      schema-versions: v1
Status:               Running
IP:                   10.100.3.106
Controlled By:        ReplicaSet/omsagent-rs-7b778d75cc
Containers:
  omsagent:
    Container ID:   docker://e36329e68f665eec8d867e8671bee0a8c6f562a07023fdc04bccd162efb3feb4
    Image:          mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod10052020
    Image ID:       docker-pullable://mcr.microsoft.com/azuremonitor/containerinsights/ciprod@sha256:532c608ad5e68f78ec73ca95ea5d985edd80aada10a8fcd9afd04caee10218de
    Ports:          25225/TCP, 25224/UDP, 25227/TCP
    Host Ports:     0/TCP, 0/UDP, 0/TCP
    State:          Running
      Started:      Mon, 26 Oct 2020 12:19:21 -0400
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    143
      Started:      Mon, 26 Oct 2020 12:15:24 -0400
      Finished:     Mon, 26 Oct 2020 12:19:19 -0400
    Ready:          True
    Restart Count:  661
    Limits:
      cpu:     1
      memory:  750Mi
    Requests:
      cpu:     150m
      memory:  250Mi
    Liveness:  exec [/bin/bash -c /opt/livenessprobe.sh] delay=60s timeout=1s period=60s #success=1 #failure=3
Is it possible to increase the memory allocation in the azure-monitor-configuration.yaml somehow? Or do you recommend some other means of tweaking these settings? Also, I would assume the recommendations differ for a cluster that's being newly deployed versus an existing cluster?
@femsulu Can we switch this back to "awaiting-product-team-response"?
@vishiy @femsulu Apologies for pestering... still hoping for some guidance here...
Sorry for the delay @Blackbaud-BenLambert - following up on this internally. Will update as soon as possible.
It's mostly because 'monitor_kubernetes_pods' == true. How many pods do you have in the whole cluster that we are scraping? Also, do you know the metric volume per scrape?
@vishiy I already presumed this was due to the volume of metrics being scraped from pods (which we purposely enabled with 'monitor_kubernetes_pods' == true).
To provide some rough estimates, we're running approximately 950 pods. And based on some mining in Log Analytics, it looks like we're peaking at about 26k metrics per minute between the crashes/restarts (which roughly equates to 28 metrics per pod). I imagine this is a rather large metric volume for a single container/process to handle, which raises the question: why couldn't this metric-scraping work be spread out among the other omsagent containers that are already distributed across the nodes in the AKS cluster? That seems like a much more scalable solution...
Hopefully, something along those lines is in the works. In the meantime, however, the most obvious workaround is to allocate more memory to that container (omsagent-rs-*) so it isn't reaped by the OOM killer. So... Is it possible to increase the memory allocation in the azure-monitor-configuration.yaml somehow? Or do you recommend some other means of tweaking these settings? Also, I would assume the recommendations differ for a cluster that's being newly deployed versus an existing cluster?
Thanks again...
@Blackbaud-ChrisBlythe - The reason we do this from the replica (and not from the daemonset agent on each node) is to minimize watch calls on the API server (pods are discovered through API server watches); that load grows with the number of nodes and can render the API server and the cluster inoperable. That said, we are looking into watching through the local kubelet (rather than the API server) so that we can avoid this scenario of a single pod handling the entire cluster.
That said, in what other ways are your metrics exposed? A node service on each node? We do have ways to scrape node URLs, which happens from each node. Those need to be static URLs, rather than us watching for pods and inferring the endpoints automatically.
Since we are a managed add-on in AKS, we don't support changing limits. To increase limits in hyper-scale scenarios, the only option is to use our Helm chart, where you can specify limits and requests.
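For readers looking for a concrete starting point, here is a minimal sketch of a Helm values override that raises the replica pod's memory limit. The value paths (omsagent.resources.deployment, etc.) and the numbers are assumptions, not taken from this thread; verify the exact key names against the chart's values.yaml before applying.

# values-override.yaml -- illustrative sketch only; key names are assumed,
# confirm against the azuremonitor-containers chart's values.yaml.
omsagent:
  resources:
    deployment:          # assumed to map to the omsagent-rs replica pod
      requests:
        cpu: 150m
        memory: 500Mi    # example bump from the 250Mi seen above
      limits:
        cpu: 1
        memory: 1500Mi   # example bump from the 750Mi seen above
# Applied with something along the lines of:
#   helm upgrade --install <release> <chart path or repo> -f values-override.yaml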
Thanks for the quick response @vishiy
Ah... additional load on the API server due to pod/container discovery... that makes sense.
Our primary goal is to scrape app/runtime metrics from the pods/containers, so scraping static, node-level URLs doesn't seem to accomplish that unless the target is somehow acting as a proxy or aggregator to all the pods on a particular node.
I'm curious about the HELM chart(s) you referenced.... Can you point me to those in the related docs or github?
Also, we are still scraping at 1m intervals... do you think decreasing the frequency would yield more stability? I guess another way to ask that question is... can cycles of the omsagent/telegraf collection process overlap if they take too long to complete?
@Blackbaud-ChrisBlythe - I think the issue is due to the number of connections made every minute (to those pods) to scrape, which is in direct proportion to the number of pods watched and scraped across the entire cluster, rather than the metric volume. I don't think decreasing the frequency (to, say, 3 or 5 minutes) would help, but you can try.
Below is the Helm chart - https://github.com/microsoft/Docker-Provider/tree/ci_prod/charts/azuremonitor-containers
Also, do you have these metrics exposed via a k8s service/endpoint that can be configured as the scrape target, rather than scraping every pod?
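To illustrate the alternatives mentioned above (scraping fixed node URLs or k8s services instead of discovering every pod), below is a sketch of the relevant prometheus-data-collection-settings for the container-azm-ms-agentconfig ConfigMap. The service name, namespace, and ports are placeholders, and the setting names should be checked against the current Container Insights documentation.

prometheus-data-collection-settings: |-
  [prometheus_data_collection_settings.cluster]
    interval = "1m"
    # Scrape a fixed set of services instead of discovering every pod.
    kubernetes_services = ["http://example-svc.example-ns:8080/metrics"]
    monitor_kubernetes_pods = false
  [prometheus_data_collection_settings.node]
    interval = "1m"
    # Node-level URLs are scraped by the daemonset agent on each node;
    # $NODE_IP is replaced with each node's IP at runtime.
    urls = ["http://$NODE_IP:9103/metrics"]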
It seems that we have provided feedback to the customer, and we have not received similar feedback from others. If the issue persists, we advise opening a support ticket.
Hi @vishiy I see that this issue has been open for a very long time. Are you still experiencing this issue? Thanks, Abby
We see that unfortunately, we have not been able to address this issue in a timely manner. The scope of our feedback channel here on GitHub covers specific documentation fixes. If you are still experiencing this issue, we can help redirect you to the right support channel to get an answer to your question: https://docs.microsoft.com/en-us/answers/topics/24223/azure-monitor.html We are closing this issue at this time.
We've configured Prometheus metric scraping in our AKS cluster(s) to capture metrics from our pods/containers, but have noticed that metrics are often missing for large chunks of time (even though the scrape interval is set to 60s). Just for reference, we have ~1000 pods running across 20 nodes, and several of the underlying services are generating a large number of dimensional metrics. I noticed that telegraf is being used under the covers and suspect it is dropping metrics due to the sheer volume of data and its internal buffer configuration (having run into similar situations using it in the past).
Do you have any recommendations or guidance for investigating this further or ultimately resolving it (perhaps by tweaking the telegraf buffer configuration, etc.)?