@Blackbaud-ChrisBlythe - Thank you for your query, our team will look into it and get back to you at the earliest.
@Blackbaud-ChrisBlythe - What scraping mechanism are you using? Is it node level or cluster level? Also, is it through pod annotations or k8s services?
@vishiy Thanks for the response... We have cluster and node level Prometheus data collection enabled with monitor_kubernetes_pods also enabled to scrape metrics from our app/service containers. Here is a condensed sample (with comments removed) of our azure-monitor-configuration.yaml for reference.
kind: ConfigMap
apiVersion: v1
metadata:
  name: container-azm-ms-agentconfig
  namespace: kube-system
data:
  schema-version: v1
  config-version: ver1
  log-data-collection-settings: |-
    [log_collection_settings]
      [log_collection_settings.stdout]
        enabled = true
      [log_collection_settings.stderr]
        enabled = true
      [log_collection_settings.env_var]
        enabled = true
  prometheus-data-collection-settings: |-
    [prometheus_data_collection_settings.cluster]
      interval = "1m"
      monitor_kubernetes_pods = true
    [prometheus_data_collection_settings.node]
      interval = "1m"
Thanks. Can you share the status of the agent's replica pod?
@vishiy Apologies for the delay... Upon further review, it looks like the replica pod is restarting constantly due to OOMKiller. The current memory request/limit is 250Mi/750Mi.
Name:                 omsagent-rs-7b778d75cc-rmb7h
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 aks-nodepool1-42391497-vmss00000l/10.100.3.90
Start Time:           Fri, 23 Oct 2020 13:03:22 -0400
Labels:               kubernetes.azure.com/managedby=aks
                      pod-template-hash=7b778d75cc
                      rsName=omsagent-rs
Annotations:          WSID: ZjYzNzc5ODktOWNmMC00Mzg0LTg4NWQtZGU0ZDQzNGMzMjA0
                      agentVersion: 1.10.0.1
                      dockerProviderVersion: 10.1.0-0
                      schema-versions: v1
Status:               Running
IP:                   10.100.3.106
Controlled By:        ReplicaSet/omsagent-rs-7b778d75cc
Containers:
  omsagent:
    Container ID:   docker://e36329e68f665eec8d867e8671bee0a8c6f562a07023fdc04bccd162efb3feb4
    Image:          mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod10052020
    Image ID:       docker-pullable://mcr.microsoft.com/azuremonitor/containerinsights/ciprod@sha256:532c608ad5e68f78ec73ca95ea5d985edd80aada10a8fcd9afd04caee10218de
    Ports:          25225/TCP, 25224/UDP, 25227/TCP
    Host Ports:     0/TCP, 0/UDP, 0/TCP
    State:          Running
      Started:      Mon, 26 Oct 2020 12:19:21 -0400
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    143
      Started:      Mon, 26 Oct 2020 12:15:24 -0400
      Finished:     Mon, 26 Oct 2020 12:19:19 -0400
    Ready:          True
    Restart Count:  661
    Limits:
      cpu:     1
      memory:  750Mi
    Requests:
      cpu:     150m
      memory:  250Mi
    Liveness:  exec [/bin/bash -c /opt/livenessprobe.sh] delay=60s timeout=1s period=60s #success=1 #failure=3
Is it possible to increase the memory allocation in the azure-monitor-configuration.yaml somehow? Or do you recommend some other means of tweaking these settings? Also, I would assume the recommendations differ for a cluster that's being newly deployed versus an existing cluster?
@femsulu Can we switch this back to "awaiting-product-team-response"?
@vishiy @femsulu Apologies for pestering... still hoping for some guidance here...
Sorry for the delay @Blackbaud-BenLambert - following up on this internally. Will update as soon as possible.
It's mostly because 'monitor_kubernetes_pods' == true. How many pods do you have in the whole cluster that we are scraping? Also, do you know the metric volume per scrape?
@vishiy I already presumed this was due to the volume of metrics being scraped from pods (which we purposely enabled with 'monitor_kubernetes_pods' == true).
To provide some rough estimates, we're running approximately 950 pods. And based on some mining in Log Analytics, it looks like we're peaking at about 26k metrics per minute between the crashes/restarts (which roughly equates to 28 metrics per pod). I imagine this is a rather large metric volume for a single container/process to handle, which raises the question: why couldn't this metric-scraping work be spread out among the other omsagent containers that are already distributed across the nodes in the AKS cluster? That seems like a much more scalable solution...
Hopefully, something along those lines is in the works. In the meantime, however, the most obvious workaround is to allocate more memory to that container (omsagent-rs-*) so it isn't reaped by the OOM killer. So... Is it possible to increase the memory allocation in the azure-monitor-configuration.yaml somehow? Or do you recommend some other means of tweaking these settings? Also, I would assume the recommendations differ for a cluster that's being newly deployed versus an existing cluster?
Thanks again...
@Blackbaud-ChrisBlythe - The reason we do this from the replica (and not from the daemonset agent on each node) is to minimize watch calls on the API server (pods are discovered through API server watches); that load grows with the number of nodes and can render the API server and the cluster inoperable. That said, we are looking into watching through the local kubelet (rather than the API server) so that we can avoid this scenario of a single pod handling the entire cluster.
That said, in what other ways are your metrics exposed? A node service on each node? We do have ways to scrape node URLs, which happens from each node. Those need to be static URLs, rather than us watching for pods and inferring the endpoints automatically.
Since we are a managed add-on in AKS, we don't support changing limits. To increase limits in hyper-scale scenarios, the only option is to use our Helm chart, where you can specify limits and requests.
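For readers looking for a concrete starting point, here is a minimal sketch of a Helm values override that raises the replica pod's memory limit. The value paths (omsagent.resources.deployment, etc.) and the numbers are assumptions, not taken from this thread; verify the exact key names against the chart's values.yaml before applying.

# values-override.yaml -- illustrative sketch only; key names are assumed,
# confirm against the azuremonitor-containers chart's values.yaml.
omsagent:
  resources:
    deployment:          # assumed to map to the omsagent-rs replica pod
      requests:
        cpu: 150m
        memory: 500Mi    # example bump from the 250Mi seen above
      limits:
        cpu: 1
        memory: 1500Mi   # example bump from the 750Mi seen above
# Applied with something along the lines of:
#   helm upgrade --install <release> <chart path or repo> -f values-override.yaml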
Thanks for the quick response @vishiy
Ah... additional load on the API server due to pod/container discovery... that makes sense.
Our primary goal is to scrape app/runtime metrics from the pods/containers, so scraping static, node-level URLs doesn't seem to accomplish that unless the target is somehow acting as a proxy or aggregator to all the pods on a particular node.
I'm curious about the HELM chart(s) you referenced.... Can you point me to those in the related docs or github?
Also, we are still scraping at 1m intervals... do you think decreasing the frequency would yield more stability? I guess another way to ask that question is... can cycles of the omsagent/telegraf collection process overlap if they take too long to complete?
@Blackbaud-ChrisBlythe - I think the issue is due to the number of connections made every minute (to those pods) to scrape, which is in direct proportion to the number of pods watched and scraped across the entire cluster, rather than the metric volume. I don't think decreasing the frequency (to, say, 3 or 5 minutes) would help, but you can try.
Below is the Helm chart - https://github.com/microsoft/Docker-Provider/tree/ci_prod/charts/azuremonitor-containers
Also, do you have these metrics exposed via a k8s service/endpoint that can be configured as the scrape target, rather than scraping every pod?
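To illustrate the alternatives mentioned above (scraping fixed node URLs or k8s services instead of discovering every pod), below is a sketch of the relevant prometheus-data-collection-settings for the container-azm-ms-agentconfig ConfigMap. The service name, namespace, and ports are placeholders, and the setting names should be checked against the current Container Insights documentation.

prometheus-data-collection-settings: |-
  [prometheus_data_collection_settings.cluster]
    interval = "1m"
    # Scrape a fixed set of services instead of discovering every pod.
    kubernetes_services = ["http://example-svc.example-ns:8080/metrics"]
    monitor_kubernetes_pods = false
  [prometheus_data_collection_settings.node]
    interval = "1m"
    # Node-level URLs are scraped by the daemonset agent on each node;
    # $NODE_IP is replaced with each node's IP at runtime.
    urls = ["http://$NODE_IP:9103/metrics"]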
It seems that we have provided feedback to the customer, and we have not received similar feedback from others. If the issue persists, we advise opening a support ticket.
Hi @vishiy I see that this issue has been open for a very long time. Are you still experiencing this issue? Thanks, Abby
We see that unfortunately, we have not been able to address this issue in a timely manner. The scope of our feedback channel here on GitHub covers specific documentation fixes. If you are still experiencing this issue, we can help redirect you to the right support channel to get an answer to your question: https://docs.microsoft.com/en-us/answers/topics/24223/azure-monitor.html We are closing this issue at this time.
We've configured Prometheus metric scraping in our AKS cluster(s) to capture metrics from our pods/containers, but have noticed that metrics are often missing for large chunks of time (even though the scrape interval is set to 60s). Just for reference, we have ~1000 pods running across 20 nodes, and several of the underlying services are generating a large number of dimensional metrics. I noticed that telegraf is being used under the covers and suspect it is dropping metrics due to the sheer volume of data and its internal buffer configuration (having run into similar situations using it in the past).
Do you have any recommendations or guidance for investigating this further or ultimately resolving it (perhaps by tweaking the telegraf buffer configuration, etc.)?