Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[BUG] #3796

Open · romangallego opened this issue 1 year ago

romangallego commented 1 year ago

Describe the bug
The ama-metrics-xxxx ReplicaSet pods constantly restart: the OTel exporter cannot reach localhost:55680.

To Reproduce
Steps to reproduce the behavior:

  1. Created AKS cluster
  2. Created Azure Monitor Workspace
  3. Created Managed Grafana
  4. Enabled Metric Collection on Azure Monitor Workspace > Monitored Clusters.
  5. The AMA ReplicaSet and DaemonSet get created:

     kubectl get rs -o wide | grep ama
     ama-metrics-7c7dbd77c8       1   1   1   5h3m   prometheus-collector,addon-token-adapter   mcr.microsoft.com/azuremonitor/containerinsights/ciprod/prometheus-collector/images:6.7.2-main-06-26-2023-6ee07896,mcr.microsoft.com/aks/msi/addon-token-adapter:master.221118.2   pod-template-hash=7c7dbd77c8,rsName=ama-metrics
     ama-metrics-ksm-8559444f74   1   1   1   5h3m   ama-metrics-ksm   mcr.microsoft.com/oss/kubernetes/kube-state-metrics:v2.8.1   app.kubernetes.io/name=ama-metrics-ksm,pod-template-hash=8559444f74

     kubectl get ds -o wide | grep ama
     ama-metrics-node       2   2   2   2   2   <none>   5h3m   prometheus-collector,addon-token-adapter       mcr.microsoft.com/azuremonitor/containerinsights/ciprod/prometheus-collector/images:6.7.2-main-06-26-2023-6ee07896,mcr.microsoft.com/aks/msi/addon-token-adapter:master.221118.2   dsName=ama-metrics-node
     ama-metrics-win-node   0   0   0   0   0   <none>   5h3m   prometheus-collector,addon-token-adapter-win   mcr.microsoft.com/azuremonitor/containerinsights/ciprod/prometheus-collector/images:6.7.2-main-06-26-2023-6ee07896-win,mcr.microsoft.com/aks/hcp/addon-token-adapter:20230120winbeta   dsName=ama-metrics-win-node

However, the OTel exporter targets the endpoint localhost:55680 to send spans or metrics, which is the default value. None of the variables that would override it are set: OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_EXPORTER_OTLP_SPAN_ENDPOINT, OTEL_EXPORTER_OTLP_METRIC_ENDPOINT.
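A minimal sketch of how to double-check this (assuming the labels shown above and that the collector image ships a POSIX shell; names are from this cluster and may differ in yours):

```sh
# Pick one of the ama-metrics replicaset pods
POD=$(kubectl get pods -n kube-system -l rsName=ama-metrics -o jsonpath='{.items[0].metadata.name}')

# Show any OTEL_EXPORTER_* overrides on the prometheus-collector container (expected: none)
kubectl exec -n kube-system "$POD" -c prometheus-collector -- \
  sh -c 'env | grep OTEL_EXPORTER || echo "no OTEL_EXPORTER_* overrides set"'
```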

As a result, no metrics get ingested or can be queried in the Azure Monitor Workspace.

Expected behavior
The OTel exporter should send metrics to the Data Collection Endpoint, I believe?


Additional context
These are Azure managed services.

vishiy commented 1 year ago

@romangallego - is it just the ReplicaSet pods that are restarting, or do the ama-metrics-node pods restart as well?

romangallego commented 1 year ago

Thank you @vishiy, it is just the ReplicaSet pods. The log of one of them is:

kubectl logs ama-metrics-7c7dbd77c8-dqbfx -nkube-system

Defaulted container "prometheus-collector" out of: prometheus-collector, addon-token-adapter MODE=advanced CONTROLLER_TYPE=ReplicaSet CLUSTER=/subscriptions/thesubscription/resourceGroups/therg/providers/Microsoft.ContainerService/managedClusters/aks-cicds-test Start Processing pod-annotation-based-scraping::configmap section not mounted, using defaults **End default-targets-namespace-keep-list-regex-settings Processing* ****Start prometheus-collector-settings Processing**** config::AZMON_CLUSTER_ALIAS:'' config::AZMON_CLUSTER_LABEL:aks-cicds-test ***End prometheus-collector-settings Processing* ***Start default-scrape-settings Processing* default-scrape-settings::warning::MAC mode is enabled. Only enabling targets kubestate,cadvisor,kubelet,kappiebasic & nodeexporter for linux before config map processing.... ****End default-scrape-settings Processing**** ***Start debug-mode Settings Processing*** **End debug-mode Settings Processing** **Start default-targets-metrics-keep-list Processing** default-scrape-keep-lists::minimalIngestionProfile=true, MAC is enabled. Applying appropriate MAC Regexes *End default-targets-metrics-keep-list Processing* **Start default-targets-scrape-interval-settings Processing* *End default-targets-scrape-interval-settings Processing**** prometheus-config-merger::warning::Custom prometheus config does not exist, using only default scrape targets if they are enabled **Start Merging Default and Custom Prometheus Config** prometheus-config-merger::Updating scrape interval config for /opt/microsoft/otelcollector/default-prom-configs/kubestateDefault.yml prometheus-config-merger::Adding keep list regex or minimal ingestion regex for /opt/microsoft/otelcollector/default-prom-configs/kubestateDefault.yml prometheus-config-merger::Done merging 1 default prometheus config(s) prometheus-config-merger::Starting to merge default prometheus config values in collector template as backup **Done Merging Default and Custom Prometheus Config***** prom-config-validator::No custom prometheus config found. Only using default scrape configs prom-config-validator::Config file provided - /opt/defaultsMergedConfig.yml prom-config-validator::Successfully generated otel config prom-config-validator::Loading configuration... 
prom-config-validator::Successfully loaded and validated prometheus config prom-config-validator::Prometheus default scrape config validation succeeded, using this as collector config prom-config-validator::Use default prometheus config: true checking health of token adapter after 1 secs found token adapter to be healthy after 1 secs ME_CONFIG_FILE=/usr/sbin/me.config customResourceId=/subscriptions/thesubscription/resourceGroups/therg/providers/Microsoft.ContainerService/managedClusters/aks-cicds-test customRegion=westeurope Waiting for 10s for token adapter sidecar to be up and running so that it can start serving IMDS requests Setting env variables from envmdsd file for MDSD Starting MDSD 1.23.5 MDSD_VERSION=Waiting for 30s for MDSD to get the config and put them in place for ME Reading me config file as a string for configOverrides paramater Starting metricsextension ME_VERSION=2.2023.224.2214-1.cm2 RUBY_VERSION=ruby 3.1.4p223 (2023-03-30 revision 957bb7cb81) [x86_64-linux] GOLANG_VERSION=go version go1.19.10 linux/amd64 Starting otelcollector with only default scrape configs enabled OTELCOLLECTOR_VERSION=custom-collector-distro version 0.73.0 PROMETHEUS_VERSION=2.43.0 starting telegraf TELEGRAF_VERSION=1.25.2-2.cm2 starting fluent-bit FLUENT_BIT_VERSION=Fluent Bit v2.0.9 Git commit: FLUENT_BIT_CONFIG_FILE=/opt/fluent-bit/fluent-bit.conf starting inotify for watching mdsd config update AZMON_CONTAINER_START_TIME=1689675483 AZMON_CONTAINER_START_TIME_READABLE=Tue Jul 18 10:18:03 UTC 2023 File Doesnt Exist. Creating file... Fluent Bit v2.0.9

{"filepath":"/opt/microsoft/otelcollector/collector-log.txt","level":"error","ts":1689675500.48125,"caller":"exporterhelper/queued_retry.go:367","msg":"Exporting failed. Try enabling retry_on_failure config option to retry on retryable errors","kind":"exporter","data_type":"metrics","name":"otlp","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:55680: connect: connection refused\"","stacktrace":"go.opentelemetry.io/collector/exporter/exporterhelper.(retrySender).send\n\t/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.74.0/exporterhelper/queued_retry.go:367\ngo.opentelemetry.io/collector/exporter/exporterhelper.(metricsSenderWithObservability).send\n\t/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.74.0/exporterhelper/metrics.go:136\ngo.opentelemetry.io/collector/exporter/exporterhelper.(queuedRetrySender).start.func1\n\t/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.74.0/exporterhelper/queued_retry.go:205\ngo.opentelemetry.io/collector/exporter/exporterhelper/internal.(boundedMemoryQueue).StartConsumers.func1\n\t/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.74.0/exporterhelper/internal/bounded_memory_queue.go:58"} {"filepath":"/opt/microsoft/otelcollector/collector-log.txt","level":"error","ts":1689675530.339798,"caller":"exporterhelper/queued_retry.go:367","msg":"Exporting failed. Try enabling retry_on_failure config option to retry on retryable errors","kind":"exporter","data_type":"metrics","name":"otlp","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:55680: connect: connection refused\"","stacktrace":"go.opentelemetry.io/collector/exporter/exporterhelper.(retrySender).send\n\t/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.74.0/exporterhelper/queued_retry.go:367\ngo.opentelemetry.io/collector/exporter/exporterhelper.(metricsSenderWithObservability).send\n\t/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.74.0/exporterhelper/metrics.go:136\ngo.opentelemetry.io/collector/exporter/exporterhelper.(queuedRetrySender).start.func1\n\t/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.74.0/exporterhelper/queued_retry.go:205\ngo.opentelemetry.io/collector/exporter/exporterhelper/internal.(boundedMemoryQueue).StartConsumers.func1\n\t/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.74.0/exporterhelper/internal/bounded_memory_queue.go:58"} {"filepath":"/dev/write-to-traces","time":"2023-07-18T10:18:51","message":"No configuration present for the AKS resource"} {"filepath":"/opt/microsoft/otelcollector/collector-log.txt","level":"error","ts":1689675560.39544,"caller":"exporterhelper/queued_retry.go:367","msg":"Exporting failed. 
Try enabling retry_on_failure config option to retry on retryable errors","kind":"exporter","data_type":"metrics","name":"otlp","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:55680: connect: connection refused\"","stacktrace":"go.opentelemetry.io/collector/exporter/exporterhelper.(retrySender).send\n\t/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.74.0/exporterhelper/queued_retry.go:367\ngo.opentelemetry.io/collector/exporter/exporterhelper.(metricsSenderWithObservability).send\n\t/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.74.0/exporterhelper/metrics.go:136\ngo.opentelemetry.io/collector/exporter/exporterhelper.(queuedRetrySender).start.func1\n\t/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.74.0/exporterhelper/queued_retry.go:205\ngo.opentelemetry.io/collector/exporter/exporterhelper/internal.(boundedMemoryQueue).StartConsumers.func1\n\t/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.74.0/exporterhelper/internal/bounded_memory_queue.go:58"} {"filepath":"/opt/microsoft/otelcollector/collector-log.txt","level":"error","ts":1689675590.449191,"caller":"exporterhelper/queued_retry.go:367","msg":"Exporting failed. Try enabling retry_on_failure config option to retry on retryable errors","kind":"exporter","data_type":"metrics","name":"otlp","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:55680: connect: connection refused\"","stacktrace":"go.opentelemetry.io/collector/exporter/exporterhelper.(retrySender).send\n\t/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.74.0/exporterhelper/queued_retry.go:367\ngo.opentelemetry.io/collector/exporter/exporterhelper.(metricsSenderWithObservability).send\n\t/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.74.0/exporterhelper/metrics.go:136\ngo.opentelemetry.io/collector/exporter/exporterhelper.(queuedRetrySender).start.func1\n\t/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.74.0/exporterhelper/queued_retry.go:205\ngo.opentelemetry.io/collector/exporter/exporterhelper/internal.(boundedMemoryQueue).StartConsumers.func1\n\t/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.74.0/exporterhelper/internal/bounded_memory_queue.go:58"} {"time":1689675602.557504,"filepath":"/opt/microsoft/linuxmonagent/mdsd.err","log":"2023-07-18T10:19:58.2747220Z: [/__w/1/s/external/WindowsAgent/src/shared/mcsmanager/lib/src/RefreshConfigurations.cpp:318,GetAgentConfigurations]Could not obtain configuration from https://global.handler.control.monitor.azure.com after first round of tries. Will try again with a fallback endpoint. ErrorCode:1310977"} {"filepath":"/opt/microsoft/otelcollector/collector-log.txt","level":"error","ts":1689675620.303337,"caller":"exporterhelper/queued_retry.go:367","msg":"Exporting failed. 
Try enabling retry_on_failure config option to retry on retryable errors","kind":"exporter","data_type":"metrics","name":"otlp","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:55680: connect: connection refused\"","stacktrace":"go.opentelemetry.io/collector/exporter/exporterhelper.(retrySender).send\n\t/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.74.0/exporterhelper/queued_retry.go:367\ngo.opentelemetry.io/collector/exporter/exporterhelper.(metricsSenderWithObservability).send\n\t/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.74.0/exporterhelper/metrics.go:136\ngo.opentelemetry.io/collector/exporter/exporterhelper.(queuedRetrySender).start.func1\n\t/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.74.0/exporterhelper/queued_retry.go:205\ngo.opentelemetry.io/collector/exporter/exporterhelper/internal.(boundedMemoryQueue).StartConsumers.func1\n\t/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.74.0/exporterhelper/internal/bounded_memory_queue.go:58"} shutting down
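For completeness, a hedged sketch (using the labels from the kubectl output earlier in this issue) to compare restart counts of the ReplicaSet and DaemonSet pods:

```sh
# Restart counts for the ama-metrics replicaset pods
kubectl get pods -n kube-system -l rsName=ama-metrics \
  -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[*].restartCount

# Restart counts for the ama-metrics-node daemonset pods
kubectl get pods -n kube-system -l dsName=ama-metrics-node \
  -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[*].restartCount
```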

romangallego commented 1 year ago

Hi @vishiy and team, do you know why the prometheus-collector is reporting

    connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:55680: connect: connection refused"

and how I could troubleshoot the origin of the issue?
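One hedged way to dig further: the dial error just means nothing is listening on 127.0.0.1:55680 inside the pod, so the logs of the previous container instance (before the last restart) may show why the process behind that port exited.

```sh
# Logs of the previous (restarted) prometheus-collector container instance
kubectl logs ama-metrics-7c7dbd77c8-dqbfx -n kube-system -c prometheus-collector --previous | tail -n 50
```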

romangallego commented 1 year ago

Hey, I have found that you can enable debug mode by creating the config map and applying it. I will try it: https://github.com/Azure/prometheus-collector/blob/main/otelcollector/configmaps/ama-metrics-settings-configmap.yaml
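A minimal sketch of that approach (the raw URL is derived from the blob link above; double-check the key names against the file you download):

```sh
# Fetch the sample settings configmap, set the debug-mode section to enabled = true
# in an editor, then apply it (the configmap targets the kube-system namespace).
curl -sLO https://raw.githubusercontent.com/Azure/prometheus-collector/main/otelcollector/configmaps/ama-metrics-settings-configmap.yaml
# ... edit ama-metrics-settings-configmap.yaml: debug-mode -> enabled = true ...
kubectl apply -f ama-metrics-settings-configmap.yaml
```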

Upon enabling debug-mode: true, I got this:

    {"time":1689924374.231554,"filepath":"/opt/microsoft/linuxmonagent/mdsd.err","log":"2023-07-21T07:26:09.4060000Z: [/__w/1/s/external/WindowsAgent/src/shared/mcsmanager/lib/src/RefreshConfigurations.cpp:318,GetAgentConfigurations]Could not obtain configuration from https://global.handler.control.monitor.azure.com after first round of tries. Will try again with a fallback endpoint. ErrorCode:1310977"}

Is this relevant? I have the firewall open for that URL and for the rest of the documented requirements. I am using Linux systems, so my guess is this is not relevant. Why are the services on localhost:9090 and localhost:55680 not coming up in the pod?

vishiy commented 1 year ago

@romangallego - are there any firewall configurations preventing the config download and causing egress to fail? Basically, our service (running on 55680) is shutting down after a while because of this. Please check the network URIs that need to be allow-listed here -- https://learn.microsoft.com/en-us/azure/azure-monitor/essentials/prometheus-metrics-enable?tabs=azure-portal#network-firewall-requirements

muhyid commented 11 months ago

Dear all, I have a similar problem with a newly installed AKS cluster, but I also see errors in the ama-metrics-node pods. I don't have a firewall or any other restriction on access outside the cluster.

Kubernetes version: 1.27.3, private AKS cluster.
The Prometheus containers in the ama-metrics-node and ama-metrics pods are restarting almost every 17.5 minutes.

P.S.: I created an Azure Monitor Workspace and selected everything in Diagnostic Settings, but I didn't create Managed Prometheus because I didn't want to use it.

ama-metrics-node:

{"filepath":"/opt/microsoft/otelcollector/collector-log.txt","level":"error","ts":1697581881.758687,"caller":"exporterhelper/queued_retry.go:367","msg":"Exporting failed. Try enabling retry_on_failure config option to retry on retryable errors","kind":"exporter","data_type":"metrics","name":"otlp","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:55680: connect: connection refused\"","stacktrace":"go.opentelemetry.io/collector/exporter/exporterhelper.(retrySender).send\n\t/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.74.0/exporterhelper/queued_retry.go:367\ngo.opentelemetry.io/collector/exporter/exporterhelper.(metricsSenderWithObservability).send\n\t/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.74.0/exporterhelper/metrics.go:136\ngo.opentelemetry.io/collector/exporter/exporterhelper.(queuedRetrySender).start.func1\n\t/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.74.0/exporterhelper/queued_retry.go:205\ngo.opentelemetry.io/collector/exporter/exporterhelper/internal.(boundedMemoryQueue).StartConsumers.func1\n\t/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.74.0/exporterhelper/internal/bounded_memory_queue.go:58"}

ama-metrics:

{"filepath":"/opt/microsoft/otelcollector/collector-log.txt","level":"error","ts":1697581793.831743,"caller":"exporterhelper/queued_retry.go:367","msg":"Exporting failed. Try enabling retry_on_failure config option to retry on retryable errors","kind":"exporter","data_type":"metrics","name":"otlp","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:55680: connect: connection refused\"","stacktrace":"go.opentelemetry.io/collector/exporter/exporterhelper.(retrySender).send\n\t/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.74.0/exporterhelper/queued_retry.go:367\ngo.opentelemetry.io/collector/exporter/exporterhelper.(metricsSenderWithObservability).send\n\t/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.74.0/exporterhelper/metrics.go:136\ngo.opentelemetry.io/collector/exporter/exporterhelper.(queuedRetrySender).start.func1\n\t/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.74.0/exporterhelper/queued_retry.go:205\ngo.opentelemetry.io/collector/exporter/exporterhelper/internal.(boundedMemoryQueue).StartConsumers.func1\n\t/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.74.0/exporterhelper/internal/bounded_memory_queue.go:58"}

Do you have any comments on these?

muhyid commented 11 months ago

Moved to [BUG]: #3960

vishiy commented 11 months ago

Hi @muhyid - Have you set up AMPLS for Private Link? It looks like some components are shutting down because they are unable to download their configuration. Please see here -- https://learn.microsoft.com/en-us/azure/azure-monitor/essentials/private-link-data-ingestion

michaelgregson commented 10 months ago

Hi @muhyid I had the same issue.

It was caused by me previously trying to set up AMPLS before realising it was the firewall rules I needed to amend.

Deleting the AMPLS config had left behind DNS entries for privatelink.monitor.azure.com and privatelink.northeurope.prometheus.monitor.azure.com.

Once I deleted these and restarted the ama-metrics-node pods, the issue was resolved.
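If you suspect the same leftover zones, a hedged way to spot them with the Azure CLI (assumes the zones live in a subscription you can query):

```sh
# List private DNS zones whose name contains monitor.azure.com
az network private-dns zone list --query "[?contains(name, 'monitor.azure.com')].name" -o tsv
```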

I hope this helps

dgshue commented 10 months ago

@michaelgregson Where did you find this config? I am trying to use AMPLS, but I just want to see what the config looks like first...

michaelgregson commented 9 months ago

@dgshue apologies, "config" was a misnomer. I deleted the [AMPLS](https://portal.azure.com/#view/Microsoft_Azure_Monitoring/AzureMonitoringBrowseBlade/~/privateLinkScopes) itself.

And then the [Private DNS Zones](https://portal.azure.com/#view/HubsExtension/BrowseResource/resourceType/Microsoft.Network%2FprivateDnsZones) it had created.

harivmu commented 8 months ago

Hello everyone, I am also facing the same error for a private cluster. Did anyone manage to solve the issue?

arpanD93 commented 6 months ago

I'm also facing the same issue after enabling managed Prometheus and Grafana for AKS monitoring.

{"filepath":"/opt/microsoft/otelcollector/collector-log.txt","level":"error","ts":1709913969.106274,"caller":"exporterhelper/common.go:49","msg":"Exporting failed. Try enabling retry_on_failure config option to retry on retryable errors","kind":"exporter","data_type":"metrics","name":"otlp","dropped_items":3037,"error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:55680: connect: connection refused\"","stacktrace":"go.opentelemetry.io/collector/exporter/exporterhelper.(errorLoggingRequestSender).send\n\t/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.90.0/exporterhelper/common.go:49\ngo.opentelemetry.io/collector/exporter/exporterhelper.(metricsSenderWithObservability).send\n\t/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.90.0/exporterhelper/metrics.go:170\ngo.opentelemetry.io/collector/exporter/exporterhelper.(queueSender).consume\n\t/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.90.0/exporterhelper/queue_sender.go:120\ngo.opentelemetry.io/collector/exporter/exporterhelper/internal.(boundedMemoryQueue[...]).Consume\n\t/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.90.0/exporterhelper/internal/bounded_memory_queue.go:55\ngo.opentelemetry.io/collector/exporter/exporterhelper/internal.(QueueConsumers[...]).Start.func1\n\t/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.90.0/exporterhelper/internal/consumers.go:43"} {"filepath":"/opt/microsoft/otelcollector/collector-log.txt","level":"error","ts":1709913969.106401,"caller":"exporterhelper/queue_sender.go:128","msg":"Exporting failed. No more retries left. Dropping data.","kind":"exporter","data_type":"metrics","name":"otlp","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:55680: connect: connection refused\"","dropped_items":3037,"stacktrace":"go.opentelemetry.io/collector/exporter/exporterhelper.(queueSender).consume\n\t/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.90.0/exporterhelper/queue_sender.go:128\ngo.opentelemetry.io/collector/exporter/exporterhelper/internal.(boundedMemoryQueue[...]).Consume\n\t/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.90.0/exporterhelper/internal/bounded_memory_queue.go:55\ngo.opentelemetry.io/collector/exporter/exporterhelper/internal.(QueueConsumers[...]).Start.func1\n\t/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.90.0/exporterhelper/internal/consumers.go:43"} shutting down

For me, connectivity is allowed to the FQDNs below, as I'm using a UDR for AKS outbound traffic.

[screenshot of the allowed FQDN list]

@Azure/aks-leads Can you please provide an update ASAP?

soberich commented 6 months ago

cc @vishiy,

Microsoft??? This seems to be unresolved across several open issues, with no good workarounds or suggestions provided! I'm facing the same issue. What is this OTLP exporter trying to export, and where exactly? Could you explain more precisely? Why does it try to connect to localhost? Why on a legacy port?

Can we just disable this and, I don't know, deploy Grafana in the cluster and use that with Managed Prometheus directly, instead of sending to the Azure Monitor workspace? Would that at least let us set the issue aside? Thanks

qaiserali commented 4 months ago

Any progress on this? I'm also stuck with this issue.

2024-05-14T18:50:01Z I! Loading config: /opt/telegraf/telegraf-prometheus-collector-ds.conf {"time":"2024-05-14T18:50:58","message":"No configuration present for the AKS resource","filepath":"/dev/write-to-traces"} {"time":1715712714.489822,"filepath":"/opt/microsoft/linuxmonagent/mdsd.err","log":"2024-05-14T18:51:54.4897440Z: [/__w/1/s/external/WindowsAgent/src/shared/mcsmanager/lib/src/RefreshConfigurations.cpp:318,GetAgentConfigurations]Could not obtain configuration from https://global.handler.control.monitor.azure.com after first round of tries. Will try again with a fallback endpoint. ErrorCode:1310977"} shutting down

kamilzzz commented 3 months ago

For me, this is not an issue with the Managed Prometheus addon.

We encountered the same thing, and in our case the issue, as correctly pointed out by @vishiy, was DNS resolution (in our case, after configuring the privatelink DNS zone for monitor.azure.com).

The logs are more or less clear: it fails to connect to https://global.handler.control.monitor.azure.com.

The first thing to try would be to exec into any pod running on the same cluster and run nslookup global.handler.control.monitor.azure.com. If it doesn't resolve, that is your issue and you need to sort out DNS resolution (enabling AMPLS/privatelink.monitor.azure.com blindly breaks a lot of scenarios, as it works differently from any other Private Endpoint).

If DNS resolution looks correct, the next step would be to test connectivity, for example with a telnet client or curl against global.handler.control.monitor.azure.com. If that doesn't work, the firewall is your problem and you need to sort out the firewall rules.

If you successfully performed these two tests from another pod and the ama-metrics pods are still reporting the issue, then it may suggest a problem with the addon.
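A hedged sketch of those two checks, run from a throwaway pod on the same cluster (the image is an arbitrary choice; any image with nslookup and curl will do):

```sh
# Spin up a temporary debug pod, resolve the config endpoint, then test HTTPS connectivity
kubectl run net-debug --rm -it --image=nicolaka/netshoot --restart=Never -- sh -c '
  nslookup global.handler.control.monitor.azure.com &&
  curl -sv --max-time 10 https://global.handler.control.monitor.azure.com -o /dev/null
'
```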

ChrisJD-VMC commented 3 months ago

I've spent a couple of weeks getting AKS clusters connected to an Azure Monitor Workspace (AMW) for Prometheus. I'm not an expert by any means, but I've got it working for clusters in both the same and different regions (from the AMW) using Azure Monitor Private Link Scope (AMPLS).

I've had the OP's error several times, and it has always boiled down to an issue of some sort with my setup. It seems to occur when the metrics container can't reach a Data Collection Endpoint to fetch configuration from or send data to.

As @kamilzzz noted, DNS is one possible issue, especially with AMPLS.

Other possible issues

  1. No DCE in the cluster's region. You need a DCE in the same region as the cluster to provide configuration to the cluster.
  2. No DCE in the Azure Monitor Workspace's region. If the workspace and cluster are in the same region this can be the same endpoint as above.
  3. Not connected to the AMPLS (if using)

Some things to check
Note: Some of this probably isn't needed if you aren't using AMPLS, but it may give you some ideas of things to check.

  1. You have a DCE in the same region as the AMW. This will be the data ingestion DCE that all clusters use.
  2. You have a DCE in each region with a cluster you want to monitor. (If the workspace and a cluster are in the same region, this can be the same endpoint as above.) This will be the configuration DCE that clusters in that region connect to to fetch configuration.
  3. You have a Data Collection Rule (DCR) that links the data ingestion DCE to the AMW.
  4. Each cluster is linked to the configuration DCE in the region it is in. I've done everything via Terraform; this requirement corresponds to every cluster having an `azurerm_monitor_data_collection_rule_association` resource with a `target_resource_id` of the cluster and a `data_collection_endpoint_id` of the region's configuration DCE.
  5. Each cluster is linked to the data ingestion DCE in the AMW's region. This requirement corresponds to every cluster having an `azurerm_monitor_data_collection_rule_association` resource with a `target_resource_id` of the cluster and a `data_collection_rule_id` of the DCR for data ingestion. (A way to verify these associations is sketched after this list.)
  6. Check your DNS.
  7. If your clusters are isolated on their own networks, you need to set up appropriate private DNS for each cluster and link it to the cluster's network.
  8. If your clusters are on networks that are connected, you need to set up the private DNS once, probably on your hub network if you have one.
  9. General AMPLS setup and the DNS zones needed are covered in https://learn.microsoft.com/en-us/azure/azure-monitor/logs/private-link-configure. You'll need a private endpoint for each cluster network if on isolated networks.
  10. Private Link Scoped Services are needed for each DCE to link them to the AMPLS.
  11. You'll need Prometheus rule groups (azurerm_monitor_alert_prometheus_rule_group) scoped to the AMW to configure which metrics are collected. If you don't also scope to a specific cluster, the rules will be used for all connected clusters.
  12. For alerts I didn't have much luck setting global rules, so I scoped the alert rules to AMW + cluster. (Anyone know if it's possible to have default alerts like data collection?)
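As referenced in item 5, a hedged Azure CLI sketch to list the DCR/DCE associations on a cluster (verify the exact parameter names with `--help`; availability may depend on your CLI version or the monitor-control-service extension):

```sh
# Resource ID of the AKS cluster
CLUSTER_ID=$(az aks show -g <resource-group> -n <cluster-name> --query id -o tsv)

# List DCR/DCE associations attached to the cluster (should include the configuration
# DCE association and the data-ingestion DCR association described above)
az monitor data-collection rule association list --resource "$CLUSTER_ID" -o table
```
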
qaiserali commented 3 months ago

@ChrisJD-VMC Thank you for the detailed reply. I initially thought the DCE would be accessible via the internet, but it seems that's not possible when the cluster is private. AMPLS needs to be configured, and the DCE should be connected to it.

AxelTob commented 3 months ago

> (quoting @ChrisJD-VMC's checklist above) For Alerts I didn't have much luck setting global rules so I scoped the alert rules to AMW + cluster. (Anyone know if it's possible to have default alerts like data collection?)

It's possible to set up alert rules. I've used azapi with success: https://github.com/jeffwmiles/aks-prometheus-grafana/blob/part2/terraform/monitoring.tf

This exists as well: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/monitor_alert_prometheus_rule_group

But I have not tried that one.

jgresc commented 2 months ago

I enabled Container Insights in AKS a month ago, and it used to work fine. Then I introduced an AMPLS into the network and it stopped working. I did it through IaC and also tried through the portal. When you enable monitoring in the portal, it creates a DCR but no DCE. I added the DCE afterwards; it still did not collect data.

Can we get some support, please?

Edit: After adding the DCE and associating it with the DCR that is linked to the AMPLS, the private endpoint of the AMPLS gets a new DNS configuration, the one from the DCE. I had forgotten to add it to the private DNS zone. After adding it, it works.
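A hedged sketch of how to see which FQDNs the AMPLS private endpoint now fronts, so they can be matched against the records in the privatelink private DNS zones (property and parameter names may differ slightly; verify in your environment):

```sh
# Show the DNS configuration the AMPLS private endpoint expects; each entry should
# have a matching record in the corresponding privatelink private DNS zone
az network private-endpoint show -g <resource-group> -n <ampls-private-endpoint> \
  --query customDnsConfigs -o table
```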

microsoft-github-policy-service[bot] commented 3 weeks ago

Issue needing attention of @Azure/aks-leads