Closed sebastienfi closed 2 years ago
Hi 👋 , we will need some more information to pinpoint the issue that you are facing. Please submit a support ticket instead.
I'm having the same exact issue.
Running datadog helm chart v2.26.2
in a GKE Dataplane V2.
Pinpointing the error in our side, it seems the datadog agent is trying to scrape the metrics from cilium operator on port 9090
, by default port by the cilium operator is 6942
.
Haven't seen a way to change the scraping target for cilium operator in the chart, can you please indicate which is the best way to proceed?
I am also facing the same issue. I am using Google Cloud Run for Anthos on GKE with Dataplane V2 enabled.
@sebastienfi were you able to resolve this issue? Is there any work around? cc: @yzhan289 We are facing the same issue. Chart: datadog-2.31.0 Agent: v7.34.0 cilium: 1.10.2 Dataplane V2 is enabled.
Hi @dllegru @NaxAlpha @kkirpichnikov , please file a support case for this so we can collect more information on why this is happening for you.
Thanks, will do.
hey folks, any update?
@NaurisSadovskis in my case they said that I have to do this in my values.yaml
file and it helped.
datadog:
env:
- name: DD_IGNORE_AUTOCONF
value: "cilium"
Thanks @kkirpichnikov. For those using the Helm chart, the value is:
datadog:
ignoreAutoConfig:
- cilium
Did anybody reach a solution that is not ignoring cilium? Support tickets are nice and all, but having it documented here would save us from opening yet another support ticket.
The solution is actually pretty simple, adding it here for the next poor soul struggling with this.
You can customize the check. https://app.datadoghq.com/integrations/cilium?search=cil has the details, but for me, using the K8s operator deployed with helm, the solution was simply to add annotations to the pods by adding the following to my helm values:
podAnnotations:
ad.datadoghq.com/cilium-agent.checks: |
{
"cilium": {
"init_config": {},
"instances": [{"agent_endpoint": "http://%%host%%:9962/metrics","use_openmetrics": "true"}]
}
}
ad.datadoghq.com/cilium-operator.checks: |
{
"cilium": {
"init_config": {},
"instances": [{"operator_endpoint": "http://%%host%%:9963/metrics","use_openmetrics": "true"}]
}
}
Worked perfectly, thank you @argais
This fails on GKE as well. Working through a ticket now (1319416) and we were able to get metrics by adding the following to the helm values file:
datadog:
confd:
cilium.yaml: |-
ad_identifiers:
- cilium
init_config:
instances:
- agent_endpoint: http://%%host%%:9990/metrics
use_openmetrics: true
However the auto config is still puking from the agent side:
cilium (2.4.0)
--------------
Instance ID: cilium:42432dabab946ec3 [ERROR]
Configuration Source: file:/etc/datadog-agent/conf.d/cilium.d/auto_conf.yaml
Total Runs: 23
Metric Samples: Last Run: 0, Total: 0
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 1, Total: 23
Average Execution Time : 87ms
Last Execution Date : 2023-08-25 11:41:24 UTC (1692963684000)
Last Successful Execution Date : Never
Error: HTTPConnectionPool(host='10.60.80.22', port=9962): Max retries exceeded with url: /metrics (Caused by NewCo
nnectionError('<urllib3.connection.HTTPConnection object at 0x7cf4aae82f10>: Failed to establish a new connection: [Errn
o 111] Connection refused'))
Traceback (most recent call last):
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connection.py", line 174, in _new_conn
conn = connection.create_connection(
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/util/connection.py", line 95, in create_co
The configuration source for some reason is not using the new /etc/datadog-agent/conf.d/cilium.yaml. Still feels like something is off but again we are seeing metrics now. That said to clear up the above error we also added:
datadog:
ignoreAutoConfig:
- cilium
That fix the agent status:
cilium (2.4.0)
--------------
Instance ID: cilium:a9c4ff2d2d8bb9ae [OK]
Configuration Source: file:/etc/datadog-agent/conf.d/cilium.yaml
Total Runs: 20
Metric Samples: Last Run: 1,857, Total: 37,140
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 1, Total: 20
Average Execution Time : 923ms
Last Execution Date : 2023-08-25 12:02:47 UTC (1692964967000)
Last Successful Execution Date : 2023-08-25 12:02:47 UTC (1692964967000)
Also I'm not 100% sure but I think you need to enable GKE Dataplane V2 metrics on your cluster.
For those who land here in the future. Be careful, if the integration isn't working correctly, Datadog sees these as "custom" metrics, and you're in for a nasty surprise bill.
I hope someone at Datadog will consider reopening this issue and ensure this is applied correctly when targeting a GKE autopilot cluster, as it causes a significant billing discrepancy. The helm chart should handle this more gracefully.
I honestly don't know if this is a bug or a documentation issue.
Output of the info page (if this is a bug)
Describe what happened:
After installation in Kubernetes host, an error message appears as an integration issue.
Describe what you expected:
I expected this error not to happen, or an error analysis which would give clues on how to correct this which looks like misconfiguration.
Steps to reproduce the issue:
Unknown.
Additional environment details (Operating System, Cloud provider, etc):
Cloud provider: Scaleway
PLATFORM: GNU/Linux Hostname, datadog-agent-wx99b OS, GNU/Linux Kernel Name, Linux Processor, x86_64 Kernel Release, 5.4.0-80-generic Kernel Version, DataDog/datadog-agent#90-Ubuntu SMP Fri Jul 9 22:49:44 UTC 2021 Machine, x86_64 Hardware Platform, x86_64