Open felipejfc opened 7 years ago
@mikekap
ping
Hi @felipejfc, sorry for the delay. Could you give us more details about how you deploy the agent? If you could send us the manifest of the DaemonSet you use (minus the API key, secrets, etc.), that would be great. Please also send us a flare from the agent: https://help.datadoghq.com/hc/en-us/articles/204991415-Send-logs-and-configs-to-Datadog-via-flare-command
Finally, is 100.124.142.246 in the address space of your pods, your nodes, or something else in the cluster? I'm trying to understand how this address was resolved.
Thanks
Hi @hkaj, thanks for your response. Let's go then:
This is the DaemonSet YAML:
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  labels:
    app: dd-agent
  name: dd-agent
  namespace: datadog
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: dd-agent
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: dd-agent
      name: dd-agent
    spec:
      containers:
      - env:
        - name: API_KEY
          value: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
        - name: KUBERNETES
          value: "yes"
        - name: SD_BACKEND
          value: docker
        - name: NON_LOCAL_TRAFFIC
          value: "true"
        - name: DD_HOSTNAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        image: datadog/docker-dd-agent:latest-alpine
        imagePullPolicy: Always
        name: dd-agent
        ports:
        - containerPort: 8125
          name: dogstatsd
          protocol: UDP
        resources:
          limits:
            memory: 400Mi
          requests:
            memory: 200Mi
        volumeMounts:
        - mountPath: /var/run/docker.sock
          name: dockersocket
        - mountPath: /host/proc
          name: procdir
          readOnly: true
        - mountPath: /host/sys/fs/cgroup
          name: cgroups
          readOnly: true
        - mountPath: /opt/datadog-agent/agent/checks.d/matchmaking_check.py
          name: matchmaker-check
          readOnly: true
          subPath: matchmaking_check.py
        - mountPath: /opt/datadog-agent/agent/checks.d/tre-check.py
          name: tre-check
          readOnly: true
          subPath: tre-check.py
        - mountPath: /opt/datadog-agent/agent/checks.d/sidecar-check.py
          name: sidecar-check
          readOnly: true
          subPath: sidecar-check.py
        - mountPath: /opt/datadog-agent/agent/conf.d/auto_conf/tre-check.yaml
          name: datadog-config-volume
          readOnly: true
          subPath: tre-check.yaml
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      - effect: NoSchedule
        operator: Exists
      volumes:
      - hostPath:
          path: /var/run/docker.sock
        name: dockersocket
      - hostPath:
          path: /proc
        name: procdir
      - hostPath:
          path: /sys/fs/cgroup
        name: cgroups
      - configMap:
          defaultMode: 420
          name: matchmaker-check
        name: matchmaker-check
      - configMap:
          defaultMode: 420
          name: tre-check-cm
        name: tre-check
      - configMap:
          defaultMode: 420
          name: sidecar-check
        name: sidecar-check
      - configMap:
          defaultMode: 420
          name: datadog-config
        name: datadog-config-volume
The matchmaker deployment has the following annotations:
spec:
  replicas: 3
  selector:
    matchLabels:
      app: matchmaker-api
  template:
    metadata:
      annotations:
        service-discovery.datadoghq.com/matchmaker-api.check_names: '["matchmaking_check"]'
        service-discovery.datadoghq.com/matchmaker-api.init_configs: '[{}]'
        service-discovery.datadoghq.com/matchmaker-api.instances: '[[{"prometheus_endpoint": "http://%%host%%:9090/metrics", "tags":[]}]]'
...
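For reference, this is what I expect the %%host%% template variable in the instances annotation above to do when service discovery resolves it against a discovered pod. This is only a sketch of the idea, not the agent's actual code; the resolve_instance helper and the example IP are made up for illustration:

# Sketch only: illustrates how the %%host%% placeholder in the "instances"
# annotation should be filled in with the discovered container's current IP.
def resolve_instance(template, container_ip):
    """Replace %%host%% in every string value with the container's current IP."""
    return {
        key: value.replace("%%host%%", container_ip) if isinstance(value, str) else value
        for key, value in template.items()
    }

template = {"prometheus_endpoint": "http://%%host%%:9090/metrics", "tags": []}

# For a freshly created pod, the resolved instance should point at that pod's IP
# (the IP below is arbitrary):
print(resolve_instance(template, "100.96.1.5"))
# {'prometheus_endpoint': 'http://100.96.1.5:9090/metrics', 'tags': []}

The problem described below is that the agent keeps using an IP that no longer belongs to any running pod instead of re-resolving the template after a deploy.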
100.124.142.246 is in the pods' address space. It seems that when I update the deployment and new pods are created, dd-agent does not re-run the discovery and keeps pointing to the old IPs...
A bit more information: this graph is a timeseries split by pod_name. As you can see, at first there were 3 pods reporting. Then I deleted all 3 pods and Kubernetes brought up another 3; the pod names stay the same because it's a StatefulSet. After the new pods are up, dd-agent only collects metrics from 2 of them; the one with no metrics is "matchmaker-worker-2".
It's running in the cluster on node ip-172-20-65-131.ec2.internal:
matchmaker-worker-2 1/1 Running 0 9m 100.126.130.221 ip-172-20-65-131.ec2.internal
Then I exec'd into the dd-agent pod running on the same node, and this is what I see with bin/agent info:
matchmaking_check (custom)
--------------------------
- instance #0 [ERROR]: "HTTPConnectionPool(host='100.126.130.245', port=9090): Max retries exceeded with url: /metrics (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fb181f0a5d0>: Failed to establish a new connection: [Errno 110] Operation timed out',))"
- Collected 0 metrics, 0 events & 0 service checks
It's trying to get metrics from the wrong IP address; if I were to guess, I'd say it's the IP address of one of the old pods that I just deleted.
Hi, I have a check that extends PrometheusCheck; the pods it should scrape are discovered using service discovery.
On every deploy I have to delete all pods from my dd-agent DaemonSet for the checks to report consistent metrics again. The error is:
The host "100.124.142.246" is not the IP of any of the new pods I'm deploying (I'm updating a deployment).
It's like the agent is not refreshing the pods it should be checking against.
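Roughly, the check has the shape below. This is a simplified sketch, not the actual matchmaking_check.py (which isn't shown in this issue); the import path, the NAMESPACE/metrics_mapper attributes, the process() helper, and the metric names are assumptions about the agent 5 PrometheusCheck base class and may differ from the real code.

# Minimal sketch of a checks.d check extending PrometheusCheck.
from checks.prometheus_check import PrometheusCheck


class MatchmakingCheck(PrometheusCheck):
    def __init__(self, name, init_config, agentConfig, instances=None):
        super(MatchmakingCheck, self).__init__(name, init_config, agentConfig, instances)
        # Prefix for submitted metrics, e.g. matchmaking.queue.size (made-up names).
        self.NAMESPACE = 'matchmaking'
        # Map raw Prometheus metric names to Datadog metric names.
        self.metrics_mapper = {
            'matchmaking_queue_size': 'queue.size',
        }

    def check(self, instance):
        # The endpoint comes from the resolved %%host%% template in the
        # service-discovery annotations shown earlier in this thread.
        endpoint = instance.get('prometheus_endpoint')
        if endpoint is None:
            raise Exception('prometheus_endpoint is required in the instance config')
        # Scrape the endpoint and submit the mapped metrics.
        self.process(endpoint, instance=instance)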