DataDog / chaos-controller

:monkey: :fire: Datadog Failure Injection System for Kubernetes
Apache License 2.0

User Issue: Node-level DNS disruptions impact the DNS records of all of the cluster's Kubernetes Services #390

Closed: nikos912000 closed this issue 3 years ago

nikos912000 commented 3 years ago

Describe the bug

While testing DNS disruptions at the node level, I noticed a couple of critical issues.

In summary, these impact the DNS records of all of the cluster's Kubernetes Services (*.svc.cluster.local).

To Reproduce

Steps to reproduce the behavior:

  1. Start the provided minikube setup
    • make minikube-start
    • make minikube-build
    • make install
  2. Do an nslookup on both Services before applying the Disruption:

    / # cat /etc/resolv.conf
    search chaos-demo.svc.cluster.local svc.cluster.local cluster.local
    nameserver 10.96.0.10
    options ndots:5

    / # nslookup demo.chaos-demo.svc.cluster.local
    Name:      demo.chaos-demo.svc.cluster.local
    Address 1: 10.108.230.236 demo.chaos-demo.svc.cluster.local

    / # nslookup dashboard-metrics-scraper.kubernetes-dashboard.svc.cluster.local
    Name:      dashboard-metrics-scraper.kubernetes-dashboard.svc.cluster.local
    Address 1: 10.107.89.68 dashboard-metrics-scraper.kubernetes-dashboard.svc.cluster.local
  3. Curling the 2 Services from the curl pod works as expected. The same applies to any external hostnames like google.com.
  4. Apply the following Custom Resource (a sample kubectl invocation is sketched after this list):
    apiVersion: chaos.datadoghq.com/v1beta1
    kind: Disruption
    metadata:
      name: dns
      namespace: chaos-engineering
    spec:
      level: node
      selector:
        kubernetes.io/hostname: minikube
      count: 100%
      dns:
        - hostname: demo.chaos-demo.svc.cluster.local
          record:
            type: A
            value: 10.0.0.154,10.0.0.13  
  5. Do an nslookup again:

    / # nslookup demo.chaos-demo.svc.cluster.local
    Name:      demo.chaos-demo.svc.cluster.local
    Address 1: 10.0.0.154
    
    / # nslookup dashboard-metrics-scraper.kubernetes-dashboard.svc.cluster.local
    nslookup: can't resolve 'dashboard-metrics-scraper.kubernetes-dashboard.svc.cluster.local': Name does not resolve
  6. Curling the two Services from the curl pod no longer works. External hostnames are still accessible.
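
For reference, step 4 boils down to standard kubectl; the manifest file name below is an assumption:

    kubectl apply -f dns-disruption.yaml
    kubectl -n chaos-engineering get disruptions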

Expected behavior

The disruption should only impact the provided hostname (demo.chaos-demo.svc.cluster.local).

Environment:

Additional context

In minikube there is only one node, which means all outgoing calls to the cluster's Kubernetes Services are affected. In a multi-node setup, only the nodes targeted by the label selector are affected.
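
For example, in a multi-node cluster the blast radius could be narrowed to a single node by selecting on its hostname label (the node name here is an assumption):

    spec:
      level: node
      selector:
        kubernetes.io/hostname: worker-1
      count: 100%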

nikos912000 commented 3 years ago

I think I found what is wrong.

FakeDNS forwards any queries that do not match the specified rules to the IP address set through the dns argument. This is set to 8.8.8.8 by default.

This is fine for external queries or in cases where the cluster's DNS (kube-dns/CoreDNS) uses that address.

Replacing this line with:

if query.domain.decode().endswith('.cluster.local.'):
    # Cluster-internal name: query names arrive fully qualified (trailing
    # dot), so forward these to the in-cluster DNS service instead.
    addr = ('kube-dns.kube-system.svc.cluster.local', 53)
else:
    # Everything else keeps going to the resolver set via the dns argument.
    addr = (args.dns, 53)

fixes the issue.

A proper solution would be for the FakeDNS script to receive a list of additional DNS resolvers as (pattern, IP/host, port) entries, something like: [(".cluster.local.", "kube-dns.kube-system.svc.cluster.local", 53)]. These would be set in the values.yaml and passed down to the controller and injector.
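
A minimal sketch of that selection logic, assuming the mappings arrive as an ordered list of (suffix, host, port) tuples; the names below are hypothetical, not the actual FakeDNS API:

# Hypothetical: parsed from a new CLI argument / values.yaml entry, e.g.
# [(".cluster.local.", "kube-dns.kube-system.svc.cluster.local", 53)]
EXTRA_RESOLVERS = [
    (".cluster.local.", "kube-dns.kube-system.svc.cluster.local", 53),
]

def pick_resolver(domain, default_dns):
    """Pick the upstream (host, port) for a query that matched no rule.

    Query names arrive fully qualified (with a trailing dot), so a plain
    suffix match against ".cluster.local." is enough.
    """
    name = domain.decode()
    for suffix, host, port in EXTRA_RESOLVERS:
        if name.endswith(suffix):
            return (host, port)
    return (default_dns, 53)

The forwarding site above would then become addr = pick_resolver(query.domain, args.dns).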

What do you think @Devatoria @ptnapoleon?

ptnapoleon commented 3 years ago

Thanks for identifying the fix! I do agree that would be the proper solution.

nikos912000 commented 3 years ago

Awesome, thanks @ptnapoleon. I'll be on holidays but if no one else picks this up I'll take care of it once I'm back :)