canonical / microk8s

DNS Issues #4048

Open AlexWeinstein92 opened 1 year ago

AlexWeinstein92 commented 1 year ago

Summary

Ultimately, my goal is to let my pods communicate with each other via service name. I have been told this should be easy in any Kubernetes environment, but I have not been able to figure it out with MicroK8s. When I exec into a shell in one of my service pods and curl, I consistently get "hostname could not be resolved".

Right now I am also dealing with a pod startup error after running microk8s enable dns. I have not edited the config file (configmap/coredns) in any way. I am not sure if this is related to my inability to curl between pods.

coredns-99d66c86d-d7brs                    0/1     CrashLoopBackOff   2 (18s ago)    40s
alexweinstein@Alexs-Laptop back-end % microk8s kubectl logs coredns-6f5f9b5d74-fl4k7 -n kube-system
/etc/coredns/Corefile:20 - Error during parsing: Unknown directive 'pods'

What Should Happen Instead?

After the coredns pod starts correctly, I should be able to exec into a pod and then curl [servicename:serviceport] with expected answers.

However, I may be overestimating the need for coredns here. Please let me know if it is not necessary for this task.

Introspection Report

inspection-report-20230622_105638.tar.gz

Environment: Mac M1

neoaggelos commented 1 year ago

Hi @AlexWeinstein92

Can you share the contents of microk8s kubectl get configmap -n kube-system coredns -o yaml?

AlexWeinstein92 commented 1 year ago

apiVersion: v1
data:
  Corefile: |
    .:53 {
        errors
        health {
          lameduck 5s
        }
        ready
        log . {
          class error
        }
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods insecure
          fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . 8.8.8.8 8.8.4.4
        cache 30
        loop
        reload
        loadbalance
    }
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"Corefile":".:53 {\n    errors\n    health {\n      lameduck 5s\n    }\n    ready\n    log . {\n      class error\n    }\n    kubernetes cluster.local in-addr.arpa ip6.arpa {\n      pods insecure\n      fallthrough in-addr.arpa ip6.arpa\n    }\n    prometheus :9153\n    forward . 8.8.8.8 8.8.4.4\n    cache 30\n    loop\n    reload\n    loadbalance\n}\n"},"kind":"ConfigMap","metadata":{"annotations":{},"labels":{"addonmanager.kubernetes.io/mode":"EnsureExists","k8s-app":"kube-dns"},"name":"coredns","namespace":"kube-system"}}
  creationTimestamp: "2023-06-21T17:48:15Z"
  labels:
    addonmanager.kubernetes.io/mode: EnsureExists
    k8s-app: kube-dns
  name: coredns
  namespace: kube-system
  resourceVersion: "336619"
  uid: a559c098-5cd9-48df-941e-9a3dbd17cd5f

AlexWeinstein92 commented 1 year ago

@neoaggelos I realized I had actually modified the file I gave above to include a line, pods verified, at the end of the Corefile { ... } block.

Taking that out has resolved the issue with the pod not starting. However, I am still unable to curl one service from another's pod. Here are my service and pod definitions for the one I am trying to hit via scylla-db:9042, in case it is helpful.

apiVersion: v1
kind: Service
metadata:
  name: scylla-db
spec:
  selector:
    app: scylla-db
  clusterIP: None
  ports:
    - port: 9042
      targetPort: 9042
---
apiVersion: v1
kind: Pod
metadata:
  name: scylla-db
  labels:
    app: scylla-db
spec:
  hostname: scylla-db
  setHostnameAsFQDN: true
  containers:
    - image: scylladb/scylla:latest
      name: scylla-db
      ports:
        - name: scylla-db
          containerPort: 9042
  hostNetwork: true

neoaggelos commented 1 year ago

Yes, I was about to mention that the pods verified is not in the right place. You should maybe try changing pods insecure a few lines above to pods verified if this is required.
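
For illustration, here is a sketch of where that option would sit. pods is an option of the CoreDNS kubernetes plugin, so it is only parsed inside the kubernetes { ... } block; as a standalone line in the server block it produces the "Unknown directive 'pods'" error quoted at the top of this issue. The ConfigMap below is the one shared earlier in this thread with only that one value changed. Whether pods verified then works on a given cluster is an assumption: it also requires CoreDNS to be allowed to watch pods, which this thread does not confirm.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health {
          lameduck 5s
        }
        ready
        log . {
          class error
        }
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          # "pods" belongs to the kubernetes plugin, so it stays inside this block.
          pods verified
          fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . 8.8.8.8 8.8.4.4
        cache 30
        loop
        reload
        loadbalance
    }
```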

Further, make sure to recreate any pods after the DNS changes, just to make sure that they do not get stale DNS replies/failures. Given that you specifically set hostNetwork: true and setHostnameAsFQDN: true, I would also look at the dnsPolicy field to make sure that your pod can resolve internal hostnames.

Have a look at https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-dns-policy, and maybe start with a busybox pod to ensure that DNS resolution works as you would expect. Hope this helps!
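
As a minimal sketch of the dnsPolicy point, the pod below is the scylla-db pod from earlier in this thread with one field added. According to the Kubernetes documentation linked above, a pod that runs with hostNetwork: true and the default ClusterFirst policy falls back to the node's resolver, so cluster service names are not looked up through CoreDNS; ClusterFirstWithHostNet is the policy intended for that case. That this is the cause of the failures reported here is an assumption, not something the thread confirms.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: scylla-db
  labels:
    app: scylla-db
spec:
  hostNetwork: true
  # Needed only because hostNetwork is true: keeps the pod on the cluster DNS
  # instead of the node's resolver.
  dnsPolicy: ClusterFirstWithHostNet
  containers:
    - name: scylla-db
      image: scylladb/scylla:latest
      ports:
        - name: scylla-db
          containerPort: 9042
```

If hostNetwork: true is not actually required, removing it should make the extra dnsPolicy field unnecessary.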

AlexWeinstein92 commented 1 year ago

@neoaggelos pods verified is not required, nor are hostNetwork: true and setHostnameAsFQDN: true. I found threads via Google that suggested these might solve my issues, before digging into the state of my coredns pods.

It seems that pods verified leads to the issue captured here https://github.com/canonical/microk8s/issues/2206. I tried using it, and got the same error messages in the coredns pod. Then I took it out, restarted everything, and now coredns logs no issues.

However, even after restarting all services, deployments, and pods, I can't seem to get a curl from one service to the other to work. I have tried going both ways: from my service pod into my db pod and the other way around. No luck so far. I have also tried a busybox pod, which runs fine, but a curl to it from either of my pods still results in "could not resolve host".

neoaggelos commented 1 year ago

@AlexWeinstein92 can you please share more details about how you define your services and how you try to access them? Otherwise it's hard to understand what issue you are experiencing. Can you please share steps that consistently reproduce the issue on your side? Thanks

AlexWeinstein92 commented 1 year ago

The services are built with Akka-grpc, except for the gateway service which is built in Akka-http with stateless Actors. The GRPC services rely on event-sourced stateful actor entities for processing messages.

As it stands I can curl or use a gatling test from outside microk8s to send a request to the gateway via localhost:9042. The requests get passed successfully to whatever service it is intended for (I am using just one endpoint in one service to test this, but it should apply to all), but the scylla-db connection (for event sourcing) is what is causing problems right now.

Right now, I can only hit scylla-db service (which I imported from Docker and ran instead of using scylla-operator due to problems running the latter on my M1 mac) if I change the configuration for the akka contact point from something like

datastax-java-driver {
  basic.contact-points = ["scylla-db:9042"]
}

to something that uses the internal cluster IP (e.g. 10.152.183.116):

datastax-java-driver {
  basic.contact-points = ["10.152.183.116:9042"]
}

This is not a robust solution for production or development, given that I currently have 4 services that rely on the same DB service, with 3 more coming soon.

To reproduce the issue, you can follow these instructions for creating the Docker container of Scylla. I had to pull it into multipass to deploy (see the instructions at "locally built images without a registry") because I had issues with the registry.

Then I kubectl exec into either the scylla or the service-level pod and, inside a bash terminal, try to curl the other pod using its service name as the URL. This always seems to result in hostname not found.

AlexWeinstein92 commented 1 year ago

Update: If I run nslookup scylla-db from within the busybox pod, this is the output:

Server:    10.152.183.10
Address 1: 10.152.183.10 kube-dns.kube-system.svc.cluster.local

Name:      scylla-db
Address 1: 10.1.254.108 10-1-254-108.scylla-db.default.svc.cluster.local

Still, if I try to curl 10-1-254-108.scylla-db.default.svc.cluster.local (which itself is problematic as a hostname for reasons previously stated), I get error: could not resolve host

AlexWeinstein92 commented 1 year ago

@neoaggelos any ideas here? It's important that I figure this out for my project, and I feel very stuck with it

acesir commented 1 year ago

I am experiencing the same issue on my Mac with a basic nginx pod. I set up MetalLB and can curl the external endpoint, and nslookup works fine and outputs the result below. I can also curl the internal pod IP, but any attempt to curl nginx.default.svc.cluster.local fails with curl: (6) Could not resolve host: nginx.default.svc.cluster.local

nslookup nginx.default.svc.cluster.local
Server:    10.152.183.10
Address 1: 10.152.183.10 kube-dns.kube-system.svc.cluster.local

Name:      nginx.default.svc.cluster.local
Address 1: 10.152.183.237

AlexWeinstein92 commented 12 months ago

Bumping this since it has been 2 weeks since anyone has offered any suggestions. I would really like to get it working; replacing clusterIPs in configuration files is a non-scalable workaround.

neoaggelos commented 12 months ago

Hi @AlexWeinstein92, unfortunately, the link you shared was about running ScyllaDB on docker. Would you mind sharing a Kubernetes YAML manifest instead? Indeed, you should not have to rely on using hardcoded service IPs

AlexWeinstein92 commented 12 months ago

@neoaggelos Sorry if there was confusion. The Docker image is being used in the following YAML because the image Scylla provides is not compatible with my M1 machine (i.e. I get errors when I try to run their image, so I have to package it using Docker):

apiVersion: v1
kind: Service
metadata:
  name: scylla-db
spec:
  type: NodePort
  selector:
    app: scylla-db
  ports:
    - port: 9042
      targetPort: 9042
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scylla-db
  labels:
    app: scylla-db
spec:
  selector:
    matchLabels:
      app: scylla-db
  template:
    metadata:
      labels:
        app: scylla-db
    spec:
      containers:
        - name: scylla-db
          image: scylladb/scylla:latest
          imagePullPolicy: Never
          ports:
            - containerPort: 9042
      hostNetwork: true

Or at least, that is the YAML file I am using now in order to have a ClusterIP. Ideally it would be more like:

apiVersion: v1
kind: Service
metadata:
  name: scylla-db
spec:
  selector:
    app: scylla-db
  clusterIP: None
---
apiVersion: v1
kind: Pod
metadata:
  name: scylla-db
  labels:
    app: scylla-db
spec:
  hostname: scylla-db
  containers:
    - image: scylladb/scylla:latest
      imagePullPolicy: IfNotPresent
      name: scylla-db
      ports:
        - name: scylla-db
          containerPort: 9042

neoaggelos commented 12 months ago

OK, some quick notes:

What are the exact steps that you follow that cause the dns resolution to fail?

I've made the manifest a bit simpler, the rest should not be required. Also, scylladb/scylla:latest worked for me just fine on an M1:

apiVersion: v1
kind: Service
metadata:
  name: scylla-db
spec:
  selector:
    app: scylla-db
  ports:
    - port: 9042
      targetPort: 9042
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scylla-db
  labels:
    app: scylla-db
spec:
  selector:
    matchLabels:
      app: scylla-db
  template:
    metadata:
      labels:
        app: scylla-db
    spec:
      containers:
        - name: scylla-db
          image: scylladb/scylla:latest
          ports:
            - containerPort: 9042

This creates a ClusterIP service that I can access at scylla-db:9042 from pods running in the cluster:

$ microk8s kubectl apply -f manifest.yaml

# wait a while
$ microk8s kubectl get pod,svc
NAME                             READY   STATUS    RESTARTS   AGE
pod/scylla-db-7c4bc8d76c-h8hnz   1/1     Running   0          63s

NAME                 TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/kubernetes   ClusterIP   10.152.183.1    <none>        443/TCP    9h
service/scylla-db    ClusterIP   10.152.183.78   <none>        9042/TCP   63s

$ microk8s kubectl run --rm -it --image alpine -- sh
If you don't see a command prompt, try pressing enter.
/ # nslookup scylla-db
Server:     10.152.183.10
Address:    10.152.183.10:53

Name:   scylla-db.default.svc.cluster.local
Address: 10.152.183.78

/ # nc -z scylla-db 9042 && echo it works
it works
/ # nc -z scylla-db.default 9042 && echo it works
it works
/ # nc -z scylla-db.default.svc 9042 && echo it works
it works
/ # nc -z scylla-db.default.svc.cluster.local 9042 && echo it works
it works

If you follow the exact steps as above, which step does not do it for you?

AlexWeinstein92 commented 12 months ago

Thanks so much for this @neoaggelos :)

This is what I am seeing:

% microk8s kubectl run --rm -it --image alpine -- sh
If you don't see a command prompt, try pressing enter.
/ # nslookup scylla-db
Server:     10.152.183.10
Address:    10.152.183.10:53

Name:   scylla-db.default.svc.cluster.local
Address: 10.152.183.216

** server can't find scylla-db.svc.cluster.local: NXDOMAIN

** server can't find scylla-db.svc.cluster.local: NXDOMAIN

** server can't find scylla-db.cluster.local: NXDOMAIN

** server can't find scylla-db.cluster.local: NXDOMAIN

** server can't find scylla-db.home: NXDOMAIN

** server can't find scylla-db.home: NXDOMAIN

I am also noticing that my service image (not scylla-db) tells me nslookup not found when I try it from a shell inside it. To be as specific as I can be, the service runs 5 Scala-written, sbt-docker-published containers, 4 of which are gRPC-based and 1 of which accepts HTTP requests as a gateway to the others. The nslookup not found message applies to all containers. They are all built on eclipse-temurin, and they do have curl available, but it always gives hostname not found.

As part of the process I restarted microk8s, making sure dns is enabled, and deleted all pods, services, deployments before testing. I am also now using a YAML that looks exactly like the one you shared.

neoaggelos commented 12 months ago

@AlexWeinstein92 OK, then it looks like the resolution works?

What about the nc -z portion?

/ # nc -z scylla-db 9042 && echo it works
it works
/ # nc -z scylla-db.default 9042 && echo it works
it works
/ # nc -z scylla-db.default.svc 9042 && echo it works
it works
/ # nc -z scylla-db.default.svc.cluster.local 9042 && echo it works
it works

AlexWeinstein92 commented 12 months ago

That does all seem to work from the alpine pod

neoaggelos commented 12 months ago

> That does all seem to work from the alpine pod

OK, so what exactly is failing then? You will not be able to resolve this hostname from the host itself, or from the multipass VM. Is this what is failing?

AlexWeinstein92 commented 12 months ago

I don't know if I entirely understand your question, but the trouble is specifically that I cannot reach the service by hostname from within a container, whether that is the scylla-db container or one of my Scala service containers. It's strange to me that the alpine image worked, because that is the kind of setup I am trying to get working, just with Scala Docker containers built on eclipse-temurin instead of alpine.

neoaggelos commented 11 months ago

@AlexWeinstein92 ok, can you then give an example workload where the DNS resolution does not work for you? If not, I am not able to reproduce your issue (especially since the service does resolve properly from a pod in the cluster).

Please share the pod that you deploy which is then unable to resolve scylla-db, this might help to pinpoint the issue. Thanks!
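
One possible shape for such a reproduction is a single idle pod built from the same base image family as the failing services. The sketch below is only an assumption about what a minimal workload could look like: the pod name dns-repro and the eclipse-temurin:17-jre tag are hypothetical choices, not taken from the repository discussed in this thread.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dns-repro            # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: temurin
      # Assumption: any eclipse-temurin tag matching the real services would do.
      image: eclipse-temurin:17-jre
      # Keep the container idle so a shell can be opened in it.
      command: ["sleep", "3600"]
```

With this pod running next to the scylla-db service, commands such as microk8s kubectl exec -it dns-repro -- cat /etc/resolv.conf and microk8s kubectl exec -it dns-repro -- getent hosts scylla-db would show whether the container points at the cluster DNS and whether the name resolves, without depending on nslookup being installed.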

AlexWeinstein92 commented 11 months ago

@neoaggelos you can find the code I'm running here https://github.com/improving-app/back-end

There is a README on top level describing how I deploy to microk8s.

To create the docker container I simply use sbt docker:publishLocal and then tag and push to my weinyopp dockerhub repo. To deploy the services I use the microApply.yaml file which can again be found at top level.

AlexWeinstein92 commented 11 months ago

@neoaggelos I am wondering whether you have been able to run my services, and whether you had any issues?

AlexWeinstein92 commented 11 months ago

@neoaggelos Just an FYI - this issue has not been resolved (I have even tried moving to alpine images for my microservices, but they were very problematic) but I have decided to bypass it in my system by hardcoding a clusterIP in my yaml for the service, which I then also hardcode into my microservice DB connections configuration.
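
A rough sketch of that workaround is shown below. It reuses the simplified scylla-db Service from earlier in this thread and pins its address with spec.clusterIP. The address is the example value quoted earlier in the thread and is an assumption here; a pinned address has to lie inside the cluster's service IP range and must not already be allocated to another service.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: scylla-db
spec:
  # Assumption: example address from this thread; the same value is then
  # repeated in each microservice's DB contact-point configuration.
  clusterIP: 10.152.183.116
  selector:
    app: scylla-db
  ports:
    - port: 9042
      targetPort: 9042
```

This keeps the address stable across redeployments of the Service, but every consumer still carries a literal IP, so it remains a workaround rather than a fix for name resolution.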

stale[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.