Altinity / clickhouse-operator

Altinity Kubernetes Operator for ClickHouse creates, configures and manages ClickHouse® clusters running on Kubernetes
https://altinity.com
Apache License 2.0

Issue with FQDN resolution between ClickHouse cluster nodes – unable to use full service names in pod names #1560

Closed mahesh-kore closed 2 weeks ago

mahesh-kore commented 2 weeks ago

Description:

We are configuring a ClickHouse cluster and want to use fully qualified domain names (FQDNs) instead of short names for communication between cluster nodes. However, even after adding the following parameter to the YAML configuration, it does not work as expected. Instead of the expected FQDN format ({podname}.{headless-svc}.{namespace}.svc.cluster.local), ClickHouse resolves the hosts as {headless-svc}.{namespace}.svc.cluster.local, without the pod name.

Steps to Reproduce:

spec:
  defaults:
    replicasUseFQDN: "yes"

Deploy the ClickHouse cluster with the above settings.
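
For reference, one way to see which hostnames the nodes actually try to reach is to grep the operator-generated remote_servers configuration from inside a pod (a sketch; the pod name and namespace are from our setup, and the generated file names can differ between operator versions, hence the grep over the whole config directories):

# Show the <host> entries the operator generated for remote_servers
# (paths and file names may vary by operator version)
kubectl exec -n default chi-test-test-0-0-0 -c clickhouse -- \
  grep -r "<host>" /etc/clickhouse-server/config.d/ /etc/clickhouse-server/conf.d/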

Error:

chi-test-test-0-0-0 2024.11.14 18:20:12.532646 [ 356 ] {e966ed13-fe70-4efb-9e05-ff6024698ffe} <Error> DNSResolver: Cannot resolve host (chi-test-test-1-2.default.svc.cluster.local), error 0: Host not found. 

Expected Behavior:

The FQDN should resolve in the format {podname}.{headless-svc}.{namespace}.svc.cluster.local, for example chi-test-test-0-0-0.chi-test-test-0-0.default.svc.cluster.local.

Actual Behavior:

The FQDN resolves as {headless-svc}.{namespace}.svc.cluster.local, which is incorrect; the pod name is not included in the FQDN.

Additional Information:

We tried the replicasUseFQDN: "yes" parameter, but the desired behavior was not achieved. We need assistance with the correct configuration, or confirmation of whether this is a bug in the current setup.

Template used:

apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "test"
spec:
  defaults:
    replicasUseFQDN: "yes"
  configuration:
    clusters:
      - name: "test"
        layout:
          shardsCount: 3
          replicasCount: 3
        templates:
          podTemplate: clickhouse-stable
          dataVolumeClaimTemplate: clickhouse-data-volume
          serviceTemplate: svc-template
    zookeeper:
        nodes:
        - host: zookeeper-headless.kore.svc.cluster.local
          port: 2181
    users:
        test/password: kore123
        test/profile: default
        test/quota: default
        test/networks/ip:
            - 0.0.0.0/0
            - ::/0
    files:
      config.d/log_rotation.xml: |-
        <clickhouse>
            <logger>
                <level>information</level>
                <log>/var/log/clickhouse-server/clickhouse-server.log</log>
                <errorlog>/var/log/clickhouse-server/clickhouse-server.err.log</errorlog>
                <size>100M</size>
                <count>5</count>
                <console>1</console>
            </logger>
        </clickhouse>
  templates:
    podTemplates:
    - name: clickhouse-stable
      spec:
        nodeSelector:
          role: master
        dnsConfig:
          options:
          - name: use-vc
        containers:
        - name: clickhouse
          image: clickhouse/clickhouse-server:23.4.2
          resources:
            requests:
              memory: "1024Mi"
              cpu: "500m"
            limits:
              memory: "4048Mi"
              cpu: "2000m"
    volumeClaimTemplates:
      - name: clickhouse-data-volume
        spec:
          # storageClassName: gp3
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 20Gi
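
For reference on the names in the error above: with this 3x3 layout the operator creates one Service per replica (chi-test-test-0-0 through chi-test-test-2-2), which is where a name like chi-test-test-1-2 comes from. A quick way to list them (a sketch; the clickhouse.altinity.com/chi label is assumed from the operator's default labelling):

# Services created by the operator for this installation
# (label name assumed from the operator's defaults)
kubectl get svc -n default -l clickhouse.altinity.com/chi=test

# Pods behind those Services
kubectl get pods -n default -l clickhouse.altinity.com/chi=test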
Slach commented 2 weeks ago

Communication between ClickHouse nodes inside the cluster will always use service names instead of pod names directly.

replicasUseFQDN: "yes"

means using the full names of those services
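
In other words, what this flag changes in the generated remote_servers is the form of the service name; the pod name never appears there. An illustrative fragment (not the operator's literal generated file):

<remote_servers>
    <test>
        <shard>
            <replica>
                <!-- default: short service name, resolved via the namespace search domain -->
                <!-- <host>chi-test-test-1-2</host> -->
                <!-- replicasUseFQDN: "yes": full service name, as seen in the error above -->
                <host>chi-test-test-1-2.default.svc.cluster.local</host>
                <port>9000</port>
            </replica>
        </shard>
    </test>
</remote_servers>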

mahesh-kore commented 2 weeks ago

The issue appears to be that DNS resolution over UDP is blocked in this environment. We have configured the pods to use TCP for DNS resolution (the use-vc option in dnsConfig), and testing with ping from inside the pod confirms that resolution works. However, ClickHouse still fails to resolve the service name, resulting in the following error:

2024.11.14 18:20:17.660787 [ 48 ] {c1e33f52-b6b1-45e6-b1e0-c24514136aa9} <Error> DNSResolver: Cannot resolve host (chi-test-test-1-2.default.svc.cluster.local), error 0: Host not found
root@chi-test-test-0-0-0:/# ping chi-test-test-1-2.default.svc.cluster.local
PING chi-test-test-1-2.default.svc.cluster.local (10.42.0.123) 56(84) bytes of data.
64 bytes from chi-test-test-1-2-0.chi-test-test-1-2.default.svc.cluster.local (10.42.0.123): icmp_seq=1 ttl=64 time=0.038 ms
64 bytes from chi-test-test-1-2-0.chi-test-test-1-2.default.svc.cluster.local (10.42.0.123): icmp_seq=2 ttl=64 time=0.053 ms
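
To narrow this down, the following checks can be run from inside the pod: confirm that the use-vc option from the pod template actually landed in resolv.conf, resolve the name through the glibc/NSS path (which getent uses and which, as far as we understand, ClickHouse's resolver also goes through on this image), and flush ClickHouse's internal DNS cache in case a failed lookup was cached. A sketch, reusing the hostnames from the error above:

# Confirm the use-vc option from dnsConfig is present in the pod
kubectl exec -n default chi-test-test-0-0-0 -c clickhouse -- cat /etc/resolv.conf

# Resolve through the glibc/NSS path instead of ping
kubectl exec -n default chi-test-test-0-0-0 -c clickhouse -- \
  getent hosts chi-test-test-1-2.default.svc.cluster.local

# Drop ClickHouse's internal DNS cache in case a negative result is being cached
# (add --user/--password if the default user is restricted)
kubectl exec -n default chi-test-test-0-0-0 -c clickhouse -- \
  clickhouse-client -q "SYSTEM DROP DNS CACHE"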

How can we resolve this?