PowerDNS / lightningstream

Lightning Stream syncs LMDB databases through S3 buckets between multiple servers, including PowerDNS Authoritative server 4.8+ LMDBs.
https://doc.powerdns.com/lightningstream/
MIT License

PDNS hang issue with lightningstream and containers #75

Closed mschirrmeister closed 1 month ago

mschirrmeister commented 1 month ago

I was playing with PowerDNS and Lightningstream in Kubernetes and ran into a very weird issue. It is a StatefulSet and storage comes from Longhorn. PowerDNS auth and Lightningstream are both the latest versions and it is running on ARM64.

The issue is that a query to pdns (DNS or REST API) hangs and returns nothing. Running the pdnsutil tool in the pdns container hangs as well; for example, pdnsutil list-all-zones hangs and the prompt never returns. Even with debug logging there is no error: the log shows the incoming request and that it is being processed, but no answer is returned.

At first I thought it was related to the PID clashing issue for containers that is mentioned in the docs, but no matter what I used for --minimum-pid, the issue persisted.

The issue goes away if I send a DNS query to PDNS for a valid record/zone before Lightning Stream is started. So the command in my lightningstream container is a shell script that first sends a DNS query to the pdns container and then starts Lightning Stream.

It still looks somewhat like a locking issue, but the question is: what is different internally in how the pdns auth server accesses the LMDB backend when a DNS request was made to pdns before Lightning Stream is started, versus when there was no request?

With the workaround it runs stably, with no issues so far.

Do you have any input on what could be going on here?

Configurations listed below.

My pdns.conf looks like this.

pdns@pdns-ss-0:/$ cat /etc/powerdns/pdns.conf
primary=yes
allow-notify-from=0.0.0.0
allow-axfr-ips=127.0.0.1
api=yes
api-key=secret
config-dir=/etc/powerdns
default-soa-content=a.misconfigured.dns.server.invalid hostmaster.@ 0 10800 3600 604800 3600
default-ttl=3600
default-ksk-algorithm=ed25519
default-zsk-algorithm=ed25519
include-dir=/etc/powerdns/pdns.d
load-modules=liblmdbbackend.so
launch=lmdb
lmdb-filename=/var/lib/powerdns/pdns.lmdb
lmdb-shards=1
lmdb-sync-mode=nometasync
lmdb-schema-version=5
lmdb-random-ids=yes
lmdb-map-size=1000
lmdb-flag-deleted=yes
lmdb-lightning-stream=yes
local-address=0.0.0.0,::
log-dns-details=yes
log-dns-queries=yes
log-timestamp=yes
loglevel=7
loglevel-show=yes
query-logging=yes
resolver=1.1.1.1
# server-id
version-string=anonymous
webserver=yes
webserver-allow-from=127.0.0.1,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,192.0.0.0/24
webserver-hash-plaintext-credentials=yes
webserver-loglevel=detailed
webserver-address=0.0.0.0
webserver-port=8081
webserver-password=secret2
zone-cache-refresh-interval=0
zone-metadata-cache-ttl=0

My lightningstream.yaml looks like this.

/app $ cat /lightningstream.yaml
instance: lmdbsync
storage_poll_interval: 10s
lmdb_poll_interval: 10s
storage_force_snapshot_interval: 4h

lmdbs:
  main:
    path: /var/lib/powerdns/pdns.lmdb
    schema_tracks_changes: true
    options:
      no_subdir: true
      create: false
  shard:
    path: /var/lib/powerdns/pdns.lmdb-0
    schema_tracks_changes: true
    options:
      no_subdir: true
      create: false

storage:
  type: s3
  options:
    access_key: pdns
    secret_key: pdns
    bucket: lightningstream
    endpoint_url: http://10.0.3.194:9000
    create_bucket: true
  cleanup:
    enabled: true
    interval: 15m
    must_keep_interval: 24h
    remove_old_instances_interval: 168h

http:
  address: ":8500"

log:
  level: info
  format: human
  timestamp: short

My shell script with the workaround:

/app $ cat start.sh
#!/bin/sh

if [ ${PDNS_LSTREAM_SLEEP}_ == _ ]; then
    echo "no sleep time set"
else
    sleep ${PDNS_LSTREAM_SLEEP}
fi

if [ ${PDNS_LSTREAM_DNS_SERVER}_ == _ ]; then
    echo "no DNS server set"
else
    eval DNS=\$$PDNS_LSTREAM_DNS_SERVER
    echo "ServiceIP: $DNS"
    echo "PodIP: $PDNS_POD_IP"
    dig @$PDNS_POD_IP $PDNS_LSTREAM_DOMAIN $PDNS_QUERY_TYPE
fi

/app/lightningstream --config /app/lightningstream.yaml --minimum-pid 200 --instance ${HOSTNAME}-lstream sync
joel-ling commented 1 month ago

Hello @mschirrmeister, thank you for describing the intriguing phenomenon and its workaround.

As it appears to have played out on a complex setup with many moving parts, we would be stoked if some variables could be excluded/isolated, so as to cut down the search space containing all probable causes.

Since the use of Longhorn introduces many abstractions and interactions, the effects of which on LMDB are yet to be determined, we are curious to learn whether you could reproduce the issue on a simpler setup, sans Longhorn.

Any information you could provide about the following could also turn out to be valuable:

  • the filesystem backing the LMDB volume
  • the Kubernetes manifests used for the deployment (StatefulSet, volumes, storage class)
  • the container images/Dockerfiles used for both containers

(We note that you may be using the Dockerfile in your public repository mentioning the same issue.)

Also highly interesting is the PID-clash hypothesis, considering that LMDB locking appears to be keyed on PIDs. Could you please verify that the Lightning Stream process started with the --minimum-pid option indeed does not share a process ID with the Authoritative Nameserver?

This article about having containers in a Pod share a process namespace presents an alternative means of resolving the PID conflict that might be worth an attempt.
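
For reference, a minimal sketch of how this could be tried against the StatefulSet described later in this thread (the pdns-ss name and pdns namespace are taken from those manifests; this is only an illustration, not a tested fix):

# Enable a shared PID namespace for all containers in the pod
kubectl patch statefulset pdns-ss -n pdns --type merge \
  -p '{"spec":{"template":{"spec":{"shareProcessNamespace":true}}}}'

# Afterwards each container should see the other's processes
kubectl exec -n pdns pdns-ss-0 -c pdns-lightningstream -- ps aux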

Finally, would you mind sharing your thoughts on the following bullet points under the Caveats section of the LMDB docs?

  • Do not use LMDB databases on remote filesystems, even between processes on the same host. This breaks flock() on some OSes, possibly memory map sync, and certainly sync between programs on different hosts.

  • Opening a database can fail if another process is opening or closing it at exactly the same time.

We look forward to hearing your response.

wojas commented 1 month ago

Note that create: false is used in Lightning Stream, which means that it expects the LMDB to be created by pdns. The query you perform against pdns may be the trigger to create this LMDB.
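
A quick way to check whether pdns has created the LMDB files yet, before and after such a query (illustrative commands; paths and pod/container names are taken from the configuration shown elsewhere in this thread):

# List the LMDB data and lock files in the shared volume
kubectl exec -n pdns pdns-ss-0 -c pdns-ss -- ls -l /var/lib/powerdns/
# Expect pdns.lmdb, pdns.lmdb-0 and their *-lock companions once pdns has created them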

Suggestions:

  • Verify that both the Auth and Lightning Stream containers have the filesystem permissions needed to create and open the LMDB files.
  • Consider creating/initializing the LMDBs before both processes start, for example in an initContainer, so that neither process races the other on creation.

wojas commented 1 month ago

Additionally, it is essential that Auth and LS run on the same node with a local filesystem for the LMDB, ideally tmpfs. In a cloud deployment it is S3 that provides your persistent data store, not the LMDB.

mschirrmeister commented 1 month ago

Hello @joel-ling @wojas,

thanks for the answers.

@joel-ling, I will try it with something simpler than Longhorn. I am thinking about hostPath/local storage. Plain NFS would be another option, but on the other hand, in Longhorn a ReadWriteMany volume is already mounted via NFS into the containers.

The filesystem is ext4. I will add the YAML files for all the objects in a separate comment.

Yes, I use an image based on the Dockerfile in my repo. I created that since I did not find anything else for ARM.

The process IDs are definitely different. Even without --minimum-pid, the Lightning Stream PID was typically higher, since I executed a few commands before Lightning Stream was started. See below.

marco@loop ~> kubectl exec -it -n pdns pdns-ss-0 -c pdns-ss -- ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
pdns           1  0.0  0.0   1944   424 ?        Ss   May13   0:11 /usr/bin/tini -- /usr/local/sbin/pdns_server-startup
pdns           7  0.1  1.6 18256748 63740 ?      SLl  May13   7:03 /usr/local/sbin/pdns_server --disable-syslog
root       17337  0.0  0.0   3952  3112 pts/0    Ss+  16:50   0:00 /bin/bash
pdns       18470  0.0  0.0   6440  2492 pts/1    Rs+  16:51   0:00 ps aux

marco@loop ~> kubectl exec -it -n pdns pdns-ss-0 -c pdns-lightningstream -- ps aux
PID   USER     TIME  COMMAND
    1 953       0:00 {start.sh} /bin/sh /app/start.sh
   13 953       0:00 {ld-musl-aarch64} ld-linux-aarch64.so.1 --argv0 /app/light
  208 953       3:43 {ld-musl-aarch64} ld-linux-aarch64.so.1 --argv0 /app/light
  262 953       0:00 sh
  291 953       0:00 ps aux

From the host's perspective:

root@k3stw3 ~# crictl ps | grep pdns-ss-0
9c56b26494970       eee647f7298ef       32 hours ago        Running             pdns-lightningstream           86                  212031ffaf8cb       pdns-ss-0
2927017d5aa67       919d34f6ff451       2 days ago          Running             pdns-ss                        0                   212031ffaf8cb       pdns-ss-0

root@k3stw3 ~# crictl inspect --output go-template --template '{{.info.pid}}' 2927017d5aa67
2164914
root@k3stw3 ~# crictl inspect --output go-template --template '{{.info.pid}}' 9c56b26494970
1337897

I tried out the shareProcessNamespace: true option, but that does not help; it still hangs. The PIDs look like this when it is enabled.

marco@loop ~> kubectl exec -it -n pdns pdns-ss-0 -c pdns-lightningstream -- ps aux
PID   USER     TIME  COMMAND
    1 953       0:00 /pause
   13 953       0:00 /usr/bin/tini -- /usr/local/sbin/pdns_server-startup
   20 953       0:00 /usr/local/sbin/pdns_server --disable-syslog
   21 953       0:00 {ld-musl-aarch64} ld-linux-aarch64.so.1 --argv0 /app/light
  213 953       0:00 {ld-musl-aarch64} ld-linux-aarch64.so.1 --argv0 /app/light
  235 953       0:00 ps aux

My take on these 2 items is that the first one (remote fs) is probably the problem. As mentioned above, Longhorn does RWX via NFS.

  • Do not use LMDB databases on remote filesystems, even between processes on the same host. This breaks flock() on some OSes, possibly memory map sync, and certainly sync between programs on different hosts.
  • Opening a database can fail if another process is opening or closing it at exactly the same time.

Item number 2 should not be the problem, because I tested with a long delay: a sleep of 60 seconds before Lightning Stream was started. If there was no request to PDNS during the delay, it hung. If you send a query to PDNS during the delay, all is fine, which is basically the workaround.

As for the first item: it might be the problem, but on the other hand, it does work with this specific access pattern. Locking or no locking, whatever LMDB does seems to be fine.

I saw the flock option in the NFS manpage: local_lock can be set to a few values, and here it is set to none. I tried to change it, but the change was not applied. Unfortunately, I have now found out that you cannot change most NFS options; they are hard-coded in the Longhorn manager (see the mount options discussion). Testing with another NFS storage might still be worth it, even if a remote fs should not be used. I would of course not use this for a real production setup, but locally or for dev environments the options for persistent storage volumes are limited.
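
One way to double-check which lock options the Longhorn NFS mount actually ended up with is to look at the live mount table inside the container (illustrative; the grep runs on the local side of the pipe):

kubectl exec -n pdns pdns-ss-0 -c pdns-ss -- cat /proc/mounts | grep nfs
# Look for local_lock=none vs. local_lock=flock among the mount options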

@wojas, create: false is there because I started the setup with PDNS Auth alone. I had it set to true first, but also wanted to see if false makes a difference for my issue. The query does not create the files; from what I have seen, PDNS creates the LMDB files (at least the main file) when it starts.

Regarding your suggestions: permissions are OK; I have defined them in securityContext, which all containers in the pod use. Creating the LMDB in an initContainer should not be needed in the current setup, since the files are persistent. Even if the StatefulSet or all objects in Kubernetes are deleted, the volume within Longhorn stays and will be reused for the next pod.

The StatefulSet makes sure that PDNS Auth and Lightning Stream run on the same node. I have not tried tmpfs yet, because the emptyDir volume documentation mentions that all data gets deleted when the pod is removed. Right now I wanted it to be able to run even if Lightning Stream or S3 is not available. But your point that S3 is, or can be treated as, the persistent datastore sounds interesting. I had not thought about that yet; I will try it out.

mschirrmeister commented 1 month ago

Here are my files and some context.

I use a StatefulSet so that all containers in a pod are scheduled on the same node and so that each pod, if scaled up, is scheduled on a different node.

I also wanted to use existing disks for the pods, since I want the data to survive if pods get deleted. Therefore I use volumeClaimTemplates in the StatefulSet, which gives predictable names for the PersistentVolumeClaims. The 3 PersistentVolumes reference these names in claimRef.

In Longhorn I created 3 volumes, pdns-ss-data-0, pdns-ss-data-1 and pdns-ss-data-2, which are referenced in the PVs under volumeHandle.

With this config you can apply, delete, or scale the StatefulSet up/down (up to 3) and it will always use the same disks. I think that is typically how it is done for databases.

storageclass-nfs.yaml

# This storage class creates a durable pv for RWX volumes
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: longhorn-nfs    
provisioner: driver.longhorn.io
reclaimPolicy: Retain
volumeBindingMode: Immediate
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "30"
  fromBackup: ""
  fsType: "ext4"
  dataLocality: "disabled"
  nfsOptions: "vers=4.2,noresvport,intr,hard,softerr,timeo=600,retrans=5,local_lock=flock"

pdns-auth-configmap-202405041.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config-pdns-auth-202405041
  namespace: pdns
data:
  pdns.conf: |
    primary=yes
    allow-notify-from=0.0.0.0
    allow-axfr-ips=127.0.0.1
    api=yes
    api-key=secret
    config-dir=/etc/powerdns
    default-soa-content=a.misconfigured.dns.server.invalid hostmaster.@ 0 10800 3600 604800 3600
    default-ttl=3600
    default-ksk-algorithm=ed25519
    default-zsk-algorithm=ed25519
    include-dir=/etc/powerdns/pdns.d
    load-modules=liblmdbbackend.so
    launch=lmdb
    lmdb-filename=/var/lib/powerdns/pdns.lmdb
    lmdb-shards=1
    lmdb-sync-mode=nometasync
    lmdb-schema-version=5
    lmdb-random-ids=yes
    lmdb-map-size=1000
    lmdb-flag-deleted=yes
    lmdb-lightning-stream=yes
    # launch=gsqlite3
    # gsqlite3-database=/var/lib/powerdns/pdns.sqlite3
    local-address=0.0.0.0,::
    log-dns-details=yes
    log-dns-queries=yes
    log-timestamp=yes
    # 0 = emergency, 1 = alert, 2 = critical, 3 = error, 4 = warning, 5 = notice, 6 = info, 7 = debug
    loglevel=4
    loglevel-show=yes
    query-logging=yes
    # resolver=blocky.blocky.svc.cluster.local
    # resolver=10.43.226.25
    resolver=1.1.1.1
    # server-id
    version-string=anonymous
    webserver=yes
    webserver-allow-from=127.0.0.1,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,192.0.0.0/24
    webserver-hash-plaintext-credentials=yes
    # one of "none", "normal", "detailed"
    webserver-loglevel=normal
    webserver-address=0.0.0.0
    webserver-port=8081
    webserver-password=secret2
    zone-cache-refresh-interval=0
    zone-metadata-cache-ttl=0

pdns-lightningstream-configmap-202405102.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config-lightningstream-202405102
  namespace: pdns
data:
  lightningstream.yaml: |
    instance: lmdbsync
    storage_poll_interval: 10s
    lmdb_poll_interval: 10s
    storage_force_snapshot_interval: 4h

    lmdbs:
      main:
        path: /var/lib/powerdns/pdns.lmdb
        schema_tracks_changes: true
        options:
          no_subdir: true
          create: false
      shard:
        path: /var/lib/powerdns/pdns.lmdb-0
        schema_tracks_changes: true
        options:
          no_subdir: true
          create: false

    storage:
      type: s3
      options:
        access_key: pdns
        secret_key: pdns#112
        bucket: lightningstream
        endpoint_url: http://minio.minio.svc.cluster.local:9000
        create_bucket: true
      cleanup:
        enabled: true
        interval: 15m
        must_keep_interval: 24h
        remove_old_instances_interval: 168h

    http:
      address: ":8500"

    log:
      level: info
      format: human
      timestamp: short

pdns-lightningstream-start-configmap-202405111.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config-lightningstream-start-202405111
  namespace: pdns
data:
  start.sh: |
    #!/bin/sh

    if [ ${PDNS_LSTREAM_SLEEP}_ == _ ]; then
        echo "no sleep time set"
    else
        sleep ${PDNS_LSTREAM_SLEEP}
    fi

    if [ ${PDNS_LSTREAM_DNS_SERVER}_ == _ ]; then
        echo "no DNS server set"
    else
        eval DNS=\$$PDNS_LSTREAM_DNS_SERVER
        echo "ServiceIP: $DNS"
        echo "PodIP: $PDNS_POD_IP"
        dig @$PDNS_POD_IP $PDNS_LSTREAM_DOMAIN $PDNS_QUERY_TYPE
    fi

    # seq 210 | xargs -Iz echo "Generating pids. Count z"
    /app/lightningstream --config /lightningstream.yaml --minimum-pid 200 --instance ${HOSTNAME}-lstream sync

pdns-statefulset-nfs-all.yaml

apiVersion: v1
kind: Namespace
metadata:
  name: pdns
  labels:
    name: pdns
---
kind: PersistentVolume
apiVersion: v1
metadata:
  name: pdns-ss-pv-data-nfs-0
spec:
  persistentVolumeReclaimPolicy: Retain
  capacity:
    storage: 5Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  claimRef:
    namespace: pdns
    name: pdns-ss-vol-nfs-pdns-ss-0
  csi:
    driver: driver.longhorn.io
    fsType: ext4
    volumeHandle: pdns-ss-data-0
  storageClassName: longhorn-nfs
---
kind: PersistentVolume
apiVersion: v1
metadata:
  name: pdns-ss-pv-data-nfs-1
spec:
  persistentVolumeReclaimPolicy: Retain
  capacity:
    storage: 5Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  claimRef:
    namespace: pdns
    name: pdns-ss-vol-nfs-pdns-ss-1
  csi:
    driver: driver.longhorn.io
    fsType: ext4
    volumeHandle: pdns-ss-data-1
  storageClassName: longhorn-nfs
---
kind: PersistentVolume
apiVersion: v1
metadata:
  name: pdns-ss-pv-data-nfs-2
spec:
  persistentVolumeReclaimPolicy: Retain
  capacity:
    storage: 5Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  claimRef:
    namespace: pdns
    name: pdns-ss-vol-nfs-pdns-ss-2
  csi:
    driver: driver.longhorn.io
    fsType: ext4
    volumeHandle: pdns-ss-data-2
  storageClassName: longhorn-nfs
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: pdns-ss
  namespace: pdns
  labels:
    app.kubernetes.io/name: pdns-ss
spec:
  persistentVolumeClaimRetentionPolicy:
    whenScaled: Retain
    whenDeleted: Delete
  replicas: 1
  selector:
    matchLabels:
      app: pdns-ss
  serviceName: pdns-ss
  template:
    metadata:
      labels:
        app: pdns-ss
        app.kubernetes.io/name: pdns-ss
    spec:
      securityContext:
        fsGroup: 953
        runAsUser: 953
        runAsGroup: 953
      containers:
      - name: pdns-ss
        # image: powerdns/pdns-auth-49:4.9.0
        image: mschirrmeister/pdns-auth:master
        env:
        - name: TZ
          value: "Europe/Berlin"
        ports:
          - containerPort: 53
            protocol: TCP
          - containerPort: 53
            protocol: UDP
          - containerPort: 8081
            protocol: TCP
        volumeMounts:
        - name: config
          mountPath: "/etc/powerdns/pdns.conf"
          subPath: pdns.conf
        - name: resolver-config
          mountPath: "/etc/powerdns/pdns.d"
        - name: pdns-ss-vol-nfs
          mountPath: /var/lib/powerdns
      - name: pdns-lightningstream
        image: mschirrmeister/powerdns-lightningstream:v0.4.3
        env:
        - name: PDNS_LSTREAM_DNS_SERVER
          value: PDNS_SS_SERVICE_HOST
        - name: PDNS_LSTREAM_SLEEP
          value: "1"
        - name: PDNS_LSTREAM_DOMAIN
          value: "dummy.zone"
        - name: PDNS_QUERY_TYPE
          value: SOA
        - name: PDNS_POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        command: [ "/app/start.sh" ]
        volumeMounts:
        - name: pdns-ss-vol-nfs
          mountPath: /var/lib/powerdns
          readOnly: false
        - name: config-lightningstream
          mountPath: "/lightningstream.yaml"
          subPath: lightningstream.yaml
        - name: config-lightningstream-start
          mountPath: "/app/start.sh"
          subPath: start.sh
      # - name: debug-container
      #   image: alpine:latest
      #   imagePullPolicy: Always
      #   args: ["tail", "-f", "/dev/null"]
      #   volumeMounts:
      #   - name: pdns-ss-vol
      #     mountPath: /var/lib/powerdns
      #     # readOnly: true
      restartPolicy: Always
      volumes:
      - name: config
        configMap:
          name: app-config-pdns-auth-202405041
          items:
          - key: "pdns.conf"
            path: "pdns.conf"
      - name: config-lightningstream
        configMap:
          name: app-config-lightningstream-202405102
          items:
          - key: "lightningstream.yaml"
            path: "lightningstream.yaml"
      - name: config-lightningstream-start
        configMap:
          name: app-config-lightningstream-start-202405111
          items:
          - key: "start.sh"
            path: "start.sh"
          defaultMode: 0755
      - name: resolver-config
        emptyDir: {}
      initContainers:
      - name: pdns-resolver-config
        image: jonlabelle/network-tools
        command: [ "sh" ]
        # args: [ "-c", "echo resolver=`dig blocky.blocky.svc.cluster.local +short` > /resolver-config/resolver.conf" ]
        args: [ "-c", "echo" ]
        volumeMounts:
        - name: resolver-config
          mountPath: /resolver-config
  volumeClaimTemplates:
  - metadata:
      name: pdns-ss-vol-nfs
    spec:
      accessModes:
        - ReadWriteMany
      resources:
        requests:
          storage: 1Gi
      storageClassName: longhorn-nfs
---
apiVersion: v1
kind: Service
metadata:
  name: pdns-ss
  namespace: pdns
  annotations:
    metallb.universe.tf/address-pool: second-pool
    pdns.prometheus/scrape: "true"
    pdns.prometheus/scheme: "http"
    pdns.prometheus/path: "/metrics"
    pdns.prometheus/port: "8081"
spec:
  type: LoadBalancer
  externalTrafficPolicy: Cluster
  # externalTrafficPolicy: Local
  selector:
    app: pdns-ss
  ports:
    - name: dnstcp
      protocol: TCP
      port: 53
      targetPort: 53
    - name: dnsudp
      protocol: UDP
      port: 53
      targetPort: 53
    - name: pdnshttp
      protocol: TCP
      port: 8081
      targetPort: 8081
mschirrmeister commented 1 month ago

I did some tests with a simple setup, with no Longhorn involved. The following scenarios were tested.

All 3 show the same symptom / issue.

I think that rules out a remote fs or NFS issue.
My development cluster has kernel 5.15.93 and I thought it might be a kernel issue, so I added another node with the latest 6.8.10 kernel, but it did not make any difference.

emptyDir is not even usable with the workaround, because the data is deleted when the StatefulSet is deleted. This means a DNS query before LS starts is not possible, since LS has to load the data first.

A local volume can be used with the workaround. If you start empty, you initially have to start without the LS container and seed/insert some valid data into LMDB, which can then be used on the next start.
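
Seeding could be as simple as creating one zone with pdnsutil while only the Auth container is running, so there is something to query on the next start. A minimal sketch (dummy.zone matches the PDNS_LSTREAM_DOMAIN used in the StatefulSet; adjust to your setup):

# Create a zone with SOA/NS records so the workaround query has something to hit
kubectl exec -it -n pdns pdns-ss-0 -c pdns-ss -- pdnsutil create-zone dummy.zone ns1.dummy.zone
kubectl exec -it -n pdns pdns-ss-0 -c pdns-ss -- pdnsutil list-all-zones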

I have created simple YAML files and recorded all the steps to reproduce it on a single node. I will post this in the next comment; it will be a long one.

mschirrmeister commented 1 month ago

Environment

ARM64 SBC

Initially tested on a Kubernetes cluster with 3 Rock Pi 4b worker nodes, Debian Bullseye, Kernel 5.15.93-rockchip64 (Armbian)
Reproduction environment (steps below) is Rock 5b, Linux Debian Bookworm, Kernel 6.8.10-edge-rockchip-rk3588 (Armbian)

Prepare

Login as root

Install k3sup to deploy K3s

curl -sLS https://get.k3sup.dev | sh

Install MinIO client

curl -O https://dl.min.io/client/mc/release/linux-arm64/mc
install mc /usr/local/bin/mc

Directories

mkdir -p /mnt/localstorage/minio
mkdir -p /mnt/localstorage/pdns-auth

Install

Install K3s

k3sup install --local --k3s-extra-args "--disable servicelb --disable traefik" --k3s-version v1.29.2+k3s1

kubectl get node -o wide
NAME                        STATUS   ROLES                  AGE   VERSION        INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                            KERNEL-VERSION                CONTAINER-RUNTIME
k3spdnstest.home.marco.cx   Ready    control-plane,master   27s   v1.29.2+k3s1   10.0.3.18     <none>        Armbian 24.5.0-trunk.645 bookworm   6.8.10-edge-rockchip-rk3588   containerd://1.7.11-k3s2

PDNS files to reproduce the issue

git clone https://github.com/mschirrmeister/pdns-ls-issue-repro
cd pdns-ls-issue-repro/

Important

Modify the files minio-pdns-deployment.yaml and pdns-statefulset-local-all.yaml and replace the hostname under nodeAffinity.
Get your hostname with one of the following commands.

kubectl get node
kubectl get node -o jsonpath="{.items[0].metadata.labels.kubernetes\.io/hostname}"

Deploy MinIO

kubectl apply -f minio-pdns-deployment.yaml

Configure MinIO client

kubectl get service -n minio -o jsonpath="{.items[0].spec.clusterIP}"
# Use ip from above command
mc alias set pdnstest http://10.43.55.235:9000/ pdns1234 pdns1234

# or in one command
mc alias set pdnstest http://$(kubectl get service -n minio -o jsonpath="{.items[0].spec.clusterIP}"):9000/ pdns1234 pdns1234

Setup user and permissions in MinIO

# access key = pdns    secret key = pdns#112       used in lightningstream config
mc admin user add pdnstest pdns pdns#112
mc admin group add pdnstest users pdns
# mc admin user info pdnstest pdns

mc admin policy create pdnstest readwriteusers read-write-pdns.json
# mc admin policy ls pdnstest
# mc admin policy info pdnstest readwriteusers
mc admin policy attach pdnstest readwriteusers --group users
# mc admin group info pdnstest users

tmpfs

Deploy PDNS Auth and Lightningstream

kubectl apply -f pdns-ns.yaml
kubectl apply -f pdns-auth-configmap.yaml
kubectl apply -f pdns-lightningstream-configmap.yaml
kubectl apply -f pdns-lightningstream-start-configmap.yaml
kubectl apply -f pdns-statefulset-tmpfs-all.yaml

Check for the pods and containers and wait until they are started

kubectl get pod -A -o wide --watch

Check logs

kubectl logs -f -n pdns pdns-ss-0 pdns-ss
kubectl logs -f -n pdns pdns-ss-0 pdns-lightningstream

Check pdns data directory to verify it is tmpfs

root@k3spdnstest ~# kubectl exec -it -n pdns pdns-ss-0 -c pdns-ss -- mount | grep -Ei '(tmpfs.+powerdns)'
tmpfs on /var/lib/powerdns type tmpfs (rw,relatime,size=204800k)
root@k3spdnstest ~# kubectl exec -it -n pdns pdns-ss-0 -c pdns-lightningstream -- mount | grep -Ei '(tmpfs.+powerdns)'
tmpfs on /var/lib/powerdns type tmpfs (rw,relatime,size=204800k)

Read the data in S3

mc ls --recursive --versions pdnstest/lightningstream
[2024-05-18 19:40:45 CEST]   249B STANDARD null v1 PUT main__lmdbsync__20240518-174045-849810029__G-0000000000000000.pb.gz

Do a dns query

# This query will work, but since the zone does not exist we will get a "status: REFUSED"
dig foobar.baz soa @$(kubectl get service -n pdns -o jsonpath="{.items[0].spec.clusterIP}")

Login to PDNS

kubectl exec -it -n pdns pdns-ss-0 -c pdns-ss -- bash
# This should still work
pdns@pdns-ss-0:/$ pdnsutil list-all-zones

# This will hang
pdns@pdns-ss-0:/$ pdnsutil create-zone foobar.baz
Creating empty zone 'foobar.baz'
^C

Do another dns query (This will hang / timeout)

kubectl get service -n pdns -o jsonpath="{.items[0].spec.clusterIP}"
dig foobar.baz soa @10.43.149.101

# Or with one command
dig foobar.baz soa @$(kubectl get service -n pdns -o jsonpath="{.items[0].spec.clusterIP}")

Notes

At this point, the pdnsutil create-zone foobar.baz command above created something in LMDB. The zone is only partially created (not correctly), because it has no SOA record like the one defined in the pdns config. If you redeploy the StatefulSet, dig or pdnsutil will hang on the first run, because there is now data to load and there was no query. It is not even possible to apply the workaround of a valid query before LS starts, because LS needs to load the data first.
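
On a setup where pdnsutil still responds (for example with the Debian-based image mentioned further down), the half-created zone can be inspected like this (illustrative):

kubectl exec -it -n pdns pdns-ss-0 -c pdns-ss -- pdnsutil list-zone foobar.baz
kubectl exec -it -n pdns pdns-ss-0 -c pdns-ss -- pdnsutil check-zone foobar.baz
# check-zone reports problems such as a missing SOA record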

Try out with the following.

kubectl delete -f pdns-statefulset-tmpfs-all.yaml
kubectl apply -f pdns-statefulset-tmpfs-all.yaml

Both commands will hang

kubectl exec -it -n pdns pdns-ss-0 -c pdns-ss -- pdnsutil list-all-zones
dig foobar.baz soa @$(kubectl get service -n pdns -o jsonpath="{.items[0].spec.clusterIP}")

Reset

kubectl delete -f pdns-statefulset-tmpfs-all.yaml
mc rm --recursive pdnstest/lightningstream --force
mc ls --recursive --versions pdnstest/lightningstream

Start over

kubectl apply -f pdns-statefulset-tmpfs-all.yaml

Now you can run DNS queries again (like above on the first run), which will of course not return anything, but they do not hang. You can also run kubectl exec -it -n pdns pdns-ss-0 -c pdns-ss -- pdnsutil list-all-zones again, which should not hang but of course will also not return anything.

local volume

Workaround setup with local persistent storage across deployments.
Delete the tmpfs StatefulSet before you start.

Full cleanup

kubectl delete -f pdns-statefulset-tmpfs-all.yaml
kubectl delete -f pdns-statefulset-local-all.yaml
kubectl delete -f minio-pdns-deployment.yaml
rm -rf /mnt/localstorage/minio/*
rm -rf /mnt/localstorage/minio/.minio.sys
rm -rf /mnt/localstorage/pdns-auth/*
rm -rf ~/.mc/
k3s-uninstall.sh
mschirrmeister commented 1 month ago

I think I found the problem: the image I built. When I did not find an ARM image in the beginning, I created my own Dockerfile based on one of my other Go Dockerfiles, which uses Alpine. After further debugging over the last few days, I saw a comment in your Dockerfile about a possible locking issue when using Alpine.

I tested with a Debian-based image and that works fine. No idea if there is a way to get it to work with Alpine, besides the workaround query. For now I will switch to a Debian-based image.
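
A quick way to confirm which libc each binary is linked against (illustrative; ldd may be missing from very minimal images, in which case the dynamic loader shown in the ps output or in /proc/&lt;pid&gt;/maps tells the same story):

# In the Alpine-based image this should point at musl (ld-musl-aarch64.so.1)
kubectl exec -n pdns pdns-ss-0 -c pdns-lightningstream -- ldd /app/lightningstream

# In the Debian-based image this should show glibc (libc.so.6, ld-linux-aarch64.so.1)
kubectl exec -n pdns pdns-ss-0 -c pdns-ss -- ldd /usr/local/sbin/pdns_server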

wojas commented 1 month ago

Good to hear, thank you for your detailed investigation!

LMDB does interesting things with locking; it is strongly recommended to stick to the same libc for multiple processes sharing the same LMDB.

joel-ling commented 1 month ago

Hello @mschirrmeister, thank you for your detailed report, which we followed closely and were able to reproduce! Observing that pdnsutil create-zone and subsequently pdnsutil list-all-zones hang on a certain futex() system call seems to corroborate the idea that the issue is lock-related indeed.
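
For anyone reproducing this, that kind of observation can be made with strace from inside the Auth container, assuming strace is installed and ptrace is permitted there (illustrative):

kubectl exec -it -n pdns pdns-ss-0 -c pdns-ss -- strace -f pdnsutil list-all-zones
# In the hanging case the trace ends in a futex(..., FUTEX_WAIT, ...) call that never returns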

To prove your hypothesis about Alpine being a cause (refined by @wojas pointing to libc as the crux of the issue), I tried replacing the C library shipped by Alpine with the GNU C Library (albeit a custom build by @sgerrand to save time), and voila! No more blocking.

Here's the very rough Dockerfile put together for Science to see if Alpine can get along with Lightning Stream. It resulted in a 43 MB Docker image, slightly larger than your base image at 37 MB. For scale, Alpine weighs about 7 MB while Debian is 117 MB as at the time of writing.

FROM mschirrmeister/powerdns-lightningstream

RUN wget -q -O /etc/apk/keys/sgerrand.rsa.pub https://alpine-pkgs.sgerrand.com/sgerrand.rsa.pub \
 && wget https://github.com/sgerrand/alpine-pkg-glibc/releases/download/2.35-r1/glibc-2.35-r1.apk \
 && apk add --force-overwrite glibc-2.35-r1.apk \
 && rm -rf /var/cache/apk/* glibc-2.35-r1.apk

# Commands copied and pasted from https://github.com/sgerrand/alpine-pkg-glibc?tab=readme-ov-file#installing

(You will most likely need to perform extra steps to obtain build artifacts for arm64.)

Perhaps @hyc might like to add another caveat to the LMDB docs about ensuring a compatible C library? 😅

Thank you for your interest in PowerDNS and Lightning Stream, and for your active contribution to this discovery.

sgerrand commented 1 month ago

Changing the C library from the native musl version to one compiled on Debian doesn't sound like a fix for this problem. Perhaps the root cause is related to where your software is being compiled? Is there a version of pdnsutil compiled on and for Alpine Linux?

hyc commented 1 month ago

In all this you never got a stack trace of any of the hanging processes? The fact that using glibc avoids the problem points to a bug in musl's pthread_mutex implementation. It would make more sense to report a bug to them than for us to add a caveat to LMDB's docs.

Oh I see, if you're running multiple processes each linked against different libc implementations, yeah that will cause problems on systems using POSIX mutexes. To do interprocess concurrency with Pthread mutexes, the actual pthread_mutex structure lives in the shared memory map. If two different libc's define their pthread_mutex struct differently, then they cannot interoperate.
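
This sharing is easy to see on the setup from this thread: both processes mmap the same LMDB lock files, which is where those mutexes live (illustrative check; the file names follow from lmdb-filename and the no_subdir option above):

# The pdns.lmdb-lock / pdns.lmdb-0-lock mappings should appear in both containers
kubectl exec -n pdns pdns-ss-0 -c pdns-ss -- sh -c 'grep lmdb /proc/*/maps'
kubectl exec -n pdns pdns-ss-0 -c pdns-lightningstream -- sh -c 'grep lmdb /proc/*/maps'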

There is a related issue already in our tracker: https://bugs.openldap.org/show_bug.cgi?id=8582. Even using the same version of glibc, the pthread_mutex struct differs between 32-bit and 64-bit binaries, which is another source of incompatibility.

joel-ling commented 1 month ago

Thanks to @sgerrand and @hyc for joining the conversation; very happy to have your participation!

It would appear that the proliferation of container technology has lowered the barrier for processes to run on the same machine and share a memory map, all the while linking to different implementations of the C standard library.

This shifts the common denominator between interacting processes one level down from APIs to the underlying data structures and their manipulation, falling just outside the scope of the standards. Who would have thought!

The onus is then on the administrator of a container platform to ensure that programs operating on the same data via operating system APIs do so in a compatible manner. This may in practice translate to running such programs on the same operating system.

Other database management systems that expose APIs over the network and do not share state satisfy the "principles" (or should one say assumptions) of a cloud-native landscape, whereas LMDB (for very good reasons!) does not comply. It may be painfully obvious to those familiar, but for the sake of posterity, let it be documented: caveat emptor!

hyc commented 1 month ago

Just as a sidenote, musl's pthread_mutex_t is also wordsize-dependent, so incompatible between 32 and 64 bit processes. https://git.musl-libc.org/cgit/musl/tree/include/alltypes.h.in

Seems unlikely that anyone could convince them to use a glibc-compatible definition. https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/nptl/bits/struct_mutex.h;h=f2b9b7e04d3843eb8b5936137255856ad2461731;hb=HEAD

The musl pthread_mutex_t field definitions https://git.musl-libc.org/cgit/musl/tree/src/internal/pthread_impl.h#n86

glibc records the lock owner, musl does not. glibc records the lock type in its 4th or 5th int, musl in its first int. musl always has a prev pointer, glibc only has one on 64-bit arches.

Kinda hopeless.

Perhaps you could build a standalone pthread library and make sure it is LD_PRELOADed into every process.

wojas commented 1 month ago

@hyc Thank you for the detailed explanation of why this happens. We encountered this issue ourselves several years ago and avoid it internally by ensuring that we always use the same libc for all processes. This should be added to the Lightning Stream docs.

And thanks for LMDB in general, of course!