mschirrmeister closed this issue 1 month ago.
Hello @mschirrmeister, thank you for describing the intriguing phenomenon and its workaround.
As it appears to have played out on a complex setup with many moving parts, we would be stoked if some variables could be excluded/isolated, so as to cut down the search space containing all probable causes.
Since the use of Longhorn introduces many abstractions and interactions, whose effects on LMDB are yet to be determined, we are curious to learn whether you could reproduce the issue on a simpler setup, sans Longhorn.
Any information you could provide about the following could also turn out to be valuable:
- the `fsType` of the relevant StorageClass

(We note that you may be using the Dockerfile in your public repository mentioning the same issue.)
Also highly interesting is the PID-clash hypothesis, considering that LMDB locking appears to be keyed on PIDs. Could you please verify that the Lightning Stream process started with the `--minimum-pid` option indeed does not share a process ID with the Authoritative Nameserver?
This article about having containers in a Pod share a process namespace presents an alternative means of resolving the PID conflict that might be worth an attempt.
Finally, would you mind sharing your thoughts on the following bullet points under the Caveats section of the LMDB docs?
- Do not use LMDB databases on remote filesystems, even between processes on the same host. This breaks flock() on some OSes, possibly memory map sync, and certainly sync between programs on different hosts.
- Opening a database can fail if another process is opening or closing it at exactly the same time.
We look forward to hearing your response.
Note that `create: false` is used in Lightning Stream, which means that it expects the LMDB to be created by pdns. The query you perform against pdns may be the trigger that creates this LMDB.
Suggestions:
Additionally, it is essential that Auth and LS run on the same node with a local filesystem for the LMDB, ideally tmpfs. In a cloud deployment it is S3 that provides your persistent data store, not the LMDB.
Hello @joel-ling @wojas,
thanks for the answers.
@joel-ling, I will try to reproduce it with something simpler than Longhorn. I am thinking about hostPath/local storage. Simple NFS would be another option, but on the other hand, in Longhorn a ReadWriteMany volume is already mounted via NFS into the containers.
The filesystem is `ext4`. I will add the YAML files for all the objects in a separate comment.
Yes, I use an image based on the Dockerfile in my repo. I created that since I did not find anything else for ARM.
The process IDs are definitely different. Even without `--minimum-pid` it was typically higher, since I executed a few commands before Lightning Stream was started. See below.
marco@loop ~> kubectl exec -it -n pdns pdns-ss-0 -c pdns-ss -- ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
pdns 1 0.0 0.0 1944 424 ? Ss May13 0:11 /usr/bin/tini -- /usr/local/sbin/pdns_server-startup
pdns 7 0.1 1.6 18256748 63740 ? SLl May13 7:03 /usr/local/sbin/pdns_server --disable-syslog
root 17337 0.0 0.0 3952 3112 pts/0 Ss+ 16:50 0:00 /bin/bash
pdns 18470 0.0 0.0 6440 2492 pts/1 Rs+ 16:51 0:00 ps aux
marco@loop ~> kubectl exec -it -n pdns pdns-ss-0 -c pdns-lightningstream -- ps aux
PID USER TIME COMMAND
1 953 0:00 {start.sh} /bin/sh /app/start.sh
13 953 0:00 {ld-musl-aarch64} ld-linux-aarch64.so.1 --argv0 /app/light
208 953 3:43 {ld-musl-aarch64} ld-linux-aarch64.so.1 --argv0 /app/light
262 953 0:00 sh
291 953 0:00 ps aux
From the host's perspective:
root@k3stw3 ~# crictl ps | grep pdns-ss-0
9c56b26494970 eee647f7298ef 32 hours ago Running pdns-lightningstream 86 212031ffaf8cb pdns-ss-0
2927017d5aa67 919d34f6ff451 2 days ago Running pdns-ss 0 212031ffaf8cb pdns-ss-0
root@k3stw3 ~# crictl inspect --output go-template --template '{{.info.pid}}' 2927017d5aa67
2164914
root@k3stw3 ~# crictl inspect --output go-template --template '{{.info.pid}}' 9c56b26494970
1337897
I tried out the `shareProcessNamespace: true` option, but that does not help. It still hangs. The PIDs look like this when it is enabled.
marco@loop ~> kubectl exec -it -n pdns pdns-ss-0 -c pdns-lightningstream -- ps aux
PID USER TIME COMMAND
1 953 0:00 /pause
13 953 0:00 /usr/bin/tini -- /usr/local/sbin/pdns_server-startup
20 953 0:00 /usr/local/sbin/pdns_server --disable-syslog
21 953 0:00 {ld-musl-aarch64} ld-linux-aarch64.so.1 --argv0 /app/light
213 953 0:00 {ld-musl-aarch64} ld-linux-aarch64.so.1 --argv0 /app/light
235 953 0:00 ps aux
My take on these two items is that the first one (remote fs) is probably the problem. As mentioned above, Longhorn does RWX via NFS.
- Do not use LMDB databases on remote filesystems, even between processes on the same host. This breaks flock() on some OSes, possibly memory map sync, and certainly sync between programs on different hosts.
- Opening a database can fail if another process is opening or closing it at exactly the same time.
Item number 2 should not be the problem, because I tested with a long delay: a sleep of 60 seconds before Lightning Stream was started. If there was no request to PDNS during the delay, it hung. If you send a query to PDNS during the delay, all is fine, which is basically the workaround.
As for the first item: it might be the problem, but on the other hand, you can get it to work with this specific access pattern. Locking or no locking seems to be fine, whatever LMDB does.
I saw the `flock` option in the NFS manpage. You can set `local_lock` in NFS to a few values; this option is set to `none`. I tried to change it, but it did not take effect. Unfortunately, I then found out that you cannot change NFS options (apart from a few), as they are hard coded in the Longhorn manager. See the mount options discussion.
Testing with another NFS storage might be worth it, even if a remote fs should not be used. I would of course not use this for a real production setup, but locally or for dev environments, the options for persistent storage volumes are limited.
@wojas, `create: false` is there since I started the setup with PDNS Auth alone. I had it set to `true` first, but also wanted to see if `false` makes a difference for my issue.
The query does not create the files. From what I have seen, PDNS creates the LMDB files when it starts. At least the main file.
Regarding your suggestions: permissions are OK; I have defined permissions in `securityContext`, which all containers in the pod use.
Creating LMDB in an initContainer should not be needed in the current setup, since the files are persistent. Even if the StatefulSet or all objects in Kubernetes are deleted, the volume within Longhorn stays and will be re-used for the next pod.
The StatefulSet takes care that PDNS Auth and Lightning Stream are running on the same node. I did not try tmpfs yet, because the docs mention that with the `emptyDir` volume option all the data gets deleted when the pod is removed. Right now I wanted it to be able to run even if Lightning Stream or S3 is not available.
But your point that S3 is, or can be treated as, the persistent datastore sounds interesting. I did not think about that yet. I will try it out.
Here are my files and some context.
I use a StatefulSet so that all containers in a pod are scheduled on the same node and so that each pod, if scaled up, is scheduled on a different node.
I also wanted to use existing disks for the pods, since I want the data to survive if pods get deleted.
Therefore I use `volumeClaimTemplates` in the StatefulSet, which gives predictable names for the `PersistentVolumeClaims`. The 3 `PersistentVolumes` reference these names in `claimRef`.
In Longhorn I created 3 volumes, `pdns-ss-data-0`, `pdns-ss-data-1` and `pdns-ss-data-2`, which are referenced in the PVs under `volumeHandle`.
With this config, you can apply, delete, or scale the StatefulSet up/down (up to 3 replicas) and it will always use the same disks. I think that is typically how it is done for databases.
storageclass-nfs.yaml
# This storage class creates a durable pv for RWX volumes
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: longhorn-nfs
provisioner: driver.longhorn.io
reclaimPolicy: Retain
volumeBindingMode: Immediate
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "30"
  fromBackup: ""
  fsType: "ext4"
  dataLocality: "disabled"
  nfsOptions: "vers=4.2,noresvport,intr,hard,softerr,timeo=600,retrans=5,local_lock=flock"
pdns-auth-configmap-202405041.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config-pdns-auth-202405041
  namespace: pdns
data:
  pdns.conf: |
    primary=yes
    allow-notify-from=0.0.0.0
    allow-axfr-ips=127.0.0.1
    api=yes
    api-key=secret
    config-dir=/etc/powerdns
    default-soa-content=a.misconfigured.dns.server.invalid hostmaster.@ 0 10800 3600 604800 3600
    default-ttl=3600
    default-ksk-algorithm=ed25519
    default-zsk-algorithm=ed25519
    include-dir=/etc/powerdns/pdns.d
    load-modules=liblmdbbackend.so
    launch=lmdb
    lmdb-filename=/var/lib/powerdns/pdns.lmdb
    lmdb-shards=1
    lmdb-sync-mode=nometasync
    lmdb-schema-version=5
    lmdb-random-ids=yes
    lmdb-map-size=1000
    lmdb-flag-deleted=yes
    lmdb-lightning-stream=yes
    # launch=gsqlite3
    # gsqlite3-database=/var/lib/powerdns/pdns.sqlite3
    local-address=0.0.0.0,::
    log-dns-details=yes
    log-dns-queries=yes
    log-timestamp=yes
    # 0 = emergency, 1 = alert, 2 = critical, 3 = error, 4 = warning, 5 = notice, 6 = info, 7 = debug
    loglevel=4
    loglevel-show=yes
    query-logging=yes
    # resolver=blocky.blocky.svc.cluster.local
    # resolver=10.43.226.25
    resolver=1.1.1.1
    # server-id
    version-string=anonymous
    webserver=yes
    webserver-allow-from=127.0.0.1,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,192.0.0.0/24
    webserver-hash-plaintext-credentials=yes
    # one of "none", "normal", "detailed"
    webserver-loglevel=normal
    webserver-address=0.0.0.0
    webserver-port=8081
    webserver-password=secret2
    zone-cache-refresh-interval=0
    zone-metadata-cache-ttl=0
pdns-lightningstream-configmap-202405102.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config-lightningstream-202405102
  namespace: pdns
data:
  lightningstream.yaml: |
    instance: lmdbsync
    storage_poll_interval: 10s
    lmdb_poll_interval: 10s
    storage_force_snapshot_interval: 4h
    lmdbs:
      main:
        path: /var/lib/powerdns/pdns.lmdb
        schema_tracks_changes: true
        options:
          no_subdir: true
          create: false
      shard:
        path: /var/lib/powerdns/pdns.lmdb-0
        schema_tracks_changes: true
        options:
          no_subdir: true
          create: false
    storage:
      type: s3
      options:
        access_key: pdns
        secret_key: pdns#112
        bucket: lightningstream
        endpoint_url: http://minio.minio.svc.cluster.local:9000
        create_bucket: true
      cleanup:
        enabled: true
        interval: 15m
        must_keep_interval: 24h
        remove_old_instances_interval: 168h
    http:
      address: ":8500"
    log:
      level: info
      format: human
      timestamp: short
pdns-lightningstream-start-configmap-202405111.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config-lightningstream-start-202405111
  namespace: pdns
data:
  start.sh: |
    #!/bin/sh
    if [ ${PDNS_LSTREAM_SLEEP}_ == _ ]; then
      echo "no sleep time set"
    else
      sleep ${PDNS_LSTREAM_SLEEP}
    fi
    if [ ${PDNS_LSTREAM_DNS_SERVER}_ == _ ]; then
      echo "no DNS server set"
    else
      eval DNS=\$$PDNS_LSTREAM_DNS_SERVER
      echo "ServiceIP: $DNS"
      echo "PodIP: $PDNS_POD_IP"
      dig @$PDNS_POD_IP $PDNS_LSTREAM_DOMAIN $PDNS_QUERY_TYPE
    fi
    # seq 210 | xargs -Iz echo "Generating pids. Count z"
    /app/lightningstream --config /lightningstream.yaml --minimum-pid 200 --instance ${HOSTNAME}-lstream sync
pdns-statefulset-nfs-all.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: pdns
  labels:
    name: pdns
---
kind: PersistentVolume
apiVersion: v1
metadata:
  name: pdns-ss-pv-data-nfs-0
spec:
  persistentVolumeReclaimPolicy: Retain
  capacity:
    storage: 5Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  claimRef:
    namespace: pdns
    name: pdns-ss-vol-nfs-pdns-ss-0
  csi:
    driver: driver.longhorn.io
    fsType: ext4
    volumeHandle: pdns-ss-data-0
  storageClassName: longhorn-nfs
---
kind: PersistentVolume
apiVersion: v1
metadata:
  name: pdns-ss-pv-data-nfs-1
spec:
  persistentVolumeReclaimPolicy: Retain
  capacity:
    storage: 5Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  claimRef:
    namespace: pdns
    name: pdns-ss-vol-nfs-pdns-ss-1
  csi:
    driver: driver.longhorn.io
    fsType: ext4
    volumeHandle: pdns-ss-data-1
  storageClassName: longhorn-nfs
---
kind: PersistentVolume
apiVersion: v1
metadata:
  name: pdns-ss-pv-data-nfs-2
spec:
  persistentVolumeReclaimPolicy: Retain
  capacity:
    storage: 5Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  claimRef:
    namespace: pdns
    name: pdns-ss-vol-nfs-pdns-ss-2
  csi:
    driver: driver.longhorn.io
    fsType: ext4
    volumeHandle: pdns-ss-data-2
  storageClassName: longhorn-nfs
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: pdns-ss
  namespace: pdns
  labels:
    app.kubernetes.io/name: pdns-ss
spec:
  persistentVolumeClaimRetentionPolicy:
    whenScaled: Retain
    whenDeleted: Delete
  replicas: 1
  selector:
    matchLabels:
      app: pdns-ss
  serviceName: pdns-ss
  template:
    metadata:
      labels:
        app: pdns-ss
        app.kubernetes.io/name: pdns-ss
    spec:
      securityContext:
        fsGroup: 953
        runAsUser: 953
        runAsGroup: 953
      containers:
        - name: pdns-ss
          # image: powerdns/pdns-auth-49:4.9.0
          image: mschirrmeister/pdns-auth:master
          env:
            - name: TZ
              value: "Europe/Berlin"
          ports:
            - containerPort: 53
              protocol: TCP
            - containerPort: 53
              protocol: UDP
            - containerPort: 8081
              protocol: TCP
          volumeMounts:
            - name: config
              mountPath: "/etc/powerdns/pdns.conf"
              subPath: pdns.conf
            - name: resolver-config
              mountPath: "/etc/powerdns/pdns.d"
            - name: pdns-ss-vol-nfs
              mountPath: /var/lib/powerdns
        - name: pdns-lightningstream
          image: mschirrmeister/powerdns-lightningstream:v0.4.3
          env:
            - name: PDNS_LSTREAM_DNS_SERVER
              value: PDNS_SS_SERVICE_HOST
            - name: PDNS_LSTREAM_SLEEP
              value: "1"
            - name: PDNS_LSTREAM_DOMAIN
              value: "dummy.zone"
            - name: PDNS_QUERY_TYPE
              value: SOA
            - name: PDNS_POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
          command: [ "/app/start.sh" ]
          volumeMounts:
            - name: pdns-ss-vol-nfs
              mountPath: /var/lib/powerdns
              readOnly: false
            - name: config-lightningstream
              mountPath: "/lightningstream.yaml"
              subPath: lightningstream.yaml
            - name: config-lightningstream-start
              mountPath: "/app/start.sh"
              subPath: start.sh
        # - name: debug-container
        #   image: alpine:latest
        #   imagePullPolicy: Always
        #   args: ["tail", "-f", "/dev/null"]
        #   volumeMounts:
        #     - name: pdns-ss-vol
        #       mountPath: /var/lib/powerdns
        #       # readOnly: true
      restartPolicy: Always
      volumes:
        - name: config
          configMap:
            name: app-config-pdns-auth-202405041
            items:
              - key: "pdns.conf"
                path: "pdns.conf"
        - name: config-lightningstream
          configMap:
            name: app-config-lightningstream-202405102
            items:
              - key: "lightningstream.yaml"
                path: "lightningstream.yaml"
        - name: config-lightningstream-start
          configMap:
            name: app-config-lightningstream-start-202405111
            items:
              - key: "start.sh"
                path: "start.sh"
            defaultMode: 0755
        - name: resolver-config
          emptyDir: {}
      initContainers:
        - name: pdns-resolver-config
          image: jonlabelle/network-tools
          command: [ "sh" ]
          # args: [ "-c", "echo resolver=`dig blocky.blocky.svc.cluster.local +short` > /resolver-config/resolver.conf" ]
          args: [ "-c", "echo" ]
          volumeMounts:
            - name: resolver-config
              mountPath: /resolver-config
  volumeClaimTemplates:
    - metadata:
        name: pdns-ss-vol-nfs
      spec:
        accessModes:
          - ReadWriteMany
        resources:
          requests:
            storage: 1Gi
        storageClassName: longhorn-nfs
---
apiVersion: v1
kind: Service
metadata:
  name: pdns-ss
  namespace: pdns
  annotations:
    metallb.universe.tf/address-pool: second-pool
    pdns.prometheus/scrape: "true"
    pdns.prometheus/scheme: "http"
    pdns.prometheus/path: "/metrics"
    pdns.prometheus/port: "8081"
spec:
  type: LoadBalancer
  externalTrafficPolicy: Cluster
  # externalTrafficPolicy: Local
  selector:
    app: pdns-ss
  ports:
    - name: dnstcp
      protocol: TCP
      port: 53
      targetPort: 53
    - name: dnsudp
      protocol: UDP
      port: 53
      targetPort: 53
    - name: pdnshttp
      protocol: TCP
      port: 8081
      targetPort: 8081
I did some tests with a simple setup. No Longhorn involved. The following scenarios were tested:
- `local` volume
- `emptyDir` volume, stored on disk
- `emptyDir` volume, stored in memory (`tmpfs`)

All 3 show the same symptom/issue.
I think that rules out a remote fs or nfs issue.
My development cluster has kernel 5.15.93 and I thought maybe it is a kernel issue. So I added another node with the latest 6.8.10 kernel to it, but it did not make any difference.
`emptyDir` is not even usable with the workaround, because the data is deleted when the StatefulSet is deleted. Which means a DNS query before LS starts is not possible, since LS has to load data first.
A `local` volume can be used with the workaround. If you start empty, you have to start initially without the LS container to seed/insert some valid data into LMDB, which can be used on the next start.
I have created simple YAML files and recorded all the steps to reproduce it on a single node. I will post this in the next comment. It will be a long one.
ARM64 SBC
Initially tested on a Kubernetes cluster with 3 Rock Pi 4b worker nodes, Debian Bullseye, Kernel 5.15.93-rockchip64 (Armbian)
Reproduction environment (steps below) is Rock 5b, Linux Debian Bookworm, Kernel 6.8.10-edge-rockchip-rk3588 (Armbian)
Login as root
Install `k3sup` to deploy K3s
curl -sLS https://get.k3sup.dev | sh
Install MinIO client
curl -O https://dl.min.io/client/mc/release/linux-arm64/mc
install mc /usr/local/bin/mc
Directories
mkdir -p /mnt/localstorage/minio
mkdir -p /mnt/localstorage/pdns-auth
Install K3s
k3sup install --local --k3s-extra-args "--disable servicelb --disable traefik" --k3s-version v1.29.2+k3s1
kubectl get node -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k3spdnstest.home.marco.cx Ready control-plane,master 27s v1.29.2+k3s1 10.0.3.18 <none> Armbian 24.5.0-trunk.645 bookworm 6.8.10-edge-rockchip-rk3588 containerd://1.7.11-k3s2
PDNS files to reproduce the issue
git clone https://github.com/mschirrmeister/pdns-ls-issue-repro
cd pdns-ls-issue-repro/
Important
Modify the files `minio-pdns-deployment.yaml` and `pdns-statefulset-local-all.yaml` and replace the hostname under `nodeAffinity`.
Get your hostname with one of the following commands.
kubectl get node
kubectl get node -o jsonpath="{.items[0].metadata.labels.kubernetes\.io/hostname}"
Deploy MinIO
kubectl apply -f minio-pdns-deployment.yaml
Configure MinIO client
kubectl get service -n minio -o jsonpath="{.items[0].spec.clusterIP}"
# Use ip from above command
mc alias set pdnstest http://10.43.55.235:9000/ pdns1234 pdns1234
# or in one command
mc alias set pdnstest http://$(kubectl get service -n minio -o jsonpath="{.items[0].spec.clusterIP}"):9000/ pdns1234 pdns1234
Setup user and permissions in MinIO
# access key = pdns, secret key = pdns#112, used in the lightningstream config
mc admin user add pdnstest pdns pdns#112
mc admin group add pdnstest users pdns
# mc admin user info pdnstest pdns
mc admin policy create pdnstest readwriteusers read-write-pdns.json
# mc admin policy ls pdnstest
# mc admin policy info pdnstest readwriteusers
mc admin policy attach pdnstest readwriteusers --group users
# mc admin group info pdnstest users
Deploy PDNS Auth and Lightningstream
kubectl apply -f pdns-ns.yaml
kubectl apply -f pdns-auth-configmap.yaml
kubectl apply -f pdns-lightningstream-configmap.yaml
kubectl apply -f pdns-lightningstream-start-configmap.yaml
kubectl apply -f pdns-statefulset-tmpfs-all.yaml
Check for the pods and containers and wait until they are started
kubectl get pod -A -o wide --watch
Check logs
kubectl logs -f -n pdns pdns-ss-0 pdns-ss
kubectl logs -f -n pdns pdns-ss-0 pdns-lightningstream
Check pdns data directory to verify it is tmpfs
root@k3spdnstest ~# kubectl exec -it -n pdns pdns-ss-0 -c pdns-ss -- mount | grep -Ei '(tmpfs.+powerdns)'
tmpfs on /var/lib/powerdns type tmpfs (rw,relatime,size=204800k)
root@k3spdnstest ~# kubectl exec -it -n pdns pdns-ss-0 -c pdns-lightningstream -- mount | grep -Ei '(tmpfs.+powerdns)'
tmpfs on /var/lib/powerdns type tmpfs (rw,relatime,size=204800k)
Read the data in S3
mc ls --recursive --versions pdnstest/lightningstream
[2024-05-18 19:40:45 CEST] 249B STANDARD null v1 PUT main__lmdbsync__20240518-174045-849810029__G-0000000000000000.pb.gz
Do a dns query
# This query will work, but since the zone does not exist we will get a "status: REFUSED"
dig foobar.baz soa @$(kubectl get service -n pdns -o jsonpath="{.items[0].spec.clusterIP}")
Login to PDNS
kubectl exec -it -n pdns pdns-ss-0 -c pdns-ss -- bash
# This should still work
pdns@pdns-ss-0:/$ pdnsutil list-all-zones
# This will hang
pdns@pdns-ss-0:/$ pdnsutil create-zone foobar.baz
Creating empty zone 'foobar.baz'
^C
Do another dns query (This will hang / timeout)
kubectl get service -n pdns -o jsonpath="{.items[0].spec.clusterIP}"
dig foobar.baz soa @10.43.149.101
# Or with one command
dig foobar.baz soa @$(kubectl get service -n pdns -o jsonpath="{.items[0].spec.clusterIP}")
Notes
At this point, the above command `pdnsutil create-zone foobar.baz` created something in LMDB. The zone is somewhat created (but not correctly), because there is no SOA record like the one defined in the pdns config.
If you redeploy the StatefulSet, dig or pdnsutil will hang on the first run, because there is now data loaded and there was no query. It is not even possible to perform a valid query before LS starts, because LS needs to load the data first.
Try out with the following.
kubectl delete -f pdns-statefulset-tmpfs-all.yaml
kubectl apply -f pdns-statefulset-tmpfs-all.yaml
Both commands will hang
kubectl exec -it -n pdns pdns-ss-0 -c pdns-ss -- pdnsutil list-all-zones
dig foobar.baz soa @$(kubectl get service -n pdns -o jsonpath="{.items[0].spec.clusterIP}")
kubectl delete -f pdns-statefulset-tmpfs-all.yaml
mc rm --recursive pdnstest/lightningstream --force
mc ls --recursive --versions pdnstest/lightningstream
kubectl apply -f pdns-statefulset-tmpfs-all.yaml
Now you can run DNS queries again (like above on the first run), which will of course not return anything, but they don't hang. You can also run `kubectl exec -it -n pdns pdns-ss-0 -c pdns-ss -- pdnsutil list-all-zones` again, which should not hang, but of course will also not return anything.
Workaround setup with local persistent storage across deployments.
Delete the `tmpfs` StatefulSet before you start.
pdns-statefulset-local-all.yaml
kubectl apply -f pdns-statefulset-local-all.yaml
kubectl exec -it -n pdns pdns-ss-0 -c pdns-ss -- pdnsutil create-zone foobar.baz
kubectl delete -f pdns-statefulset-local-all.yaml
pdns-statefulset-local-all.yaml
kubectl apply -f pdns-statefulset-local-all.yaml
kubectl delete -f pdns-statefulset-tmpfs-all.yaml
kubectl delete -f pdns-statefulset-local-all.yaml
kubectl delete -f minio-pdns-deployment.yaml
rm -rf /mnt/localstorage/minio/*
rm -rf /mnt/localstorage/minio/.minio.sys
rm -rf /mnt/localstorage/pdns-auth/*
rm -rf ~/.mc/
k3s-uninstall.sh
I think I found the problem. The problem is the image I built. When I did not find an ARM image in the beginning, I created my own Dockerfile based on one of my other Go Dockerfiles, which used Alpine. After further debugging over the last days, I saw a comment in your Dockerfile about a possible locking issue when using Alpine.
I tested with a Debian based image and that works fine. No idea if there is a way to get it to work with Alpine, besides the workaround query. For now I will switch to a Debian based image.
Good to hear, thank you for your detailed investigation!
LMDB does interesting things with locking; it is strongly recommended to stick to the same libc for multiple processes sharing the same LMDB.
Hello @mschirrmeister, thank you for your detailed report, which we followed closely and were able to reproduce! Observing that `pdnsutil create-zone` and subsequently `pdnsutil list-all-zones` hang on a certain `futex()` system call seems to corroborate the idea that the issue is indeed lock-related.
To prove your hypothesis about Alpine being a cause (refined by @wojas pointing to libc as the crux of the issue), I tried replacing the C library shipped by Alpine with the GNU C Library (albeit a custom build by @sgerrand, to save time), and voila! No more blocking.
Here is the very rough Dockerfile put together for Science, to see if Alpine can get along with Lightning Stream. It resulted in a 43 MB Docker image, slightly larger than your base image at 37 MB. For scale, Alpine weighs about 7 MB while Debian is 117 MB as of the time of writing.
FROM mschirrmeister/powerdns-lightningstream
RUN wget -q -O /etc/apk/keys/sgerrand.rsa.pub https://alpine-pkgs.sgerrand.com/sgerrand.rsa.pub \
&& wget https://github.com/sgerrand/alpine-pkg-glibc/releases/download/2.35-r1/glibc-2.35-r1.apk \
&& apk add --force-overwrite glibc-2.35-r1.apk \
&& rm -rf /var/cache/apk/* glibc-2.35-r1.apk
# Commands copied and pasted from https://github.com/sgerrand/alpine-pkg-glibc?tab=readme-ov-file#installing
(You will most likely need to perform extra steps to obtain build artifacts for `arm64`.)
Perhaps @hyc might like to add another caveat to the LMDB docs about ensuring a compatible C library? 😅
Thank you for your interest in PowerDNS and Lightning Stream, and for your active contribution to this discovery.
Changing the C library from the native musl version to another one compiled on Debian doesn't sound like this problem is fixed. Perhaps the root cause could be related to where your software is being compiled? Is there a version of `pdnsutil` compiled on and for Alpine Linux?
In all this you never got a stack trace of any of the hanging processes? The fact that using glibc avoids the problem points to a bug in musl's pthread_mutex implementation. It would make more sense to report a bug to them than for us to add a caveat to LMDB's docs.
Oh I see, if you're running multiple processes each linked against different libc implementations, yeah, that will cause problems on systems using POSIX mutexes. To do interprocess concurrency with pthread mutexes, the actual pthread_mutex structure lives in the shared memory map. If two different libcs define their pthread_mutex struct differently, then they cannot interoperate.
There is a related issue already in our tracker: https://bugs.openldap.org/show_bug.cgi?id=8582. Even using the same version of glibc, the pthread_mutex struct differs between 32-bit and 64-bit binaries, which is another source of incompatibility.
Thanks to @sgerrand and @hyc for joining the conversation; very happy to have your participation!
It would appear that the proliferation of container technology has lowered the barrier for processes to run on the same machine and share a memory map, all the while linking to different implementations of the C standard library.
This shifts the common denominator between interacting processes one level down from APIs to the underlying data structures and their manipulation, falling just outside the scope of the standards. Who would have thought!
The onus is then on the administrator of a container platform to ensure that programs operating on the same data via operating system APIs do so in a compatible manner. This may in practice translate to running such programs on the same operating system.
Other database management systems that expose APIs over the network and do not share state satisfy the "principles" (or should one say assumptions) of a cloud-native landscape, whereas LMDB (for very good reasons!) does not comply. It may be painfully obvious to those familiar, but for the sake of posterity, let it be documented: caveat emptor!
Just as a sidenote, musl's pthread_mutex_t is also wordsize-dependent, so incompatible between 32 and 64 bit processes. https://git.musl-libc.org/cgit/musl/tree/include/alltypes.h.in
Seems unlikely that anyone could convince them to use a glibc-compatible definition. https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/nptl/bits/struct_mutex.h;h=f2b9b7e04d3843eb8b5936137255856ad2461731;hb=HEAD
The musl pthread_mutex_t field definitions https://git.musl-libc.org/cgit/musl/tree/src/internal/pthread_impl.h#n86
glibc records the lock owner, musl does not. glibc records the lock type in its 4th or 5th int, musl in its first int. musl always has a prev pointer, glibc only has one on 64-bit arches.
Kinda hopeless.
Perhaps you could build a standalone pthread library and make sure it is LD_PRELOADed into every process.
@hyc Thank you for the detailed explanation of why this happens. We avoid it internally by ensuring that we always use the same libc for all processes after encountering this issue several years ago. This should be added to the Lightning Stream docs.
And thanks for LMDB in general, of course!
I was playing with PowerDNS and Lightningstream in Kubernetes and ran into a very weird issue. It is a StatefulSet and storage comes from Longhorn. PowerDNS auth and Lightningstream are both the latest versions and it is running on ARM64.
The issue is that a query to pdns (DNS or REST API) hangs and returns nothing. Executing the `pdnsutil` tool in the pdns container hangs as well; for example, `pdnsutil list-all-zones` hangs and the prompt never returns. Even with debug logging, there is no error. It shows the incoming request and that it is processing, but no answer is returned.
First I thought it was related to the PID clashing issue for containers that is mentioned in the docs. But no matter what I use for `--minimum-pid`, the issue persisted.
The issue goes away if I send a DNS query to PDNS for a valid record/zone before Lightning Stream is started. So the command in my lightningstream container is a shell script, which first sends a DNS query to the pdns container and then starts Lightning Stream.
It still looks somewhat like a locking issue, but the question is: what is different internally in how the pdns auth server accesses the LMDB backend when there was a DNS request to pdns before lightningstream started, versus when there was no request?
It runs stable with no issues so far with the workaround.
Do you have any input on what could be going on here?
Configurations listed below.
My `pdns.conf` looks like this.
My `lightningstream.yaml` looks like this.
My shell script with the workaround: