This is a post-incident writeup, kept for future reference in case something similar occurs again.
Background & Motivation
After experiencing ongoing issues with unreasonable IOWait on some nodes and correspondingly slow ceph apply/commit times (2+ seconds), I decided to 'repave' the ceph OSDs with enterprise-grade SSDs and consolidate OSDs down to only the nodes with 10GbE network links.
The plan was to use some Samsung SM863a drives to improve performance. The runbook was essentially this:
OSD Replacements
OSD replacements were done one node at a time, following this guide from the rook documentation. Everything was progressing well, although the recovery operations when taking an old OSD down and when adding a new/replacement OSD took the most time.
Essentially, for a given node/OSD (using osd.3 as the example) the process was:
kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=0
kubectl -n rook-ceph scale deployment rook-ceph-osd-3 --replicas=0
ceph osd down osd.3
ceph osd out osd.3
ceph osd purge 3 --yes-i-really-mean-it
sudo sgdisk -z /dev/sd<n>
kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=1
Problems start
I think I must have started to get impatient: by the time I was processing the last node, I did not wait for the rook ceph cluster to become 100% healthy before starting to remove the very last OSD (on k3s-a). I also noticed that OSD 0 was marked 'down' right around this time. I can't find it in my tmux history after the fact, but I may have run the scale-down OSD deployment step against OSD 0 instead of OSD 3.
Once this occurred, the rook ceph cluster started reporting an angry red status of HEALTH_ERR with the following health details:
HEALTH_ERR 1/3 mons down, quorum e,g; 2/184703 objects unfound (0.001%); Reduced data availability: 1 pg inactive; Possible data damage: 1 pg recovery_unfound; Degraded data redundancy: 34241/554109 objects degraded (6.179%), 19 pgs degraded, 19 pgs undersized
[WRN] MON_DOWN: 1/3 mons down, quorum e,g
mon.h (rank 2) addr [v2:10.43.230.31:3300/0,v1:10.43.230.31:6789/0] is down (out of quorum)
[WRN] OBJECT_UNFOUND: 2/184703 objects unfound (0.001%)
pg 10.12 has 2 unfound objects
[WRN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive
pg 2.2 is stuck inactive for 7m, current state undersized+degraded+remapped+backfilling+peered, last acting [4]
[ERR] PG_DAMAGED: Possible data damage: 1 pg recovery_unfound
pg 10.12 is active+recovery_unfound+undersized+degraded+remapped, acting [4,NONE,0], 2 unfound
[WRN] PG_DEGRADED: Degraded data redundancy: 34241/554109 objects degraded (6.179%), 19 pgs degraded, 19 pgs undersized
pg 2.2 is stuck undersized for 7m, current state undersized+degraded+remapped+backfilling+peered, last acting [4]
pg 2.4 is stuck undersized for 7m, current state active+undersized+degraded+remapped+backfilling, last acting [0,2]
pg 2.5 is stuck undersized for 7m, current state active+undersized+degraded+remapped+backfilling, last acting [2,4]
pg 2.9 is stuck undersized for 7m, current state active+undersized+degraded+remapped+backfilling, last acting [0,4]
pg 2.a is stuck undersized for 7m, current state active+undersized+degraded+remapped+backfilling, last acting [2,4]
pg 2.d is stuck undersized for 7m, current state active+undersized+degraded+remapped+backfilling, last acting [2,0]
pg 2.e is stuck undersized for 7m, current state active+undersized+degraded+remapped+backfilling, last acting [0,2]
pg 2.10 is stuck undersized for 7m, current state active+undersized+degraded+remapped+backfilling, last acting [2,4]
pg 2.12 is stuck undersized for 7m, current state active+undersized+degraded+remapped+backfilling, last acting [4,0]
pg 2.18 is stuck undersized for 7m, current state active+undersized+degraded+remapped+backfilling, last acting [2,4]
pg 2.19 is stuck undersized for 7m, current state active+undersized+degraded+remapped+backfilling, last acting [0,4]
pg 2.1e is stuck undersized for 7m, current state active+undersized+degraded+remapped+backfilling, last acting [2,4]
pg 2.1f is stuck undersized for 7m, current state active+undersized+degraded+remapped+backfilling, last acting [0,2]
pg 10.4 is stuck undersized for 7m, current state active+undersized+degraded+remapped+backfilling, last acting [2,0,NONE]
pg 10.a is stuck undersized for 7m, current state active+undersized+degraded+remapped+backfilling, last acting [4,NONE,0]
pg 10.d is stuck undersized for 7m, current state active+undersized+degraded+remapped+backfilling, last acting [0,NONE,4]
pg 10.12 is stuck undersized for 7m, current state active+recovery_unfound+undersized+degraded+remapped, last acting [4,NONE,0]
pg 10.1d is stuck undersized for 7m, current state active+undersized+degraded+remapped+backfilling, last acting [0,NONE,4]
pg 10.1e is stuck undersized for 7m, current state active+undersized+degraded+remapped+backfilling, last acting [2,NONE,4]
The key part is the message: [ERR] PG_DAMAGED: Possible data damage: 1 pg recovery_unfound.
Failures
Examining the ceph error more directly, the problem pg was 10.12. Running ceph pg 10.12 list_unfound revealed the interesting detail: the unfound objects had been on OSD 3, the one I had already purged and removed, so that data was never going to come back.
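The list_unfound output itself didn't survive my scrollback, but it is JSON and easy to inspect programmatically. A minimal sketch for summarizing it (the field layout here is from memory and may differ between ceph releases; the object name in the example is made up):

```python
import json

def summarize_unfound(report: dict) -> list[str]:
    """Turn parsed `ceph pg <pgid> list_unfound --format json` output into
    one readable line per unfound object."""
    lines = []
    for obj in report.get("objects", []):
        oid = obj["oid"]["oid"]
        locations = obj.get("locations", [])
        lines.append(f"{oid}: candidate locations {locations or 'none'}")
    return lines

# example report shaped like the real thing (object name is hypothetical)
sample = json.loads("""
{
  "num_missing": 2,
  "num_unfound": 2,
  "objects": [
    {"oid": {"oid": "rbd_data.abc123.0000000000000042"}, "locations": []}
  ],
  "more": false
}
""")
print(summarize_unfound(sample))
```

An empty `locations` list is the bad case: ceph has no surviving candidate OSD for the object, which matches what I saw after purging OSD 3.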
I tried to instruct ceph to repair (ceph pg repair 10.12) and deep-scrub (ceph pg deep-scrub 10.12) the affected pg, but nothing seemed to improve or change.
Around this time I noticed some things starting to fail in the cluster:
volsync jobs stopped working
grafana and teslamate stopped working (issues talking to postgres)
most rook ceph-block based workloads started to fail or become unresponsive
thanos and loki started to degrade and fail (using rook object storage)
Essentially, it looks like rook/ceph storage becomes read-only or even unusable when the cluster is in this state, or at least in the HEALTH_ERR state.
Cloudnative-PG issues
There were also issues with cloudnative-pg properly evicting its pods from a node during cordon, because of a pod disruption budget conflict. I'm not sure how you are supposed to properly reschedule it to a new node during maintenance.
The postgres cluster (using ceph storage) degraded and failed. It was perpetually stuck in a terminating state (for over an hour).
It struggled to reschedule to a new node and complained about attaching to a PVC already in-use.
Recovery
I let ceph rebalance the OSDs overnight to see if the cluster would eventually repair itself. It did not.
In the morning, I took the following steps to attempt a recovery:
tried repair & deep-scrub again - no change
tried the revert operation (ceph pg 10.12 mark_unfound_lost revert), but this failed with Error EINVAL: mode must be 'delete' for ec pool
ran ceph pg 10.12 mark_unfound_lost delete, accepting the loss of the two unfound objects
As soon as the delete operation completed, the cluster came out of the error state and everything on the kubernetes side started to recover on its own. Ceph also started a long-running backfill operation to handle the deleted objects.
Cloudnative-PG recovery
Postgres (cloudnative-pg) was still not recovered, unfortunately. It was still complaining about a multi-attach warning for a given ceph volume.
Digging a bit revealed a 'stuck' rbd mapping that needed to be unmapped. It was on the k3s-c node, so I drained and rebooted that node. After the reboot, the stuck RBD cleared. This did not help postgres, which still failed to start properly.
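Finding the stuck mapping by hand was the slow part. Something like this could shortlist candidates per node (a sketch; it assumes the JSON shape of `rbd device list --format json`, and it only prints suggestions rather than unmapping anything):

```python
import json
import subprocess

def parse_mappings(raw: bytes) -> list[dict]:
    """Reduce `rbd device list --format json` output to pool/image/device."""
    return [
        {"pool": m["pool"], "image": m["name"], "device": m["device"]}
        for m in json.loads(raw)
    ]

def mapped_devices() -> list[dict]:
    """List kernel RBD mappings on the current node."""
    raw = subprocess.check_output(["rbd", "device", "list", "--format", "json"])
    return parse_mappings(raw)

# Any mapping with no pod still using it is a candidate for:
#   rbd unmap <device>        (add -o force if it refuses)
```

In this incident a drain-and-reboot of k3s-c cleared the mapping anyway, but cross-referencing this list against running pods would have been faster than a reboot.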
Decided to restore postgres from a backup taken shortly before the ceph errors. The process I followed was:
scale down the cloudnative-pg operator (k scale deployment cloudnative-pg --replicas=0)
delete the broken cluster (k delete cluster postgres-v15)
mv postgres-v15 postgres-backup
scale the operator back up (k scale deployment cloudnative-pg --replicas=1)
wait for the cluster to recover from the backup via this definition:
bootstrap:
  # use this to recover a net-new cluster from a backup
  recovery:
    source: postgres-backup
externalClusters:
  # this represents the s3 backup to restore from. *nota-bene: the backup must be the same major version of the target cluster
  - name: postgres-backup
    barmanObjectStore:
      wal:
        compression: bzip2
        maxParallel: 8
      destinationPath: s3://postgresql/
      endpointURL: http://truenas.home:9000
      s3Credentials:
        accessKeyId:
          name: cloudnative-pg-secret
          key: aws-access-key-id
        secretAccessKey:
          name: cloudnative-pg-secret
          key: aws-secret-access-key
The cluster restored from backup without issue.
Conclusion
In all, I did not observe any data loss (that I know of), but I was prepared to burn it all down and restore all volumes from backup if necessary. Going forward, I will be very judicious about waiting for the ceph cluster to be completely healthy before starting new operations, and about double-checking commands.
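The "wait for healthy" rule is also scriptable, which removes the impatience factor entirely. A minimal guard to run between OSD swaps (a sketch: it assumes the standard rook-ceph-tools deployment is present, and the helper names and polling interval are my own):

```python
import json
import subprocess
import time

def health_status(status_json: bytes) -> str:
    """Extract "HEALTH_OK" / "HEALTH_WARN" / "HEALTH_ERR" from
    `ceph status --format json` output."""
    return json.loads(status_json)["health"]["status"]

def wait_for_health_ok(interval: int = 30) -> None:
    """Block until the cluster reports HEALTH_OK; call between OSD swaps."""
    while True:
        raw = subprocess.check_output([
            "kubectl", "-n", "rook-ceph", "exec", "deploy/rook-ceph-tools",
            "--", "ceph", "status", "--format", "json",
        ])
        if health_status(raw) == "HEALTH_OK":
            return
        print("cluster not healthy yet; sleeping...")
        time.sleep(interval)
```

Gating each node's runbook on wait_for_health_ok() would have forced the pause I skipped on k3s-a.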