rook/ceph OSD repaving incident on 2023-02-22

This is really a post-incident writeup for future reference just in case something similar occurs again.

Background & Motivation

After experiencing some ongoing issues with unreasonable IOWait on some nodes and related slow ceph apply/commit times (2+ seconds), I decided to 'repave' the ceph OSDs with enterprise-grade SSDs and consolidate OSDs down to only nodes with 10GBe network links.

The plan was to use some samsung SM863a's in order to improve performance. The runbook was essentially this:

OSD Replacements

OSD replacements were basically being one one node-at-a-time following this guide from the rook documentation. Everything was progressing well, although the recover operations when taking an old OSD down and when adding a new/replacement OSD took the most time.

Essentially, for a given node/OSD the process was:

Scale-down rook operator (kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=0)
Scale-down OSD deployment (e.g. kubectl -n rook-ceph scale deployment rook-ceph-osd-3 --replicas=0
Mark OSD down in the toolbox (e.g. ceph osd down osd.3)
Mark OSD out in the toolbox (e.g. ceph osd out osd.3)
Wait for ceph cluster status to return healthy (this can take a while)
Purge the OSD from the toolbox (e.g. ceph osd purge 3 --yes-i-really-mean-it)
Drain & cordon node where OSD lives
On the node where the OSD lives, sudo sgdisk -z /dev/sd<n>)
Power-down node where OSD lives, remove drive and if necessary replace it with the new drive; power-back up
On the node with the replacement OSD, identify and wipe/zap new OSD drive (e.g. sudo sgdisk -z /dev/sd<n>)
Uncordon node
Scale-up rook operator (kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=1)

Problems start

I think I must have started to get impatient and by the time I was processing the last node, I did not wait for the rook ceph cluster to become 100% healthy before starting the process to remove the very last OSD (on k3s-a). I did also notice that OSD 0 was marked 'down' right around this time. I can't find it in my tmux history after-the-fact, but maybe I did the Scale-down OSD deployment step against OSD 0 instead of OSD 3.

Once this occured, the rook ceph cluster started reporting an angry red status of HEALTH_ERR with the following health details,

HEALTH_ERR 1/3 mons down, quorum e,g; 2/184703 objects unfound (0.001%); Reduced data availability: 1 pg inactive; Possible data damage: 1 pg recovery_unfound; Degraded data redundancy: 34241/554109 objects degraded (6.179%), 19 pgs degraded, 19 pgs under
sized
[WRN] MON_DOWN: 1/3 mons down, quorum e,g
    mon.h (rank 2) addr [v2:10.43.230.31:3300/0,v1:10.43.230.31:6789/0] is down (out of quorum)
[WRN] OBJECT_UNFOUND: 2/184703 objects unfound (0.001%)
    pg 10.12 has 2 unfound objects
[WRN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive
    pg 2.2 is stuck inactive for 7m, current state undersized+degraded+remapped+backfilling+peered, last acting [4]
[ERR] PG_DAMAGED: Possible data damage: 1 pg recovery_unfound
    pg 10.12 is active+recovery_unfound+undersized+degraded+remapped, acting [4,NONE,0], 2 unfound
[WRN] PG_DEGRADED: Degraded data redundancy: 34241/554109 objects degraded (6.179%), 19 pgs degraded, 19 pgs undersized
    pg 2.2 is stuck undersized for 7m, current state undersized+degraded+remapped+backfilling+peered, last acting [4]
    pg 2.4 is stuck undersized for 7m, current state active+undersized+degraded+remapped+backfilling, last acting [0,2]
    pg 2.5 is stuck undersized for 7m, current state active+undersized+degraded+remapped+backfilling, last acting [2,4]
    pg 2.9 is stuck undersized for 7m, current state active+undersized+degraded+remapped+backfilling, last acting [0,4]
    pg 2.a is stuck undersized for 7m, current state active+undersized+degraded+remapped+backfilling, last acting [2,4]
    pg 2.d is stuck undersized for 7m, current state active+undersized+degraded+remapped+backfilling, last acting [2,0]
    pg 2.e is stuck undersized for 7m, current state active+undersized+degraded+remapped+backfilling, last acting [0,2]
    pg 2.10 is stuck undersized for 7m, current state active+undersized+degraded+remapped+backfilling, last acting [2,4]
    pg 2.12 is stuck undersized for 7m, current state active+undersized+degraded+remapped+backfilling, last acting [4,0]
    pg 2.18 is stuck undersized for 7m, current state active+undersized+degraded+remapped+backfilling, last acting [2,4]
    pg 2.19 is stuck undersized for 7m, current state active+undersized+degraded+remapped+backfilling, last acting [0,4]
    pg 2.1e is stuck undersized for 7m, current state active+undersized+degraded+remapped+backfilling, last acting [2,4]
    pg 2.1f is stuck undersized for 7m, current state active+undersized+degraded+remapped+backfilling, last acting [0,2]
    pg 10.4 is stuck undersized for 7m, current state active+undersized+degraded+remapped+backfilling, last acting [2,0,NONE]
    pg 10.a is stuck undersized for 7m, current state active+undersized+degraded+remapped+backfilling, last acting [4,NONE,0]
    pg 10.d is stuck undersized for 7m, current state active+undersized+degraded+remapped+backfilling, last acting [0,NONE,4]
    pg 10.12 is stuck undersized for 7m, current state active+recovery_unfound+undersized+degraded+remapped, last acting [4,NONE,0]
    pg 10.1d is stuck undersized for 7m, current state active+undersized+degraded+remapped+backfilling, last acting [0,NONE,4]
    pg 10.1e is stuck undersized for 7m, current state active+undersized+degraded+remapped+backfilling, last acting [2,NONE,4]

The key part is the message about [ERR] PG_DAMAGED: Possible data damage: 1 pg recovery_unfound

Failures

Examining the ceph error more directly, the pg with problems was 10.12. Running ceph pg 10.12 list_unfound revealed the following interesting info:

    "recovery_state": [                                                                                                                                                                                                                            [9300/10449]
        {
            "name": "Started/Primary/Active",
            "enter_time": "2023-02-23T04:48:34.548372+0000",
            "might_have_unfound": [
                {
                    "osd": "0(2)",
                    "status": "already probed"
                },
                {
                    "osd": "1(1)",
                    "status": "already probed"
                },
                {
                    "osd": "2(1)",
                    "status": "already probed"
                },
                {
                    "osd": "3(1)",
                    "status": "osd is down"
                }
            ],
            "recovery_progress": {
                "backfill_targets": [
                    "1(1)"
                ],
                "waiting_on_backfill": [],
                "last_backfill_started": "MIN",
                "backfill_info": {
                    "begin": "MIN",
                    "end": "MIN",
                    "objects": []
                },
                "peer_backfill_info": [],
                "backfills_in_flight": [],
                "recovering": [],
                "pg_backend": {
                    "recovery_ops": [],
                    "read_ops": []
                }
            }
        },
        {
            "name": "Started",
            "enter_time": "2023-02-23T04:48:34.524475+0000"
        }
    ],

OSD 3 is the one that I already purged and removed, so that data was never going to come back.

I tried to instruct ceph to repair (ceph pg repair 10.12) and deep-scrub (ceph pg deep-scrub 10.12) the affected pg but nothing seemed to improve or change.

Around this time I noticed some things starting to fail in the cluster:

volsync jobs stopped working
grafana and teslamate stopped working (issues talking to postgres)
most rook ceph-block based workloads started to fail or become unresponsive
thanos and loki started to degrade and fail (using rook object storage)

Essentially, it looks like rook/ceph storage becomes read-only or even unusable when the cluster is in this state, or at least in the HEALTH_ERR state.

Cloudnative-PG issues

There were also issues with it properly evicting from a node during cordon because of a pod disruption budget issue. Not sure how you are supposed to properly reschedule it to a new node during maintenance.

The postgres cluster (using ceph storage) degraded and failed. It was perpetually stuck in a terminating state (for over an hour). It struggled to reschedule to a new node and complained about attaching to a PVC already in-use.

Recovery

Let ceph rebalance OSDs overnight to see if it would eventually repair itself. It did not.

In the morning, I took the following steps to attempt a recovery:

tried repair & deep scrub again - no change
tried revert operation (ceph pg 10.12 mark_unfound_lost revert but this failed with Error EINVAL: mode must be 'delete' for ec pool)
tried delete operation (ceph pg 10.12 mark_unfound_lost delete)

... As soon as the delete operation completed, the cluster got out of error state and everything started to recover on their own on the kubernetes side. Ceph also started a long-running backfill operation to handle the deleted object.

Cloudnative-PG recovery

Postgres (cloudnative-pg) was still not recovered, unfortunately. It complained about a multi-attach warning still for a given ceph object.

Digging a bit revealed that there was a 'stuck' rbd that needed to be unmapped. It was on the k3s-c node and I drained and rebooted that node. After the reboot, the stuck RBD cleared. This did not help postgres which continued to not start properly.

Decided to restore postgres from a backup which was taken shortly before the ceph errors. Process I followed was:

scale-down cloudnative-pg operator to 0 (k scale deployment cloudnative-pg --replicas=0)
delete postgres cluster (k delete cluster postgres-v15)
On the S3 side where the backups are stored, mv postgres-v15 postgres-backup
scale-up cloudnative-pg (k scale deployment cloudnative-pg --replicas=1)
wait for cluster to recover from backup via this definition:

  bootstrap:
    # use this to recover a net-new cluster from a backup
    recovery:
      source: postgres-backup

  externalClusters:
    # this represents the s3 backup to restore from. *nota-bene: the backup must be the same major version of the target cluster
    - name: postgres-backup
      barmanObjectStore:
        wal:
          compression: bzip2
          maxParallel: 8
        destinationPath: s3://postgresql/
        endpointURL: http://truenas.home:9000
        s3Credentials:
          accessKeyId:
            name: cloudnative-pg-secret
            key: aws-access-key-id
          secretAccessKey:
            name: cloudnative-pg-secret
            key: aws-secret-access-key

... The cluster restored from backup without issue.

Conclusion

In all, I did not observe any data loss (that I know of) but was prepared to burn it all down and restore all volumes from backup if necessary. Going forward, I will be very judicious about waiting for ceph cluster to be completely healthy before starting new operations, and double-checking commands.

billimek / k8s-gitops