Closed Jeroen0494 closed 2 years ago
I was able to clear the errors by issuing the following commands from this article: https://serverfault.com/a/846758
jeroen@mediaserver:~$ sudo zpool scrub rpool
jeroen@mediaserver:~$ sudo zpool scrub -s rpool
jeroen@mediaserver:~$ sudo zpool status rpool -v
  pool: rpool
 state: ONLINE
  scan: scrub canceled on Tue May 3 17:56:58 2022
config:

	NAME                                    STATE     READ WRITE CKSUM
	rpool                                   ONLINE       0     0     0
	  5ca738f2-6682-b54e-a259-6dac8cafcbbb  ONLINE       0     0     0

errors: No known data errors
jeroen@mediaserver:~$ sudo zpool scrub rpool
jeroen@mediaserver:~$ sudo zpool status rpool -v
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:24 with 0 errors on Tue May 3 17:57:28 2022
config:

	NAME                                    STATE     READ WRITE CKSUM
	rpool                                   ONLINE       0     0     0
	  5ca738f2-6682-b54e-a259-6dac8cafcbbb  ONLINE       0     0     0

errors: No known data errors
Still, I'd like to know how my datasets got corrupted.
@Jeroen0494 this is probably a ZFS issue - https://github.com/openzfs/zfs/issues/12014 - it could also be a hardware problem with certain SSDs, described here: https://vadosware.io/post/starting-2022-with-a-bang-ceph-on-zfs/#debug-data-corruption-rears-its-head-again
@mtippmann thanks for the reply.
You know what, that actually makes sense. My 2x WD 10TB mirror has also been giving me errors lately, ever since I switched from CentOS 7 to Ubuntu 22.04. Both my root and data datasets are encrypted, and I use zsys and sanoid for snapshots, which is exactly the scenario described in that bug report.
I've read that somebody reported the issues to be resolved since OpenZFS v2.1.4; I'll see if I can update.
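A quick way to check the installed OpenZFS version on Ubuntu and pull in the updated packages, assuming the stock Ubuntu packaging (package names here are an assumption; adjust for your setup):

```shell
# Show the loaded kernel module and userland versions (OpenZFS 0.8+ supports this)
zfs version

# Check whether newer ZFS packages are available
apt list --upgradable 2>/dev/null | grep -i zfs

# Upgrade only the ZFS userland tools (assumed package name for Ubuntu)
sudo apt install --only-upgrade zfsutils-linux
```

On Ubuntu the kernel module usually ships with the kernel itself, so a full fix may also require a newer kernel (or HWE kernel) rather than just the userland package.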
Also relevant: https://github.com/openzfs/zfs/issues/11688
Since OpenZFS v2.1.4, the corruption occurs less frequently.
Hi,
I'm experiencing data corruption because of this plugin on my ZFS file system.
My server is running Ubuntu 20.04 with root on ZFS on an NVMe drive, with k3s and this plugin for containerd. I've set up a separate dataset for container images according to the documentation, and the datasets within it sometimes experience corruption when my server gets rebooted.
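For reference, the dataset setup looks roughly like this; the dataset name `rpool/containerd` and the k3s data directory are assumptions on my part, not necessarily the exact paths from the docs:

```shell
# Create a dedicated dataset mounted where the zfs snapshotter looks for it.
# Plain containerd uses /var/lib/containerd/io.containerd.snapshotter.v1.zfs;
# under k3s the containerd state lives below /var/lib/rancher/k3s (assumed path).
sudo zfs create \
  -o mountpoint=/var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.zfs \
  rpool/containerd

# Verify the snapshotter's child datasets appear once images are pulled
sudo zfs list -r rpool/containerd
```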
In an attempt to fix it - since the datasets only hold images that can be re-downloaded anyway - I've stopped k3s and all pods, deleted all images and datasets, and rebooted my server.
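The cleanup procedure can be sketched as follows; the dataset name `rpool/containerd` is an assumption, and whether your crictl build supports `--all` may vary:

```shell
# Stop k3s; the bundled killall script also tears down any remaining pods
sudo systemctl stop k3s
sudo /usr/local/bin/k3s-killall.sh

# Remove all container images via the embedded crictl
sudo k3s crictl rmi --all

# Destroy the image dataset and everything under it
# (rpool/containerd is an assumed name; substitute your own)
sudo zfs destroy -r rpool/containerd

sudo reboot
```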
Before the cleanup:
The cleanup:
After the cleanup:
The ZFS corruption errors persist, and the affected entries have become unidentifiable.
Dataset overview for rpool:
How to proceed from here?