delphix / appliance-build

This repository contains the code used to build the Ubuntu-based Delphix Appliance, leveraging open-source tools such as Debian's live-build, Docker, Ansible, OpenZFS, and others.
Apache License 2.0

DLPX-86854 Post-upgrade cleanup task fails with internal error, due to attempting to delete dataset which has already been deleted #730

Closed palash-gandhi closed 1 year ago

palash-gandhi commented 1 year ago

Problem

See this Jira comment for the RCA: https://delphix.atlassian.net/browse/DLPX-86854?focusedCommentId=702681. Essentially, this code lists all snapshots and filesystems under `rpool/ROOT` and then attempts to destroy them if they are from at least 2 versions before the current one. The code iterates over these filesystems and snapshots in this order:

```
for rootfs in sorted(filesystems + snapshots, key=rootfscmp)[:-2]:
```

So the filesystem (along with all of its snapshots, because of the `-r` option) was destroyed before the loop reached the snapshot, which caused this error.
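The ordering problem can be reproduced with a toy in-memory model. This is only a sketch: the dataset names, the `destroy_recursive` helper, and the `LookupError` are hypothetical stand-ins for the real `zfs` commands, which are not run here.

```python
# Toy model of the failure: a set of names stands in for the ZFS pool.
# All names below are hypothetical and exist only for illustration.
datasets = {
    "rpool/ROOT/delphix.old",
    "rpool/ROOT/delphix.old/root",
    "rpool/ROOT/delphix.old@execute-upgrade.old",
}


def destroy_recursive(name, pool):
    """Mimic `zfs destroy -r`: remove a dataset plus its children and snapshots."""
    if name not in pool:
        raise LookupError(f"cannot open '{name}': dataset does not exist")
    doomed = {
        d for d in pool
        if d == name or d.startswith(name + "/") or d.startswith(name + "@")
    }
    pool.difference_update(doomed)


# The filesystem is visited before its own snapshot, as in the sorted list.
candidates = [
    "rpool/ROOT/delphix.old",
    "rpool/ROOT/delphix.old@execute-upgrade.old",
]

errors = []
for name in candidates:
    try:
        destroy_recursive(name, datasets)
    except LookupError as err:
        # The snapshot was already removed together with its filesystem.
        errors.append(str(err))

print(errors)
```

In the real script the failure surfaced when querying the version of the missing snapshot rather than when destroying it, but the cause is the same: the second candidate no longer exists by the time the loop reaches it.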

Solution

Add a condition that checks whether a filesystem or snapshot still exists before attempting to destroy it.
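A minimal sketch of such a guard is shown below. The testing notes mention an `exists()` function and an `if not exists(rootfs)` block, but their bodies are not shown in this description, so everything here is an assumption: the use of `zfs list` as the existence probe, the suppressed output, the injectable `run` parameter, and the hypothetical `cleanup` wrapper.

```python
import subprocess


def exists(dataset, run=subprocess.run):
    """Return True if `zfs list` can open the given filesystem or snapshot.

    Assumption: a zero exit status from `zfs list <name>` means the dataset
    exists. The `run` parameter is injectable only so the function can be
    exercised without ZFS installed.
    """
    result = run(
        ["zfs", "list", dataset],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return result.returncode == 0


def cleanup(candidates, destroy, exists_fn=exists):
    """Destroy each candidate, skipping any that is already gone.

    `destroy` is a caller-supplied callable (hypothetical); a recursive
    destroy of a filesystem may already have removed later candidates.
    Returns the list of skipped names.
    """
    skipped = []
    for rootfs in candidates:
        if not exists_fn(rootfs):
            skipped.append(rootfs)
            continue
        destroy(rootfs)
    return skipped
```

With this shape, a snapshot that was removed together with its parent filesystem is skipped instead of triggering a `CalledProcessError` later in the loop.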

Testing Done

I tested this by manually running the script on an engine. To set up the test, I created a couple of filesystems and snapshots.

```
delphix@pg-DLPX-86854:~$ sudo zfs get -Hpo value com.delphix:current-version rpool/ROOT/delphix.x8ZpkSW
14.0.0.0-snapshot.20230716095228662+jenkins-ops-appliance-build-develop-post-push-501

// Clone a root FS container and its root FS.
$ sudo zfs snapshot rpool/ROOT/delphix.x8ZpkSW@init-container-snap
$ sudo zfs snapshot rpool/ROOT/delphix.x8ZpkSW/root@init-root-fs-snap

// 13.0 FS and snapshot
$ sudo zfs clone rpool/ROOT/delphix.x8ZpkSW@init-container-snap rpool/ROOT/delphix.pgandhi
$ sudo zfs clone rpool/ROOT/delphix.x8ZpkSW/root@init-root-fs-snap rpool/ROOT/delphix.pgandhi/root
$ sudo zfs set com.delphix:current-version=13.0.0.0 rpool/ROOT/delphix.pgandhi
$ sudo zfs snapshot rpool/ROOT/delphix.pgandhi@execute-upgrade.pg130sn

// 12.0 FS and snapshot
$ sudo zfs clone rpool/ROOT/delphix.x8ZpkSW@init-container-snap rpool/ROOT/delphix.pgasd12
$ sudo zfs clone rpool/ROOT/delphix.x8ZpkSW/root@init-root-fs-snap rpool/ROOT/delphix.pgasd12/root
$ sudo zfs set com.delphix:current-version=12.0.0.0 rpool/ROOT/delphix.pgasd12
$ sudo zfs snapshot rpool/ROOT/delphix.pgasd12@execute-upgrade.pg120sn

// 11.0 FS and snapshot
$ sudo zfs clone rpool/ROOT/delphix.x8ZpkSW@init-container-snap rpool/ROOT/delphix.pgasd11
$ sudo zfs clone rpool/ROOT/delphix.x8ZpkSW/root@init-root-fs-snap rpool/ROOT/delphix.pgasd11/root
$ sudo zfs set com.delphix:current-version=11.0.0.0 rpool/ROOT/delphix.pgasd11
$ sudo zfs snapshot rpool/ROOT/delphix.pgasd11@execute-upgrade.pg110sn

$ zfs list
NAME                              USED  AVAIL  REFER  MOUNTPOINT
rpool                            10.7G  56.7G    64K  none
rpool/ROOT                       10.7G  56.7G    64K  none
rpool/ROOT/delphix.pgandhi          1K  56.7G    64K  none
rpool/ROOT/delphix.pgandhi/root     1K  56.7G  8.45G  none
rpool/ROOT/delphix.pgasd11          1K  56.7G    64K  none
rpool/ROOT/delphix.pgasd11/root     1K  56.7G  8.45G  none
rpool/ROOT/delphix.pgasd12          1K  56.7G    64K  none
rpool/ROOT/delphix.pgasd12/root     1K  56.7G  8.45G  none
rpool/ROOT/delphix.x8ZpkSW       10.7G  56.7G    64K  none
rpool/ROOT/delphix.x8ZpkSW/data   521M  56.7G   521M  legacy
rpool/ROOT/delphix.x8ZpkSW/home  1.69G  56.7G  1.69G  legacy
rpool/ROOT/delphix.x8ZpkSW/log   10.4M  56.7G  10.4M  legacy
rpool/ROOT/delphix.x8ZpkSW/root  8.45G  56.7G  8.45G  /

$ zfs list -t snapshot
NAME                                                 USED  AVAIL  REFER  MOUNTPOINT
rpool/ROOT/delphix.pgandhi@execute-upgrade.pg130sn     0B      -    64K  -
rpool/ROOT/delphix.pgasd11@execute-upgrade.pg110sn     0B      -    64K  -
rpool/ROOT/delphix.pgasd12@execute-upgrade.pg120sn     0B      -    64K  -
rpool/ROOT/delphix.x8ZpkSW@init-container-snap         0B      -    64K  -
rpool/ROOT/delphix.x8ZpkSW/root@init-root-fs-snap    210K      -  8.45G  -
```

(correction) - I tried running the script and realized that it would fail because one of my snapshots has a name that does not conform to what the script expects, so I had to rename `init-container-snap`:

```
$ sudo zfs rename rpool/ROOT/delphix.x8ZpkSW@init-container-snap rpool/ROOT/delphix.x8ZpkSW@execute-upgrade.pg140sn
```

Next, I ran `rootfs-cleanup`:

```
delphix@pg-DLPX-86854:~$ sudo ./rootfs-cleanup
NAME                         USED  AVAIL  REFER  MOUNTPOINT
rpool/ROOT/delphix.pgasd11     1K  56.7G    64K  none
cannot open 'rpool/ROOT/delphix.pgasd11/data': dataset does not exist
cannot open 'rpool/ROOT/delphix.pgasd11/home': dataset does not exist
cannot open 'rpool/ROOT/delphix.pgasd11/log': dataset does not exist
cannot open 'rpool/ROOT/delphix.pgasd11@execute-upgrade.pg110sn': dataset does not exist
NAME                         USED  AVAIL  REFER  MOUNTPOINT
rpool/ROOT/delphix.pgasd12     1K  56.7G    64K  none
cannot open 'rpool/ROOT/delphix.pgasd12/data': dataset does not exist
cannot open 'rpool/ROOT/delphix.pgasd12/home': dataset does not exist
cannot open 'rpool/ROOT/delphix.pgasd12/log': dataset does not exist
cannot open 'rpool/ROOT/delphix.pgasd12@execute-upgrade.pg120sn': dataset does not exist
NAME                         USED  AVAIL  REFER  MOUNTPOINT
rpool/ROOT/delphix.pgandhi     1K  56.7G    64K  none
cannot open 'rpool/ROOT/delphix.pgandhi/data': dataset does not exist
cannot open 'rpool/ROOT/delphix.pgandhi/home': dataset does not exist
cannot open 'rpool/ROOT/delphix.pgandhi/log': dataset does not exist
cannot open 'rpool/ROOT/delphix.pgandhi@execute-upgrade.pg130sn': dataset does not exist
delphix@pg-DLPX-86854:~$ echo $?
0
```

Result:

```
$ zfs list
NAME                              USED  AVAIL  REFER  MOUNTPOINT
rpool                            10.7G  56.7G    64K  none
rpool/ROOT                       10.7G  56.7G    64K  none
rpool/ROOT/delphix.x8ZpkSW       10.7G  56.7G    64K  none
rpool/ROOT/delphix.x8ZpkSW/data   521M  56.7G   521M  legacy
rpool/ROOT/delphix.x8ZpkSW/home  1.69G  56.7G  1.69G  legacy
rpool/ROOT/delphix.x8ZpkSW/log   11.2M  56.7G  11.2M  legacy
rpool/ROOT/delphix.x8ZpkSW/root  8.45G  56.7G  8.45G  /
rpool/crashdump                    29K  34.7G    29K  legacy
rpool/docker                      320K  56.7G   320K  -
rpool/grub                       3.08M  56.7G  3.08M  legacy
rpool/public                       29K  56.7G    29K  /public
rpool/update                       31K  30.0G    31K  /var/dlpx-update
rpool/upgrade-logs                 29K  56.7G    29K  /var/tmp/delphix-upgrade

$ zfs list -t snapshot
NAME                                                 USED  AVAIL  REFER  MOUNTPOINT
rpool/ROOT/delphix.x8ZpkSW@execute-upgrade.pg140sn     0B      -    64K  -
rpool/ROOT/delphix.x8ZpkSW/root@init-root-fs-snap    210K      -  8.45G  -
```

The 13.0 snapshot `execute-upgrade.pg130sn` was correctly destroyed because I had an extra snapshot of the current root container:

```
-> for rootfs in sorted(filesystems + snapshots, key=rootfscmp)[:-2]:
(Pdb) sorted(filesystems + snapshots, key=rootfscmp)
['rpool/ROOT/delphix.pgasd11', 'rpool/ROOT/delphix.pgasd11@execute-upgrade.pg110sn', 'rpool/ROOT/delphix.pgasd12', 'rpool/ROOT/delphix.pgasd12@execute-upgrade.pg120sn', 'rpool/ROOT/delphix.pgandhi', 'rpool/ROOT/delphix.pgandhi@execute-upgrade.pg130sn', 'rpool/ROOT/delphix.x8ZpkSW', 'rpool/ROOT/delphix.x8ZpkSW@execute-upgrade.pg140sn']
```

Without this fix:

```
delphix@pg-DLPX-86854:~$ sudo ./rootfs-cleanup
cannot open 'rpool/ROOT/delphix.pgasd11/data': dataset does not exist
cannot open 'rpool/ROOT/delphix.pgasd11/home': dataset does not exist
cannot open 'rpool/ROOT/delphix.pgasd11/log': dataset does not exist
cannot open 'rpool/ROOT/delphix.pgasd11@execute-upgrade.pg110sn': dataset does not exist
Traceback (most recent call last):
  File "./rootfs-cleanup", line 174, in <module>
    main()
  File "./rootfs-cleanup", line 134, in main
    if dpkgcmp(version(rootfs), "ge", current):
  File "./rootfs-cleanup", line 24, in version
    subprocess.check_output([
  File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['zfs', 'get', '-Hpo', 'value', 'com.delphix:current-version', 'rpool/ROOT/delphix.pgasd11@execute-upgrade.pg110sn']' returned non-zero exit status 1.
```

(Note that the line numbers in this stack trace differ slightly from those in the original bug's stack trace because I simply removed the `if not exists(rootfs)` conditional block and left the `exists()` function in the script.)
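To make the effect of the `[:-2]` slice concrete, the sorted list from the pdb session above can be split by hand. This sketch only reproduces the slicing; it assumes nothing about `rootfscmp` beyond the order it produced in that session, and it ignores the version check the script performs inside the loop.

```python
# Sorted order copied from the pdb session in the testing notes.
ordered = [
    "rpool/ROOT/delphix.pgasd11",
    "rpool/ROOT/delphix.pgasd11@execute-upgrade.pg110sn",
    "rpool/ROOT/delphix.pgasd12",
    "rpool/ROOT/delphix.pgasd12@execute-upgrade.pg120sn",
    "rpool/ROOT/delphix.pgandhi",
    "rpool/ROOT/delphix.pgandhi@execute-upgrade.pg130sn",
    "rpool/ROOT/delphix.x8ZpkSW",
    "rpool/ROOT/delphix.x8ZpkSW@execute-upgrade.pg140sn",
]

# The loop iterates over everything except the last two entries.
visited = ordered[:-2]
excluded = ordered[-2:]

for name in visited:
    print("visited: ", name)
for name in excluded:
    print("excluded:", name)
```

Because the current container and its renamed snapshot occupy the last two slots, the 13.0 filesystem and its `execute-upgrade.pg130sn` snapshot fall inside the visited portion of the list.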