This repository contains the code used to build the Ubuntu-based Delphix Appliance, leveraging open-source tools such as Debian's live-build, Docker, Ansible, OpenZFS, and others.
Apache License 2.0
DLPX-86854 Post-upgrade cleanup task fails with internal error, due to attempting to delete dataset which has already been deleted #730
Problem
See this Jira comment for the RCA: https://delphix.atlassian.net/browse/DLPX-86854?focusedCommentId=702681.
Essentially, this code lists all filesystems and snapshots under `rpool/ROOT` and attempts to destroy those that are at least 2 versions older than the current one. It iterates over these filesystems and snapshots in this order:
```
for rootfs in sorted(filesystems + snapshots, key=rootfscmp)[:-2]:
```
So the filesystem was destroyed first, together with all of its snapshots because of the `-r` option. When the loop then reached the snapshot, it no longer existed, which caused this error.
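The failure mode can be reproduced without ZFS. The following is a minimal sketch, not the actual `rootfs-cleanup` code: the `destroy_recursive` and `cleanup` helpers are hypothetical, a Python set stands in for the pool, and the dataset names are borrowed from the test setup described under Testing Done.

```python
def destroy_recursive(pool, name):
    """Mimic `zfs destroy -r`: drop the dataset, its children, and all their snapshots."""
    doomed = {d for d in pool if d == name or d.startswith((name + "/", name + "@"))}
    pool -= doomed  # mutates the caller's set in place
    return doomed


def cleanup(pool, candidates):
    """Walk candidates in order, failing on a missing dataset like the unfixed script did."""
    for rootfs in candidates:
        if rootfs not in pool:
            # Stand-in for the CalledProcessError raised when `zfs get` hits a missing dataset.
            raise LookupError(f"dataset does not exist: {rootfs}")
        destroy_recursive(pool, rootfs)


pool = {
    "rpool/ROOT/delphix.pgasd11",
    "rpool/ROOT/delphix.pgasd11/root",
    "rpool/ROOT/delphix.pgasd11@execute-upgrade.pg110sn",
}
# The filesystem sorts before its own snapshot, so it is processed first.
candidates = [
    "rpool/ROOT/delphix.pgasd11",
    "rpool/ROOT/delphix.pgasd11@execute-upgrade.pg110sn",
]

try:
    cleanup(pool, candidates)
    failed_on = None
except LookupError as err:
    failed_on = str(err)

print(failed_on)
```

In this simulation the first iteration empties the set, so the second iteration fails on the snapshot, mirroring the order of events described above.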
Solution
Add a condition that checks whether a filesystem or snapshot still exists before attempting to destroy it.
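As a rough illustration of such a guard, here is a sketch under stated assumptions. The PR does name an `exists()` function, but its body is not shown here, so the implementation below (a quiet `zfs list` call whose exit status is checked), the injectable `run` parameter, and the `datasets_to_destroy` helper are all assumptions made for this example.

```python
import subprocess


def exists(dataset, run=subprocess.run):
    """Return True if `zfs list` can open the dataset (filesystem or snapshot)."""
    result = run(
        ["zfs", "list", "-H", "-o", "name", dataset],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,  # a non-zero exit status means "missing", not a crash
    )
    return result.returncode == 0


def datasets_to_destroy(candidates, exists_fn=exists):
    """Yield only the candidates that are still present when their turn comes."""
    for rootfs in candidates:
        if not exists_fn(rootfs):
            continue  # already removed, e.g. by an earlier recursive destroy
        yield rootfs
```

Because the check runs lazily, one candidate at a time, a snapshot that disappeared when its parent filesystem was destroyed is skipped instead of being queried.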
Testing Done
I tested this by manually running the script on an engine. To set up the test, I created a few filesystems and snapshots.
```
delphix@pg-DLPX-86854:~$ sudo zfs get -Hpo value com.delphix:current-version rpool/ROOT/delphix.x8ZpkSW
14.0.0.0-snapshot.20230716095228662+jenkins-ops-appliance-build-develop-post-push-501
// Clone a root FS container and its root FS.
$ sudo zfs snapshot rpool/ROOT/delphix.x8ZpkSW@init-container-snap
$ sudo zfs snapshot rpool/ROOT/delphix.x8ZpkSW/root@init-root-fs-snap
// 13.0 FS and snapshot
$ sudo zfs clone rpool/ROOT/delphix.x8ZpkSW@init-container-snap rpool/ROOT/delphix.pgandhi
$ sudo zfs clone rpool/ROOT/delphix.x8ZpkSW/root@init-root-fs-snap rpool/ROOT/delphix.pgandhi/root
$ sudo zfs set com.delphix:current-version=13.0.0.0 rpool/ROOT/delphix.pgandhi
$ sudo zfs snapshot rpool/ROOT/delphix.pgandhi@execute-upgrade.pg130sn
// 12.0 FS and snapshot
$ sudo zfs clone rpool/ROOT/delphix.x8ZpkSW@init-container-snap rpool/ROOT/delphix.pgasd12
$ sudo zfs clone rpool/ROOT/delphix.x8ZpkSW/root@init-root-fs-snap rpool/ROOT/delphix.pgasd12/root
$ sudo zfs set com.delphix:current-version=12.0.0.0 rpool/ROOT/delphix.pgasd12
$ sudo zfs snapshot rpool/ROOT/delphix.pgasd12@execute-upgrade.pg120sn
// 11.0 FS and snapshot
$ sudo zfs clone rpool/ROOT/delphix.x8ZpkSW@init-container-snap rpool/ROOT/delphix.pgasd11
$ sudo zfs clone rpool/ROOT/delphix.x8ZpkSW/root@init-root-fs-snap rpool/ROOT/delphix.pgasd11/root
$ sudo zfs set com.delphix:current-version=11.0.0.0 rpool/ROOT/delphix.pgasd11
$ sudo zfs snapshot rpool/ROOT/delphix.pgasd11@execute-upgrade.pg110sn
$ zfs list
NAME USED AVAIL REFER MOUNTPOINT
rpool 10.7G 56.7G 64K none
rpool/ROOT 10.7G 56.7G 64K none
rpool/ROOT/delphix.pgandhi 1K 56.7G 64K none
rpool/ROOT/delphix.pgandhi/root 1K 56.7G 8.45G none
rpool/ROOT/delphix.pgasd11 1K 56.7G 64K none
rpool/ROOT/delphix.pgasd11/root 1K 56.7G 8.45G none
rpool/ROOT/delphix.pgasd12 1K 56.7G 64K none
rpool/ROOT/delphix.pgasd12/root 1K 56.7G 8.45G none
rpool/ROOT/delphix.x8ZpkSW 10.7G 56.7G 64K none
rpool/ROOT/delphix.x8ZpkSW/data 521M 56.7G 521M legacy
rpool/ROOT/delphix.x8ZpkSW/home 1.69G 56.7G 1.69G legacy
rpool/ROOT/delphix.x8ZpkSW/log 10.4M 56.7G 10.4M legacy
rpool/ROOT/delphix.x8ZpkSW/root 8.45G 56.7G 8.45G /
$ zfs list -t snapshot
NAME USED AVAIL REFER MOUNTPOINT
rpool/ROOT/delphix.pgandhi@execute-upgrade.pg130sn 0B - 64K -
rpool/ROOT/delphix.pgasd11@execute-upgrade.pg110sn 0B - 64K -
rpool/ROOT/delphix.pgasd12@execute-upgrade.pg120sn 0B - 64K -
rpool/ROOT/delphix.x8ZpkSW@init-container-snap 0B - 64K -
rpool/ROOT/delphix.x8ZpkSW/root@init-root-fs-snap 210K - 8.45G -
```
Correction: on trying to run the script, I realized it would fail because one of my snapshots did not conform to what the script expects, so I had to rename `init-container-snap`:
```
$ sudo zfs rename rpool/ROOT/delphix.x8ZpkSW@init-container-snap rpool/ROOT/delphix.x8ZpkSW@execute-upgrade.pg140sn
```
Next, I ran `rootfs-cleanup`:
```
delphix@pg-DLPX-86854:~$ sudo ./rootfs-cleanup
NAME USED AVAIL REFER MOUNTPOINT
rpool/ROOT/delphix.pgasd11 1K 56.7G 64K none
cannot open 'rpool/ROOT/delphix.pgasd11/data': dataset does not exist
cannot open 'rpool/ROOT/delphix.pgasd11/home': dataset does not exist
cannot open 'rpool/ROOT/delphix.pgasd11/log': dataset does not exist
cannot open 'rpool/ROOT/delphix.pgasd11@execute-upgrade.pg110sn': dataset does not exist
NAME USED AVAIL REFER MOUNTPOINT
rpool/ROOT/delphix.pgasd12 1K 56.7G 64K none
cannot open 'rpool/ROOT/delphix.pgasd12/data': dataset does not exist
cannot open 'rpool/ROOT/delphix.pgasd12/home': dataset does not exist
cannot open 'rpool/ROOT/delphix.pgasd12/log': dataset does not exist
cannot open 'rpool/ROOT/delphix.pgasd12@execute-upgrade.pg120sn': dataset does not exist
NAME USED AVAIL REFER MOUNTPOINT
rpool/ROOT/delphix.pgandhi 1K 56.7G 64K none
cannot open 'rpool/ROOT/delphix.pgandhi/data': dataset does not exist
cannot open 'rpool/ROOT/delphix.pgandhi/home': dataset does not exist
cannot open 'rpool/ROOT/delphix.pgandhi/log': dataset does not exist
cannot open 'rpool/ROOT/delphix.pgandhi@execute-upgrade.pg130sn': dataset does not exist
delphix@pg-DLPX-86854:~$ echo $?
0
```
Result:
```
$ zfs list
NAME USED AVAIL REFER MOUNTPOINT
rpool 10.7G 56.7G 64K none
rpool/ROOT 10.7G 56.7G 64K none
rpool/ROOT/delphix.x8ZpkSW 10.7G 56.7G 64K none
rpool/ROOT/delphix.x8ZpkSW/data 521M 56.7G 521M legacy
rpool/ROOT/delphix.x8ZpkSW/home 1.69G 56.7G 1.69G legacy
rpool/ROOT/delphix.x8ZpkSW/log 11.2M 56.7G 11.2M legacy
rpool/ROOT/delphix.x8ZpkSW/root 8.45G 56.7G 8.45G /
rpool/crashdump 29K 34.7G 29K legacy
rpool/docker 320K 56.7G 320K -
rpool/grub 3.08M 56.7G 3.08M legacy
rpool/public 29K 56.7G 29K /public
rpool/update 31K 30.0G 31K /var/dlpx-update
rpool/upgrade-logs 29K 56.7G 29K /var/tmp/delphix-upgrade
$ zfs list -t snapshot
NAME USED AVAIL REFER MOUNTPOINT
rpool/ROOT/delphix.x8ZpkSW@execute-upgrade.pg140sn 0B - 64K -
rpool/ROOT/delphix.x8ZpkSW/root@init-root-fs-snap 210K - 8.45G -
```
The 13.0 snapshot `execute-upgrade.pg130sn` was correctly destroyed because I had an extra snapshot of the current root container, so the last two entries, which `[:-2]` excludes, both belonged to the current version:
```
-> for rootfs in sorted(filesystems + snapshots, key=rootfscmp)[:-2]:
(Pdb) sorted(filesystems + snapshots, key=rootfscmp)
['rpool/ROOT/delphix.pgasd11', 'rpool/ROOT/delphix.pgasd11@execute-upgrade.pg110sn', 'rpool/ROOT/delphix.pgasd12', 'rpool/ROOT/delphix.pgasd12@execute-upgrade.pg120sn', 'rpool/ROOT/delphix.pgandhi', 'rpool/ROOT/delphix.pgandhi@execute-upgrade.pg130sn', 'rpool/ROOT/delphix.x8ZpkSW', 'rpool/ROOT/delphix.x8ZpkSW@execute-upgrade.pg140sn']
```
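To make the effect of the slice explicit, here is a small sketch that reuses the sorted list from the Pdb session as a literal. The variable names `ordered`, `candidates`, and `kept` are illustrative and do not come from the script.

```python
ordered = [
    "rpool/ROOT/delphix.pgasd11",
    "rpool/ROOT/delphix.pgasd11@execute-upgrade.pg110sn",
    "rpool/ROOT/delphix.pgasd12",
    "rpool/ROOT/delphix.pgasd12@execute-upgrade.pg120sn",
    "rpool/ROOT/delphix.pgandhi",
    "rpool/ROOT/delphix.pgandhi@execute-upgrade.pg130sn",
    "rpool/ROOT/delphix.x8ZpkSW",
    "rpool/ROOT/delphix.x8ZpkSW@execute-upgrade.pg140sn",
]

candidates = ordered[:-2]  # everything the loop iterates over
kept = ordered[-2:]        # the two entries the slice leaves out

print(len(candidates))
print(kept)
```

Here both excluded entries belong to the current root container, which is why the 13.0 filesystem and its snapshot end up among the candidates.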
Without this fix:
```
delphix@pg-DLPX-86854:~$ sudo ./rootfs-cleanup
cannot open 'rpool/ROOT/delphix.pgasd11/data': dataset does not exist
cannot open 'rpool/ROOT/delphix.pgasd11/home': dataset does not exist
cannot open 'rpool/ROOT/delphix.pgasd11/log': dataset does not exist
cannot open 'rpool/ROOT/delphix.pgasd11@execute-upgrade.pg110sn': dataset does not exist
Traceback (most recent call last):
  File "./rootfs-cleanup", line 174, in <module>
    main()
  File "./rootfs-cleanup", line 134, in main
    if dpkgcmp(version(rootfs), "ge", current):
  File "./rootfs-cleanup", line 24, in version
    subprocess.check_output([
  File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['zfs', 'get', '-Hpo', 'value', 'com.delphix:current-version', 'rpool/ROOT/delphix.pgasd11@execute-upgrade.pg110sn']' returned non-zero exit status 1.
```
(Note that the line numbers in this stack trace differ slightly from those in the original bug's stack trace, because I simply removed the `if not exists(rootfs)` conditional block and left the `exists()` function in the script.)