ganeti / instance-debootstrap

Debootstrap instance OS (migrated from http://git.ganeti.org/?p=instance-debootstrap.git;a=summary)
GNU General Public License v2.0

export fails on swap partition #18

Open anarcat opened 2 years ago

anarcat commented 2 years ago

it looks like ganeti is completely failing to export VMs that have a swap device. I created this instance with:

    gnt-instance add \
      -o debootstrap+bullseye \
      -t drbd --no-wait-for-sync \
      --net 0:ip=pool,network=gnt-fsn13-02 \
      --no-ip-check \
      --no-name-check \
      --disk 0:size=10G \
      --disk 1:size=2G,name=swap \
      --backend-parameters memory=1g,vcpus=1 \
      test-01.torproject.org

the swap is created through this hook, which basically does this:

    swapdev=$(eval "echo \$DISK_${i}_PATH")
    mkswap "$swapdev"

when i try to export this instance, ganeti fails with:

    root@fsn-node-02:~# gnt-backup export -n fsn-node-02.torproject.org test-01.torproject.org
    Mon Oct  3 20:35:00 2022 Shutting down instance test-01.torproject.org
    Mon Oct  3 20:35:05 2022 Creating a snapshot of disk/0 on node fsn-node-01.torproject.org
    Mon Oct  3 20:35:06 2022 Creating a snapshot of disk/1 on node fsn-node-01.torproject.org
    Mon Oct  3 20:35:07 2022 Starting instance test-01.torproject.org
    Mon Oct  3 20:35:08 2022 Exporting snapshot/0 from fsn-node-01.torproject.org to fsn-node-02.torproject.org
    Mon Oct  3 20:35:08 2022 Exporting snapshot/1 from fsn-node-01.torproject.org to fsn-node-02.torproject.org
    Mon Oct  3 20:35:11 2022 snapshot/0 is now listening, starting export
    Mon Oct  3 20:35:11 2022 snapshot/1 is now listening, starting export
    Mon Oct  3 20:35:14 2022 snapshot/0 is receiving data on fsn-node-02.torproject.org
    Mon Oct  3 20:35:14 2022 snapshot/0 is sending data on fsn-node-01.torproject.org
    Mon Oct  3 20:35:14 2022  - WARNING: export 'export-disk1-2022-10-03_20_35_14-g26kw0j_' on fsn-node-01.torproject.org failed: Exited with status 1
    Mon Oct  3 20:35:14 2022 snapshot/1 failed to send data: Exited with status 1 (recent output: Cannot interpret kpartx output and get partition mapping\ndd: 0 bytes copied, 0.00339349 s, 0.0 kB/s)
    Mon Oct  3 20:35:14 2022 Removing snapshot of disk/1 on node fsn-node-01.torproject.org
    Mon Oct  3 20:35:15 2022  - WARNING: Aborting import 'import-disk1-2022-10-03_20_35_08-j2owrsjk' on 053e482a-c9f9-49a1-984d-50ae5b4563e6
    Mon Oct  3 20:35:16 2022 snapshot/1 finished receiving data
    Mon Oct  3 20:35:21 2022 snapshot/0 sent 513M, 89.6 MiB/s
    Mon Oct  3 20:35:26 2022 snapshot/0 finished receiving data
    Mon Oct  3 20:35:26 2022 snapshot/0 finished sending data
    Mon Oct  3 20:35:26 2022 Removing snapshot of disk/0 on node fsn-node-01.torproject.org
    Mon Oct  3 20:35:27 2022  - WARNING: Some disk exports have failed; there may be leftover data for instance test-01.torproject.org on node fsn-node-02.torproject.org
    Failure: command execution error:
    Export failed, errors in export finalization, disk export: disk(s) 1

it looks like it's failing on the dreaded map_disk0 code from common.sh:

https://github.com/ganeti/instance-debootstrap/blob/e0df6b1fd25dc3e111851ae42872df0a757ac4a9/common.sh.in#L110-L126

specifically, the output of this command is empty:

    kpartx -l -p-part $blockdev | \
        grep -m 1 -- "-part1 : .*$blockdev" | \
        awk '{print $1}'

if i kill the second disk, the backup works.
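
To make the failure mode easier to see, here is a minimal sketch of the same pipeline wrapped in a function that reports an empty result explicitly. `first_partition_name` is a hypothetical helper, not something that exists in common.sh; it assumes the `kpartx -l` output arrives on stdin and that a device without a partition table (such as swap created directly on the disk) produces no `-part1` line.

```shell
#!/bin/sh
# Hypothetical helper, not part of instance-debootstrap.
# Reads `kpartx -l -p-part` style output on stdin and prints the name of the
# first partition mapping that belongs to the given block device.
# Returns 1 (with a message on stderr) when no "-part1" line is present.
first_partition_name() {
    blockdev=$1
    # Same grep/awk pair as in the issue; an unmatched grep leaves $name empty.
    name=$(grep -m 1 -- "-part1 : .*$blockdev" | awk '{print $1}')
    if [ -z "$name" ]; then
        echo "no partition mapping found for $blockdev" >&2
        return 1
    fi
    printf '%s\n' "$name"
}
```

It would be called as `kpartx -l -p-part "$blockdev" | first_partition_name "$blockdev"`, so the caller can branch on the return code instead of continuing with an empty string.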

so i guess the question is: are backups just broken with multiple devices? or is this specific to swap?

later code in the export function certainly has me worried about backing up anything but a raw ext2+ partition here:

https://github.com/ganeti/instance-debootstrap/blob/e0df6b1fd25dc3e111851ae42872df0a757ac4a9/export#L46-L49

... why don't we just dd the heck out of this anyways, as a fallback?
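
A rough sketch of what such a fallback could look like, under stated assumptions: `probe_cmd` and `fs_export_cmd` are hypothetical callbacks standing in for "can we find a supported filesystem?" and "dump that filesystem", and neither is a real function from the export script. The probe runs before anything is written, so a failed probe cannot leave a half-written stream on stdout.

```shell
#!/bin/sh
# Sketch only: probe_cmd and fs_export_cmd are hypothetical callbacks, not
# functions from instance-debootstrap.

# Stream the whole block device (or file) to stdout, byte for byte.
raw_export() {
    dd if="$1" bs=1048576 2>/dev/null
}

# export_with_fallback DEVICE PROBE_CMD FS_EXPORT_CMD
# Use the filesystem-aware export when the probe succeeds, otherwise fall back
# to a raw copy of the entire device.
export_with_fallback() {
    dev=$1 probe_cmd=$2 fs_export_cmd=$3
    if "$probe_cmd" "$dev" >/dev/null 2>&1; then
        "$fs_export_cmd" "$dev"
    else
        raw_export "$dev"
    fi
}
```

With this shape, a swap disk that fails the probe is still exported, just as raw bytes rather than as a filesystem dump.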

rbott commented 1 year ago

I would actually opt to completely ditch the current import/export scripts and replace them with the versions from the noop OS provider, which essentially use dd to make a bitwise copy of the disk.

This of course creates quite large exports and also exports potentially useless data (e.g. swap), but is guaranteed to work in all cases. The current logic is very prone to fail in various scenarios (differing disk layouts, xfs or other non-ext-filesystems etc.).
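
For illustration, a dd-based export/import pair might look roughly like the sketch below. This is not the actual code of the noop OS scripts; it assumes that Ganeti sets `EXPORT_DEVICE` and `IMPORT_DEVICE` for the respective scripts, that the export writes to stdout and the import reads from stdin, and the function names are made up for this example.

```shell
#!/bin/sh
# Sketch only, not the actual noop OS scripts. EXPORT_DEVICE and IMPORT_DEVICE
# are assumed to be provided by Ganeti when it runs the export/import scripts.

# export: write the raw contents of the device to stdout.
raw_disk_export() {
    dd if="${EXPORT_DEVICE:?EXPORT_DEVICE is not set}" bs=1048576 2>/dev/null
}

# import: write whatever arrives on stdin onto the target device.
raw_disk_import() {
    dd of="${IMPORT_DEVICE:?IMPORT_DEVICE is not set}" bs=1048576 2>/dev/null
}
```

Because nothing here inspects partition tables or filesystems, the same two functions would handle ext4, xfs and swap disks alike, at the cost of copying every block.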

@saschalucas @apoikos @atta what do you think about this?

anarcat commented 1 year ago

it's effectively what we did here.

atta commented 1 year ago

@rbott yes, i'm 100% for your suggestion. i can prepare a pull-request in order to get this common issue out of the way