LINBIT / linstor-server

High Performance Software-Defined Block Storage for container, cloud and virtualisation. Fully integrated with Docker, Kubernetes, Openstack, Proxmox etc.
https://docs.linbit.com/docs/linstor-guide/
GNU General Public License v3.0

Stuck Linstor-to-Linstor backup and phantom snapshots #315

Open stryan opened 2 years ago

stryan commented 2 years ago

Hi all, I've set up a test Linstor environment and run into some issues with shipping snapshots to a secondary cluster. For testing I did the following steps:

  1. Created a test group and volume testgroup_vol1
  2. Added remote Linstor cluster "remote-ls03"
  3. Ran backup ship remote-ls03 testgroup_vol1 testgroup_vol1_remote

This errored out in a similar fashion to #303: I ended up with backups that I couldn't delete, and running backup abort reported success regardless of what happened.

I read through #303, re-added the remotes with specified cluster IDs, and tried again with a separate resource, to no avail. Now I have two sets of frozen backups with the snapshots living on one node, ls01. Since I can't delete the snapshots or abort the backup, I tried the following steps:

  1. Restarted both the source and remote linstor-controllers
  2. Restarted both the source and target hosts
  3. Evacuated and restored the ls01 node in an attempt to dump the snapshot
  4. Shut off linstor-satellite on ls01, ran node lost ls01 on the source controller, then re-added it

This has led me to the current state, where I now have two backup snapshots not attached to any node. I am unable to either abort the backup or delete the snapshots by hand. Both snapshots only show up on the source cluster and are in state "Successful", e.g.:

example2       ┊ back_20220916_201352 ┊           ┊ 0: 64 MiB, 1: 2 GiB ┊           ┊ Successful ┊

Is there any way for me to remove these phantom snapshots without having to reinitialize the cluster?

Environment:
Ubuntu 22.04, Linstor installed from PPAs
Linstor version: 1.19.1-1ppa1~jammy1
DRBD: 9.1.11
Source cluster: 1 combined node and two satellite nodes, with linstor-gateway running
Remote cluster: 1 combined node
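
For anyone trying to spot the same symptom in their own snapshot list, here is a minimal sketch that parses a row like the one quoted above and flags snapshots with an empty node column. The column names and their order are an assumption based on that single row, and parse_snapshot_row and is_phantom are hypothetical helper names, not part of the Linstor client.

```python
# Assumed column order of the snapshot list table; verify it against the
# header line of your own output before relying on it.
COLUMNS = ["ResourceName", "SnapshotName", "NodeNames", "Volumes", "CreatedOn", "State"]


def parse_snapshot_row(line):
    """Split one table row on the '┊' separator into a dict keyed by COLUMNS."""
    # Drop surrounding whitespace and the outer border characters, then split.
    cells = [cell.strip() for cell in line.strip().strip("┊").split("┊")]
    if len(cells) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(cells)}")
    return dict(zip(COLUMNS, cells))


def is_phantom(row):
    """A snapshot that lists no nodes is what this thread calls a phantom."""
    return row["NodeNames"] == ""


if __name__ == "__main__":
    row = parse_snapshot_row(
        "example2       ┊ back_20220916_201352 ┊           ┊ 0: 64 MiB, 1: 2 GiB ┊           ┊ Successful ┊"
    )
    print(row["SnapshotName"], is_phantom(row))
```

This only detects the condition; it does not remove anything, and it says nothing about why the node column is empty.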

stryan commented 2 years ago

Testing on 1.20rc1 shows this issue is mostly resolved. After upgrading both clusters and re-adding the remote, I was able to ship a snapshot to the remote cluster. However, I still ended up with stuck snapshots when there was an issue contacting the remote cluster. I had to restart the controller on the source cluster before I could abort the backup and remove the backup snapshot.

ghernadi commented 2 years ago

Glad to hear 1.20.0rc1 works better - we reworked the abort mechanism to make it more stable.

It would be helpful to have more details about what exactly happened when you ended up with a backup that could not be aborted. Was the target cluster offline from the beginning, or was there a network hiccup at the start of the backup shipment or during the data transfer?

stryan commented 2 years ago

Ah, actually, looking through my scrollback, the backup attempt that created the phantom snapshots was due to me using "backup create" instead of "backup ship" for a Linstor-to-Linstor backup. It might not have been a network issue at all; I'll see if I can reproduce it, but otherwise it might just need better safeguards against trying to create a backup on a Linstor cluster. I will admit it's a bit confusing that S3 backups and Linstor backups are both done through the backup command but work differently (i.e. backup list only works on S3 targets).
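
As an illustration of the kind of safeguard meant here, the following is a rough client-side sketch, not how Linstor itself handles it. It assumes the caller already knows whether a remote is an S3 or a Linstor remote. The function backup_command is hypothetical, and the argument order for "backup create" (remote, then resource) is an assumption; the "backup ship" order follows the command used earlier in this thread.

```python
def backup_command(remote_name, remote_type, resource, target_resource=None):
    """Build the client argv for a backup and refuse mismatched remote types.

    remote_type is supplied by the caller ("s3" or "linstor"); nothing here
    queries the controller, and nothing is executed.
    """
    if remote_type == "linstor":
        # Linstor-to-Linstor: ship the snapshot, which needs a target resource name.
        if target_resource is None:
            raise ValueError("a Linstor remote needs a target resource name")
        return ["linstor", "backup", "ship", remote_name, resource, target_resource]
    if remote_type == "s3":
        # S3: create a backup; a target resource name makes no sense here.
        if target_resource is not None:
            raise ValueError("an S3 remote does not take a target resource name")
        return ["linstor", "backup", "create", remote_name, resource]
    raise ValueError(f"unknown remote type: {remote_type!r}")


if __name__ == "__main__":
    print(backup_command("remote-ls03", "linstor", "testgroup_vol1", "testgroup_vol1_remote"))
```

Returning the argument list instead of running it keeps the check easy to test and leaves the decision to execute with the caller.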