LINBIT / linstor-server

High Performance Software-Defined Block Storage for container, cloud and virtualisation. Fully integrated with Docker, Kubernetes, Openstack, Proxmox etc.
https://docs.linbit.com/docs/linstor-guide/
GNU General Public License v3.0

Snapshot rollback failed #418

Open 1563932024 opened 2 months ago

1563932024 commented 2 months ago

When a node shuts down unexpectedly and a snapshot is created while it is offline, rolling back to that snapshot fails, even after the node has come back online.

[root@stor1 ~]# linstor n l
╭────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Node  ┊ NodeType  ┊ Addresses               ┊ State                                        ┊
╞════════════════════════════════════════════════════════════════════════════════════════════╡
┊ stor1 ┊ SATELLITE ┊ 10.0.0.225:3366 (PLAIN) ┊ Online                                       ┊
┊ stor2 ┊ SATELLITE ┊ 10.0.0.170:3366 (PLAIN) ┊ Online                                       ┊
┊ stor3 ┊ SATELLITE ┊ 10.0.0.240:3366 (PLAIN) ┊ OFFLINE (Auto-eviction: 2024-09-03 17:38:12) ┊
╰────────────────────────────────────────────────────────────────────────────────────────────╯
To cancel automatic eviction please consider the corresponding DrbdOptions/AutoEvict* properties on controller and / or node level
See 'linstor controller set-property --help' or 'linstor node set-property --help' for more details
[root@stor1 ~]#
[root@stor1 ~]# linstor s c test1 snapshot2
WARNING:
    Snapshot for resource 'test1' will not be created on node 'stor3' because that node is currently offline.
SUCCESS:
Description:
    New snapshot 'snapshot2' of resource 'test1' registered.
Details:
    Snapshot 'snapshot2' of resource 'test1' UUID is: 93d33a54-c354-41b7-9d4d-f2b9611c5388
SUCCESS:
    (stor2) Resource 'test1' [DRBD] adjusted.
SUCCESS:
    Suspended IO of '[test1]' on 'stor2' for snapshot
SUCCESS:
    (stor1) Resource 'test1' [DRBD] adjusted.
SUCCESS:
    Suspended IO of '[test1]' on 'stor1' for snapshot
SUCCESS:
    (stor1) Resource 'test1' [DRBD] adjusted.
SUCCESS:
    (stor1) Snapshot [ZFS-Thin] with name 'snapshot2' of resource 'test1', volume number 0 created.
SUCCESS:
    Took snapshot of '[test1]' on 'stor1'
SUCCESS:
    (stor2) Resource 'test1' [DRBD] adjusted.
SUCCESS:
    (stor2) Snapshot [ZFS-Thin] with name 'snapshot2' of resource 'test1', volume number 0 created.
SUCCESS:
    Took snapshot of '[test1]' on 'stor2'
SUCCESS:
    (stor2) Resource 'test1' [DRBD] adjusted.
SUCCESS:
    Resumed IO of '[test1]' on 'stor2' after snapshot
SUCCESS:
    (stor1) Resource 'test1' [DRBD] adjusted.
SUCCESS:
    Resumed IO of '[test1]' on 'stor1' after snapshot
[root@stor1 ~]# linstor s l
╭───────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ SnapshotName   ┊ NodeNames           ┊ Volumes  ┊ CreatedOn           ┊ State      ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ test1        ┊ snapshot2      ┊ stor1, stor2        ┊ 0: 5 GiB ┊ 2024-09-03 17:28:37 ┊ Successful ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────╯

At this point the node (stor3) is back online, but the snapshot rollback fails:

[root@stor1 ~]# linstor r l
╭───────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node  ┊ Port ┊ Usage  ┊ Conns ┊    State ┊ CreatedOn           ┊
╞═══════════════════════════════════════════════════════════════════════════════╡
┊ test1        ┊ stor1 ┊ 7000 ┊ Unused ┊ Ok    ┊ UpToDate ┊ 2024-09-03 14:28:07 ┊
┊ test1        ┊ stor2 ┊ 7000 ┊ Unused ┊ Ok    ┊ UpToDate ┊ 2024-09-03 14:28:07 ┊
┊ test1        ┊ stor3 ┊ 7000 ┊ Unused ┊ Ok    ┊ UpToDate ┊ 2024-09-03 14:28:07 ┊
╰───────────────────────────────────────────────────────────────────────────────╯
[root@stor1 ~]# linstor s rb test1 snapshot2
ERROR:
Description:
    Snapshot 'snapshot2' of resource 'test1' on node 'stor3' not found.
Details:
    Resource: test1, Snapshot: snapshot2
Show reports:
    linstor error-reports show 66D6C9AC-00000-000000

ghernadi commented 2 months ago

Hello,

Yes, there are some known limitations in the rollback implementation. We already have a few ideas for how this could be improved in the future.

For now, you can temporarily delete the resource from the stor3 node, run the rollback command, and then re-create the resource on stor3, which will receive the (rolled-back) data from the other two nodes.
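As a rough sketch, the workaround could be scripted as follows. The helper name `rollback_without_node` and the `LINSTOR` override variable are hypothetical, and the storage pool name is an assumption, since the thread does not show which pool backs `test1` on stor3; check `linstor storage-pool list` before re-creating the resource.

```shell
# Hypothetical helper: roll back a snapshot that is missing on one node.
# Arguments: node, resource, snapshot, storage pool (pool name is an assumption).
rollback_without_node() {
    node="$1"; res="$2"; snap="$3"; pool="$4"
    # LINSTOR may be overridden, e.g. LINSTOR="echo linstor" for a dry run.
    cli="${LINSTOR:-linstor}"
    # 1. Temporarily remove the replica that has no copy of the snapshot.
    $cli resource delete "$node" "$res" || return 1
    # 2. Roll back on the remaining nodes, which all hold the snapshot.
    $cli snapshot rollback "$res" "$snap" || return 1
    # 3. Re-create the replica; DRBD resyncs it from the rolled-back peers.
    $cli resource create "$node" "$res" --storage-pool "$pool"
}

# Example (dry run, prints the commands instead of executing them):
#   LINSTOR="echo linstor" rollback_without_node stor3 test1 snapshot2 <pool>
```

Running it first as a dry run makes it easy to review the three commands before they touch the cluster.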

Alternatively, instead of rolling back, you could restore the given snapshot into a new resource, but this approach might not fit your use case.
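A minimal sketch of the restore alternative, based on the snapshot restore sequence in the LINSTOR User's Guide. The helper name `restore_to_new_resource`, the `LINSTOR` override variable, and the target resource name are hypothetical; exact options may differ between client versions, so verify them with `linstor snapshot --help`.

```shell
# Hypothetical helper: restore a snapshot into a new resource instead of rolling back.
# Arguments: source resource, snapshot, new resource name (name is an assumption).
restore_to_new_resource() {
    res="$1"; snap="$2"; new="$3"
    # LINSTOR may be overridden, e.g. LINSTOR="echo linstor" for a dry run.
    cli="${LINSTOR:-linstor}"
    # 1. Create an empty resource definition to receive the snapshot.
    $cli resource-definition create "$new" || return 1
    # 2. Copy the volume definitions (sizes) from the snapshot.
    $cli snapshot volume-definition restore \
        --from-resource "$res" --from-snapshot "$snap" --to-resource "$new" || return 1
    # 3. Create resources from the snapshot on the nodes that hold it.
    $cli snapshot resource restore \
        --from-resource "$res" --from-snapshot "$snap" --to-resource "$new"
}

# Example (dry run):
#   LINSTOR="echo linstor" restore_to_new_resource test1 snapshot2 test1-restored
```

Because the restored resource is placed on the nodes where the snapshot exists, a node that was offline at snapshot time (stor3 here) would need its replica added separately.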

1563932024 commented 2 months ago

Thank you for your reply.

I'm very interested in LINSTOR. Could you please share the approach and plan for addressing this issue, and roughly when a fix can be expected?