[storage] anvil-manage-server-storage must be able to handle drbd resync during grow

fabbione commented 1 month ago

This is not a super common situation, but regardless it needs to be handled properly or storage is leaked during grow processes.

Create a server, then stop it to resize the root disk (this can happen on any disk; in my test I only had one disk).

Run for the first time: anvil-manage-server-storage --server an-test-deploy1 --grow 5G --disk vda --confirm .... Done!

Wait for the drbd resync to complete <-- IMPORTANT. All good, you can issue the command again:

anvil-manage-server-storage --server an-test-deploy1 --grow 5G --disk vda --confirm .... Done!

and it will work as expected.

Wait for the drbd resync to complete <-- IMPORTANT. All good, you can issue:

anvil-manage-server-storage --server an-test-deploy1 --grow 30G --disk vda --confirm ... Done!

and issue the same command IMMEDIATELY after:

# anvil-manage-server-storage --server an-test-deploy1 --grow 30G --disk vda --confirm
Working with the server: [an-test-deploy1], UUID: [d5af3b99-8e57-418f-99d6-90f74372ff78]
- Target: [vda], boot: [01], path: [/dev/drbd/by-res/an-test-deploy1/0], Available space: [130.00 GiB]
- Preparing to grow the storage by: [30.00GiB]...
 - Extending local LV: [/dev/anvil-test-vg/an-test-deploy1_0]...
Done!
 - Extending peer: [an-a01n02:/dev/anvil-test-vg/an-test-deploy1_0], via: [10.201.10.2 (bcn1)]
Done!
- Extending backing devices complete. Now extending DRBD resource/volume...
 Error!
[ Failed ] - When trying to grow the DRBD device: [an-test-deploy1/0]
[ Failed ] - using the command: [/usr/sbin/drbdadm resize an-test-deploy1/0]
[ Failed ] - The return code: [10] was received, expected '0'. Output, if any:
==========
print $output!#
==========
The extension of the resource is incomplete, manual intervention is required!!
[ Note ] - All backing devices have been grown. Manually resolving the drbd grow
[ Note ] - error should complete the drive expansion!

This issue is caused by the drbd resource refusing a resize while one is already in flight. At this point we are leaking storage.

The lv has been resized, but drbd will not see it or recognize it.

Storage is leaked any time a drbd resize request fails; this is just one possible trigger.

For the grow operation specifically, either check the drbd status BEFORE resizing the lv and exit 1 if a resync is in progress (to avoid leaking), or loop until the first sync completes before issuing the next resize.
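A rough sketch of both options in Python (not the tool's actual code, which lives in the Anvil! code base). It assumes the text printed by `drbdadm status <resource>` carries `disk:<state>` and `peer-disk:<state>` tokens, and treats anything other than UpToDate as "not safe to resize yet". The function names, the polling interval and the timeout are made up for illustration.

```python
import re
import subprocess
import time


def disk_states(status_text):
    """Collect every disk:/peer-disk: state found in the status text."""
    return re.findall(r"\b(?:peer-)?disk:(\w+)", status_text)


def safe_to_resize(status_text):
    """True only if at least one disk is reported and all are UpToDate."""
    states = disk_states(status_text)
    return bool(states) and all(state == "UpToDate" for state in states)


def read_status(resource):
    """Fetch the status text from drbdadm (needs root on a DRBD node)."""
    result = subprocess.run(
        ["drbdadm", "status", resource],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def wait_until_safe(resource, fetch=read_status, interval=5.0,
                    timeout=3600.0, sleep=time.sleep):
    """Poll until every disk is UpToDate; return False on timeout."""
    waited = 0.0
    while not safe_to_resize(fetch(resource)):
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval
    return True
```

Option one is a single `safe_to_resize(read_status(resource))` call before touching the lv, exiting 1 when it returns False. Option two is `wait_until_safe(resource)` in front of the lv extension.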

digimer commented 1 month ago

What do you mean by “leaking storage”?

fabbione commented 1 month ago

Simple: the lv is resized, but not the drbd device. That means the VM doesn't see the storage, but it is allocated in the lv/lvm. That storage is unavailable for anyone to use.

digimer commented 1 month ago

Ah, that is expected. There's a period of time where it's unavoidable that one LV is grown before the peer node is grown, and DRBD can't be grown until both are grown. If I've started a grow operation, I don't want that space to be available to others to use. The scan-lvm scan agent should see the reduced free space in the VG and drop the available space in the associated storage group.

fabbione commented 1 month ago

That is NOT the issue. The issue is that the lv is grown (correctly), the second drbd resize fails, and nothing is going to trigger another drbd resize to match the new lv size. Hence the space is lost.
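One way to keep a failed resize from being the last word is to retry it until drbd accepts it. A minimal sketch, assuming the `/usr/sbin/drbdadm resize` command from the log above and treating any non-zero return code (such as the 10 seen there) as "try again later". The attempt count and delay are arbitrary, and `resize_with_retry` is a hypothetical helper, not part of anvil-manage-server-storage.

```python
import subprocess
import time


def run_resize(resource):
    """Run drbdadm resize and hand back its return code."""
    return subprocess.run(["/usr/sbin/drbdadm", "resize", resource]).returncode


def resize_with_retry(resource, run=run_resize, attempts=10,
                      delay=30.0, sleep=time.sleep):
    """Retry until the resize returns 0.

    Returns the attempt number that succeeded, or None if all failed.
    """
    for attempt in range(1, attempts + 1):
        if run(resource) == 0:
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None
```

A `None` result is the case where manual intervention is still needed, since the backing lvs are already larger than the drbd device.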

digimer commented 1 month ago

Aaaah, ok, sorry I misunderstood.

digimer commented 1 month ago

ToDo:

Check the drbd device size and compare against LV size when doing resize, and make sure all space is used. If not, do a grow.
On resize, don't even start a resize operation until both/all DRBD resources are UpToDate.

Don't allow a resize job to start until all nodes are online (there is no other way to ensure UpToDate on all DRBD nodes).
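The first ToDo item could look roughly like the sketch below. It reads both sizes with `blockdev --getsize64` and flags a grow when the lv is larger than the drbd device by more than a small allowance. That allowance is an assumption: with internal metadata the drbd device is always slightly smaller than its backing lv, and the 256 MiB default here is a guess, not a value taken from DRBD. All function names are hypothetical.

```python
import subprocess

GIB = 1024 ** 3


def device_bytes(path):
    """Size of a block device in bytes, via blockdev --getsize64."""
    result = subprocess.run(
        ["blockdev", "--getsize64", path],
        capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())


def unclaimed_bytes(lv_bytes, drbd_bytes):
    """Bytes allocated in the lv that the drbd device does not expose."""
    return max(lv_bytes - drbd_bytes, 0)


def needs_drbd_resize(lv_bytes, drbd_bytes, metadata_slack=GIB // 4):
    """True when the gap is too large to be metadata overhead alone."""
    return unclaimed_bytes(lv_bytes, drbd_bytes) > metadata_slack
```

With the paths from the log above, the check would be `needs_drbd_resize(device_bytes("/dev/anvil-test-vg/an-test-deploy1_0"), device_bytes("/dev/drbd/by-res/an-test-deploy1/0"))`, run on each node once all resources are UpToDate.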

ClusterLabs / anvil

[storage] anvil-manage-server-storage must be able to handle drbd resync during grow #748