LINBIT / linstor-server

High Performance Software-Defined Block Storage for container, cloud and virtualisation. Fully integrated with Docker, Kubernetes, Openstack, Proxmox etc.
https://docs.linbit.com/docs/linstor-guide/
GNU General Public License v3.0

Scheduled backup shipping (S3/Linstor) is unusable #303

Open mbleichner opened 2 years ago

mbleichner commented 2 years ago

I'm trying to set up an off-site backup and so far haven't gotten anything to work correctly. The goal is to create a backup destination separate from the main cluster where I can store snapshots of my volumes. The backup site needs to be able to function without network access to the main cluster, so that we have emergency access to our data if the cluster somehow breaks down.

Here's my experience so far:

S3 Backup (MinIO Server)

Linstor Backup (to a separate single node Linstor server)
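
For context, the two remotes above would be registered roughly like this (a sketch with placeholder endpoints, bucket, credentials and resource names; flags are from memory, see linstor remote create --help for the exact arguments):

```
# S3 remote pointing at the MinIO server (MinIO usually needs path-style access)
linstor remote create s3 --use-path-style minio-backup minio.example.com:9000 \
    linstor-backups us-east-1 ACCESS_KEY SECRET_KEY

# LINSTOR remote pointing at the controller of the single-node backup cluster
linstor remote create linstor offsite-backup 192.168.0.50

# Push a backup of a resource to the S3 remote
linstor backup create minio-backup my_resource
```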

What's going on here? Linstor has been rock-solid for volume management and replication so far, but the backup features just don't seem to work at all. On my first experiments, I even bricked my Linstor setup completely (every command would throw NullPointerExceptions related to some deleted remote and the controller wouldn't start anymore).

Are those features just not production ready? Is something wrong in my setup?

Environment:

mbleichner commented 2 years ago

After a little digging through the code and some more trial and error, I found out that the "Unknown Cluster" error message means that the receiver needs to have the sender registered as a remote, even if it cannot actually reach it.
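
In other words, something along these lines on the receiving side (the Cluster/LocalID property name and the --cluster-id flag are from memory, so double-check with linstor remote create linstor --help):

```
# On the sending cluster: look up its cluster ID
linstor controller list-properties | grep -i cluster

# On the receiving cluster: register the sender as a remote under that cluster ID,
# even if the receiver cannot actually reach the sender's controller
linstor remote create linstor --cluster-id <SENDER_CLUSTER_ID> sender-cluster 192.168.0.10
```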

The docs don't seem to mention this and the error message is super confusing because it's produced on the receiving side, but printed on the sending side.

I was now able to linstor backup ship a resource, but it's stuck in the "Restoring" state on the receiving end, with no indication of what the problem is.

Now both the snapshots on the sender and the receiver are stuck. Is there any way to force cancel/delete those stuck processes?

ghernadi commented 2 years ago

Hello,

Regarding your S3 backup tests: For me this sounds like your network could not keep up with uploading the ~20 backups at once. Right now, if Linstor is told to ship X backups, it will start all X backup shipments at once - which is also what happens when all backups are scheduled at the same time. You could try creating a few more schedules at different times and assigning the resources to different schedules to work around these network issues.
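
A sketch of what such staggering could look like, assuming the schedule / backup schedule subcommands of a current client (exact flags may differ, see --help):

```
# Two schedules whose full backups run two hours apart (cron syntax)
linstor schedule create nightly-a '0 1 * * *'
linstor schedule create nightly-b '0 3 * * *'

# Spread the resources across the schedules on the same remote
linstor backup schedule enable myS3remote nightly-a --rd res1
linstor backup schedule enable myS3remote nightly-b --rd res2
```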

Regarding your Linstor backup tests: Thanks for the hints about missing information / mentions in the documentation, we will look into them. Your assumption that both clusters should know each other by cluster-id is correct.

Regarding "already being shipped, please use abort": We know this bug and it is quite hard to track down / reproduce. We will obviously investigate further there, but for now you should be able to "unstuck" both of your clusters by simply restarting both controllers. That should allow you to delete the corresponding snapshots. It should also help with the NullPointerExceptions. Apart from those, are there any other non-NullPointerException-related ErrorReports, maybe from one of the abort attempts, or anything suspicious? (I also don't mind if you zip all ErrorReports and let me go through them. If you don't want to share them publicly, please find my email in my GitHub profile.)
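
To collect those ErrorReports, something like this should do:

```
# List all error reports known to the controller
linstor error-reports list

# Show a single report by its ID from the list output
linstor error-reports show <REPORT_ID>
```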

Regarding your success with backup ship but stuck in "Restoring":

The docs don't seem to mention this and the error message is super confusing because it's produced on the receiving side, but printed on the sending side.

I agree that this might be confusing, but Linstor can only respond to requests. Since there is no client request on the receiving side, all responses have to go to the sending side. However, we will try to make such messages clearer.

I was now able to linstor backup ship a resource, but it's stuck in the "Restoring" state on the receiving end, with no indication of what the problem is.

I might be wrong here, but are you sure the backup was already fully transferred? While the backup is still being shipped, the client is expected to show "Shipping" (on the sender side) or "Restoring" (on the receiver side) in snapshot list. Once the backup is fully shipped, both sides should show "Successful". So maybe what you thought was "stuck" was just the shipment still in progress?

Is there any way to force cancel/delete those stuck processes?

Either way, if the shipment is really stuck (you can check the process list on the sender / receiver to see whether the processes using socat, zfs send / zfs receive, or thin_send / thin_recv are still running), you can always restart the controller; that should get you out of such stuck states.
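
For example (service name assumes the usual systemd packaging):

```
# On sender / receiver: check whether a shipment process is actually still running
ps aux | grep -E 'socat|zfs (send|receive)|thin_send|thin_recv'

# If nothing is running but the state is still stuck, restart the controller
systemctl restart linstor-controller
```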

mbleichner commented 2 years ago

Thanks a lot for your answer.

All our nodes are connected through a 10 GBit network - so in theory, the network shouldn't be a limiting factor here. I'll try to make a few measurements of the network load and report back.
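
A quick raw-throughput check between two nodes (for example with iperf3) should already show whether the link itself is the bottleneck:

```
# On the receiving node
iperf3 -s

# On the sending node
iperf3 -c <receiver-ip>
```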

About the deadlocked snapshots/backups: I tried restarting both controllers (and also the satellites), but the state stays the same.

In the Linstor backup case, I made some progress - this morning I got a new error message in the error-reports that didn't show up on my earlier tries. It indicated that the receiver couldn't establish a connection to the sender. So I added a firewall exception (receiver -> sender) and voila, the snapshot shipping is actually successful on both sides now.
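
For reference, such an exception could look like this with firewalld, assuming the receiver connects back to the sender's controller on the default REST port 3370 (3371 for HTTPS):

```
# On the sender's firewall: allow the receiving controller to connect back
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="<receiver-ip>" port port="3370" protocol="tcp" accept'
firewall-cmd --reload
```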

The firewall exception is a bit questionable though - if I'm sending snapshots in one direction, why do I need a firewall rule in the opposite direction (receiver -> sender)? This makes no sense to me and seems a bit counter-productive with regards to security aspects.

ghernadi commented 2 years ago

I'll try to make a few measurements of the network load and report back.

Sounds good. You could also increase the logging level to TRACE to get more details about what Linstor is doing: https://linbit.com/drbd-user-guide/linstor-guide-1_0-en/#s-linstor-logging (step 3 might be the easiest)
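
For example (assuming a client version that has set-log-level; otherwise the level can be set in linstor.toml and the controller restarted):

```
# Raise the controller log level at runtime
linstor controller set-log-level TRACE
```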

The receiver shows all backup snapshots as "Restoring". No resources are being created and the snapshots are undeletable.

Sorry, I was not clear enough before. Restarting the controller should have cleared the internal cache that prevents a shipment from being aborted. So abort should work now, and after that deleting the snapshots should also work.
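
So the intended sequence is something like this (remote, resource and snapshot names are placeholders):

```
systemctl restart linstor-controller

# Abort the stuck shipment, then remove the leftover snapshot
linstor backup abort --create myRemote my_resource
linstor snapshot delete my_resource back_20230101_120000
```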

The firewall exception is a bit questionable though - if I'm sending snapshots in one direction, why do I need a firewall rule in the opposite direction (receiver -> sender)? This makes no sense to me and seems a bit counter-productive with regards to security aspects.

I agree on that. I will have to take a look into the code, but I think this is something like a callback mechanism through which the receiver informs the sender that receiving is either done or ready to start...

mbleichner commented 2 years ago

Restarting the controller should have cleared the internal cache that prevents a shipment from being aborted. So abort should work now, and after that deleting the snapshots should also work.

Unfortunately, this doesn't work either - after the restart, linstor backup abort reports success, but when I try to delete the snapshot, I still get "ERROR: Snapshot definition ... of resource ... is currently being shipped as a backup. Please wait until the shipping is finished or use backup abort --create". No additional errors show up in linstor error-reports list.

ghernadi commented 2 years ago

I think I found at least one place in the code that contributes to this issue, but we will need to investigate further. Right now our backup shipping expert is on vacation, but we will certainly address this issue again.

Anyway, unfortunately I am not sure what we can do right now other than manipulating the database directly to get out of this situation. We have not found a reliable way to reproduce such backup abort issues.

Is this just a test-setup or do you need help fixing this issue via database manipulation?

mbleichner commented 2 years ago

At the moment, it's still a test setup and right now I'm wiping it to start over with a fresh install.

mbleichner commented 2 years ago

Next update:

I reset my main cluster and tried the L2L scenario again. Manual shipping between clusters via linstor backup ship worked like a charm after the remote access issues had been sorted out.
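
For reference, the manual shipment boils down to a single command (names are placeholders; see linstor backup ship --help for the exact arguments):

```
# Ship resource "my_resource" to the cluster behind the "offsite-backup" remote
linstor backup ship offsite-backup my_resource my_resource
```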

Then I tried to set up a schedule for backup automation. Here's what happened:

emteelb commented 2 years ago

Based on your feedback and details in this issue, we have made some changes to the "Shipping Snapshots Within LINSTOR" section of the LINSTOR User's Guide that hopefully draw better attention to the requirement of creating a linstor remote (for the source cluster) on the target cluster.

Also added words about the "Unknown Cluster" error message so that in the future, someone might be able to land on this section in the User's Guide from a search for those words.

Thank you for the feedback and details that you provided.

mbleichner commented 2 years ago

Good to hear. I have given up my experiments for now, though.

It seemed to me that Linstor just isn't made for my use case - as soon as a lot of stuff happens simultaneously, everything starts to break down:

Doing everything one by one works very well, but as soon as multiple things happen at once, there seem to be some issues with race conditions or locks/timeouts.

emteelb commented 2 years ago

I can only speak to the documentation. Others on the team here will be investigating the code, based on the details that you provided in this issue.

I expect that they will post updates here as they make progress.

Thanks again.

jokucera commented 2 years ago

So we finally had time to look deeper into this issue. The only way we were able to reproduce the drastic slowdowns during backup shipping that you described was by having a machine run out of RAM. This is definitely an issue and we will be implementing ways to work around it other than having to somehow increase the RAM of the machines involved, but if you are certain that your machines had enough RAM, it would be good to know so that we can investigate further.

Also, it turned out that zstd was a big RAM-eater, so an option to disable zstd for backup shipping will be coming as well. Other than that, the error reports you sent us were very helpful and allowed us to fix several bugs - these changes will be in the next release.