LINBIT / linstor-server

High Performance Software-Defined Block Storage for container, cloud and virtualisation. Fully integrated with Docker, Kubernetes, Openstack, Proxmox etc.
https://docs.linbit.com/docs/linstor-guide/
GNU General Public License v3.0

Scheduled backup shipping (S3/Linstor) is unusable #303

Open mbleichner opened 2 years ago

mbleichner commented 2 years ago

I'm trying to set up an off-site backup and so far haven't gotten anything to work correctly. The goal is to create a backup destination separate from the main cluster where I can store snapshots of my volumes. The backup site needs to be able to function without network access to the main cluster, so that we have emergency access to our data if the cluster somehow breaks down.

Here's my experience so far:

S3 Backup (MinIO Server)

Linstor Backup (to a separate single node Linstor server)
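
For context, the two remotes above would be registered roughly like this (a sketch with placeholder endpoints, bucket, credentials and resource names; flags are from memory, see linstor remote create --help for the exact arguments):

```
# S3 remote pointing at the MinIO server (MinIO usually needs path-style access)
linstor remote create s3 --use-path-style minio-backup minio.example.com:9000 \
    linstor-backups us-east-1 ACCESS_KEY SECRET_KEY

# LINSTOR remote pointing at the controller of the single-node backup cluster
linstor remote create linstor offsite-backup 192.168.0.50

# Push a backup of a resource to the S3 remote
linstor backup create minio-backup my_resource
```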

What's going on here? Linstor has been rock-solid for volume management and replication so far, but the backup features just don't seem to work at all. On my first experiments, I even bricked my Linstor setup completely (every command would throw NullPointerExceptions related to some deleted remote and the controller wouldn't start anymore).

Are those features just not production ready? Is something wrong in my setup?

Environment:

mbleichner commented 2 years ago

After a little digging through the code and some more trial and error, I found out that the "Unknown Cluster" error message means that the receiver needs to have the sender registered as a remote, even if it cannot actually reach it.
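
In other words, something along these lines on the receiving side (the Cluster/LocalID property name and the --cluster-id flag are from memory, so double-check with linstor remote create linstor --help):

```
# On the sending cluster: look up its cluster ID
linstor controller list-properties | grep -i cluster

# On the receiving cluster: register the sender as a remote under that cluster ID,
# even if the receiver cannot actually reach the sender's controller
linstor remote create linstor --cluster-id <SENDER_CLUSTER_ID> sender-cluster 192.168.0.10
```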

The docs don't seem to mention this and the error message is super confusing because it's produced on the receiving side, but printed on the sending side.

I was now able to linstor backup ship a resource, but it's stuck in the "Restoring" state on the receiving end, with no indication of what the problem is.

Now both the snapshots on the sender and the receiver are stuck. Is there any way to force cancel/delete those stuck processes?

ghernadi commented 2 years ago

Hello,

Regarding your S3 backup tests: For me this sounds like your network could not keep up with uploading the ~20 backups at once. Right now, if Linstor is told to ship X backups, it will start all X backup shipments at once - which is also what happens when all backups are scheduled at the same time. You could try creating a few more schedules at different times and assigning the resources to different schedules to work around these network issues.
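
A sketch of what such staggering could look like, assuming the schedule / backup schedule subcommands of a current client (exact flags may differ, see --help):

```
# Two schedules whose full backups run two hours apart (cron syntax)
linstor schedule create nightly-a '0 1 * * *'
linstor schedule create nightly-b '0 3 * * *'

# Spread the resources across the schedules on the same remote
linstor backup schedule enable myS3remote nightly-a --rd res1
linstor backup schedule enable myS3remote nightly-b --rd res2
```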

Regarding your Linstor backup tests: Thanks for the hints about missing information / mentions in the documentation, we will look into them. Your assumption that both clusters should know each other by cluster-id is correct.

Regarding "already being shipped, please use abort": We know this bug and it is quite hard to track down / reproduce. We will obviously investigate further there, but for now you should be able to "unstuck" both of your clusters by simply restarting both controllers. That should allow you to delete the corresponding snapshots. It should also help with the NullPointerExceptions. Apart from those, are there any other non-NullPointerException-related ErrorReports, maybe from one of the abort attempts, or anything suspicious? (I also don't mind if you zip all ErrorReports and let me go through them. If you don't want to share them publicly, please find my email in my GitHub profile.)
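
To collect those ErrorReports, something like this should do:

```
# List all error reports known to the controller
linstor error-reports list

# Show a single report by its ID from the list output
linstor error-reports show <REPORT_ID>
```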

Regarding your success with backup ship but stuck in "Restoring":

The docs don't seem to mention this and the error message is super confusing because it's produced on the receiving side, but printed on the sending side.

I agree that this might be confusing, but Linstor can only respond to requests. Since there is no client request on the receiving side, all responses have to go to the sending side. However, we will try to make such messages clearer.

I was now able to linstor backup ship a resource, but it's stuck in the "Restoring" state on the receiving end, with no indication of what the problem is.

I might be wrong here, but are you sure the backup was already fully transferred? While the backup is still being shipped, the client is expected to show "Shipping" (on the sender side) or "Restoring" (on the receiver side) in snapshot list. Once the backup is fully shipped, both sides should show "Successful". So maybe what you thought was "stuck" was just the shipment still in progress?

Is there any way to force cancel/delete those stuck processes?

Either way, if the shipment is really stuck (you can check the process list on the sender / receiver to see whether the processes using socat, zfs send / zfs receive, or thin_send / thin_recv are still running), you can always restart the controller; that should get you out of such stuck states.
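
For example (service name assumes the usual systemd packaging):

```
# On sender / receiver: check whether a shipment process is actually still running
ps aux | grep -E 'socat|zfs (send|receive)|thin_send|thin_recv'

# If nothing is running but the state is still stuck, restart the controller
systemctl restart linstor-controller
```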

mbleichner commented 2 years ago

Thanks a lot for your answer.

All our nodes are connected through a 10 GBit network - so in theory, the network shouldn't be a limiting factor here. I'll try to make a few measurements of the network load and report back.
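
A quick raw-throughput check between two nodes (for example with iperf3) should already show whether the link itself is the bottleneck:

```
# On the receiving node
iperf3 -s

# On the sending node
iperf3 -c <receiver-ip>
```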

About the deadlocked snapshots/backups: I tried restarting both controllers (and also the satellites), but the state stays the same.

In the Linstor backup case, I made some progress - this morning I got a new error message in the error-reports that didn't show up on my earlier tries. It indicated that the receiver couldn't establish a connection to the sender. So I added a firewall exception (receiver -> sender) and voila, the snapshot shipping is actually successful on both sides now.
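
For reference, such an exception could look like this with firewalld, assuming the receiver connects back to the sender's controller on the default REST port 3370 (3371 for HTTPS):

```
# On the sender's firewall: allow the receiving controller to connect back
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="<receiver-ip>" port port="3370" protocol="tcp" accept'
firewall-cmd --reload
```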

The firewall exception is a bit questionable though - if I'm sending snapshots in one direction, why do I need a firewall rule in the opposite direction (receiver -> sender)? This makes no sense to me and seems a bit counter-productive with regards to security aspects.

ghernadi commented 2 years ago

I'll try to make a few measurements of the network load and report back.

Sounds good. You could also increase the logging level to TRACE to get more details about what Linstor is doing: https://linbit.com/drbd-user-guide/linstor-guide-1_0-en/#s-linstor-logging (step 3 might be the easiest)
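
For example (assuming a client version that has set-log-level; otherwise the level can be set in linstor.toml and the controller restarted):

```
# Raise the controller log level at runtime
linstor controller set-log-level TRACE
```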

The receiver shows all backup snapshots as "Restoring". No resources are being created and the snapshots are undeletable.

Sorry, I was not clear enough before. Restarting the controller should have cleared the internal cache that prevents a shipment from being aborted. So abort should work now, and after that deleting the snapshots should also work.
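
So the intended sequence is something like this (remote, resource and snapshot names are placeholders):

```
systemctl restart linstor-controller

# Abort the stuck shipment, then remove the leftover snapshot
linstor backup abort --create myRemote my_resource
linstor snapshot delete my_resource back_20230101_120000
```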

The firewall exception is a bit questionable though - if I'm sending snapshots in one direction, why do I need a firewall rule in the opposite direction (receiver -> sender)? This makes no sense to me and seems a bit counter-productive with regards to security aspects.

I agree on that. I will have to take a look into the code, but I think this is something like a callback mechanism through which the receiver informs the sender that receiving is either done or ready to start...

mbleichner commented 2 years ago

Restarting the controller should have cleared the internal cache that prevents a shipment from being aborted. So abort should work now, and after that deleting the snapshots should also work.

Unfortunately, this doesn't work either - after the restart, linstor backup abort reports success, but when I try to delete the snapshot, I still get "ERROR: Snapshot definition ... of resource ... is currently being shipped as a backup. Please wait until the shipping is finished or use backup abort --create". No additional errors show up in linstor error-reports list.

ghernadi commented 2 years ago

I think I found at least one place in the code that contributes to this issue, but we will need to investigate further. Right now our backup shipping expert is on vacation, but we will certainly address this issue again.

Anyway, unfortunately I am not sure what we can do right now other than manipulating the database directly to get out of this situation. We have not found a reliable way to reproduce such backup abort issues.

Is this just a test-setup or do you need help fixing this issue via database manipulation?

mbleichner commented 2 years ago

At the moment, it's still a test setup and right now I'm wiping it to start over with a fresh install.

mbleichner commented 2 years ago

Next update:

I reset my main cluster and tried the L2L scenario again. Manual shipping between clusters via linstor backup ship worked like a charm after the remote access issues had been sorted out.
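
For reference, the manual shipment boils down to a single command (names are placeholders; see linstor backup ship --help for the exact arguments):

```
# Ship resource "my_resource" to the cluster behind the "offsite-backup" remote
linstor backup ship offsite-backup my_resource my_resource
```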

Then I tried to set up a schedule for backup automation. Here's what happened:

emteelb commented 2 years ago

Based on your feedback and details in this issue, we have made some changes to the "Shipping Snapshots Within LINSTOR" section of the LINSTOR User's Guide that hopefully draw better attention to the requirement of creating a linstor remote (for the source cluster) on the target cluster.

Also added words about the "Unknown Cluster" error message so that in the future, someone might be able to land on this section in the User's Guide from a search for those words.

Thank you for the feedback and details that you provided.

mbleichner commented 2 years ago

Good to hear. I have given up my experiments for now, though.

It seemed to me that Linstor just isn't made for my use case - as soon as a lot of stuff happens simultaneously, everything starts to break down:

Doing everything one by one works very well, but as soon as multiple things happen at once, there seem to be some issues with race conditions or locks/timeouts.

emteelb commented 2 years ago

I can only speak to the documentation. Others on the team here will be investigating the code, based on the details that you provided in this issue.

I expect that they will post updates here as they make progress.

Thanks again.

jokucera commented 2 years ago

So we finally had time to look deeper into this issue. The only way we were able to reproduce the drastic slowdowns during backup shipping that you described was by having a machine run out of RAM. This is definitely an issue and we will be implementing ways to work around it other than having to somehow increase the RAM of the machines involved, but if you are certain that your machines had enough RAM, it would be good to know so that we can investigate further.

Also, it turned out that zstd was a big RAM-eater, so an option to disable zstd for backup shipping will be coming as well. Other than that, the error reports you sent us were very helpful and allowed us to fix several bugs - these changes will be in the next release.