LINBIT / linstor-server

High Performance Software-Defined Block Storage for container, cloud and virtualisation. Fully integrated with Docker, Kubernetes, Openstack, Proxmox etc.
https://docs.linbit.com/docs/linstor-guide/
GNU General Public License v3.0

Backup restore is very slow #426

Open jesulo opened 1 month ago

jesulo commented 1 month ago

I'm restoring a backup of a ct whose disk is 500 GB, of which only 80 GB is in use. When the backup is on the local disk, the restore takes three and a half hours; when it is on pbs, it takes 7 hours. Both are too long. Is there a way to reduce the restore time? The volumes are on ZFS with LINSTOR. Regards

ghernadi commented 1 month ago

I am not sure what you are actually looking for?

What is "ct", what is "pbs"?

What do you mean by "When the backup is on the local disk it takes 3 and a half hours"? When you already have the backup available locally, restoring the backup (or rather the snapshot) into a new LINSTOR resource should only take a few seconds, not 3.5h.

What is the download-speed of the satellite that downloads the backup? What would be the time you would expect for 80GB to be downloaded (and why)?

jesulo commented 1 month ago

By "ct" I mean an LXC container (or a Proxmox VM). PBS is the Proxmox Backup Server. Yes, restoring the container backup takes three and a half hours from a local disk, and it takes longer when I restore from PBS. What settings should I change so that it doesn't take so long? How do I see the download speed? The restore log says that the restore speed was 5 Mb. Maybe it's because I used ZFS? Or because of HA replication?
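To put the reported times in perspective, the average throughput can be worked out from the sizes and durations mentioned in this thread. This is only a rough sketch: it assumes decimal units (1 GB = 1000 MB), and it is not clear from the thread whether the restore writes only the 80 GB in use or the full 500 GB volume, so both cases are shown. The function name `throughput_mb_per_s` is made up for this illustration.

```python
def throughput_mb_per_s(size_gb: float, hours: float) -> float:
    """Average throughput in MB/s, assuming decimal units (1 GB = 1000 MB)."""
    return size_gb * 1000 / (hours * 3600)

# Figures taken from the thread: 80 GB used, 500 GB volume, 3.5 h local, 7 h from PBS.
local_used = throughput_mb_per_s(80, 3.5)   # roughly 6.3 MB/s
pbs_used = throughput_mb_per_s(80, 7)       # roughly 3.2 MB/s
local_full = throughput_mb_per_s(500, 3.5)  # roughly 39.7 MB/s if all 500 GB are written

print(f"local restore, 80 GB:  {local_used:.2f} MB/s")
print(f"PBS restore, 80 GB:    {pbs_used:.2f} MB/s")
print(f"local restore, 500 GB: {local_full:.2f} MB/s")
```

If only the used 80 GB are transferred, the local restore averages about 6 MB/s, which is in the same range as the figure reported in the restore log.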

ghernadi commented 1 month ago

If you are restoring from proxmox backup server, I assume the data is getting copied and possibly sent to the other peers via DRBD.

This is more of a performance tuning question than an actual bug, so I would suggest that you do some testing, e.g. try to restore into a resource that has only 1 replica. The idea is that regardless of whether you have DRBD configured or not, if there are no other diskful DRBD peers, the restore operation will not depend on your network speed. If this test is much faster than what you have right now, you will want to investigate network optimizations and DRBD tuning further (for example https://kb.linbit.com/tuning-drbds-resync-controller, but feel free to search for more).
If the results are somewhat similar to what you have right now, the network is not the problem. I doubt that DRBD would be an issue with local writes, so my next suggestion is to check your storage speed by restoring into a storage-only resource. If that is also slow, where to continue the investigation depends on your setup. If you are using VMs, check how the disk I/O is mapped from the virtual machine to the physical hardware and see if you can optimize things there.

From what you have said so far, this does not look like an issue with LINSTOR at all, since LINSTOR is not even in the I/O path in these use cases. My guess is that the bottleneck is the speed of either your network or your storage (check both the storage being read from and the storage being written to).
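The test sequence suggested above can be summarised as a small decision helper. This is a hedged sketch rather than anything LINSTOR provides: the function name, the three timing inputs, and the 2x threshold standing in for "much faster" are all assumptions made for illustration, and the last branch is an inference that the comment above does not state explicitly.

```python
NETWORK = "network / DRBD replication (look at network and DRBD tuning)"
STORAGE = "storage or VM disk I/O mapping"
DRBD_LOCAL = "DRBD on local writes (considered unlikely in the thread)"

def likely_bottleneck(current_s: float, single_replica_s: float,
                      storage_only_s: float, speedup: float = 2.0) -> str:
    """Guess the bottleneck from three measured restore times in seconds.

    current_s:        restore into the existing replicated resource
    single_replica_s: restore into a resource with only 1 replica
    storage_only_s:   restore into a storage-only resource (no DRBD)
    speedup:          assumed ratio that counts as "much faster"
    """
    # Much faster without diskful peers: replication over the network is the cost.
    if current_s / single_replica_s >= speedup:
        return NETWORK
    # Similar with one replica, and storage-only is also slow: the storage itself.
    if single_replica_s / storage_only_s < speedup:
        return STORAGE
    # Inference: storage-only is fast but DRBD with one replica is slow.
    return DRBD_LOCAL

# Hypothetical measurements, not taken from the thread.
print(likely_bottleneck(12600, 1500, 1400))
print(likely_bottleneck(12600, 12000, 11500))
```

The timings in the usage lines are invented; real values would come from repeating the restore under each of the three configurations.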

jesulo commented 1 month ago

I changed rs-discard-granularity to 1M, but the slowness continues. I've noticed that the I/O load is very high; during a restore, it even impacts other virtual machines on the same disk. Could you tell me what configuration I could apply so that replication with the other node doesn't affect I/O so much? Can replication be configured as asynchronous, or can its priority be lowered? Which properties do you recommend that I modify? Thanks.