Improve compression performances

schnuffle commented 6 years ago

virt-backup uses the tarfile module to create the compressed tar archives.
the tarfile module is single process only
On my system that leads to a CPU running at 100% and though limiting file the transfer speed

Wit compression set to None i get transfer speeds of around 200MB/s, with xz/gz compression I'm down to 10 MB/s.

I digged a bit into the problem and from my limited experience, I don't see any easy solution but to use compression set to None.

Just wanted to mention my observation.

aruhier commented 6 years ago

Hi, Thanks a lot for the issue!

I saw you forked and maybe tried to implement an lz4 compression. It could be a solution, between others like trying zstd and/or lzo.

In fact, about the CPU consumption needed by the compression, I already thought a bit about this, and did not find it to be a real issue if I add an option to backup multiple vm simultaneously. The CPU consumption of a backup would be predictable, as you would know that X backups max are running, consuming X cores. But indeed, a faster and efficient compression algorithm (and I really think at zstd here, which, imo, would be great for backups) would be a plus.

I am really sorry, I am not actively working on this tool, and try to fix some things and add tests whenever I can and am motivated. If you want to help and this project, I will be very happy to work with you, and it will motivate me a lot to add some new features ;)

schnuffle commented 6 years ago

I've some backup needs for a libvirt based KVM stack and am just testing if virt-backup can do the job. I found one solution so far that clears the mentioned bottleneck:

Use virt-backup without compression
Create a tar archive and use pigz for parrallel compression: tar cf - *.files | pigz > archive.tar.gz

I think the lz4 route is dead already as the my first tests with the python lz4tools module showed a massive RAM consumption. My favorite route would have been to use ZFS with snapshots and zfs send, but I needed to fallback to standard Soft RAID10+LVM. So right now, the pigz hack is workable and maybe the easy solution is to let python pipe the tar stream into pigz. I'll be happy to give feedback about anything that might help improve the software, though I don't see myself as a software developper, I'm more of a integrator :)

Anyway

Thx for the software

Schnuffle

schnuffle commented 6 years ago

Just checked some docs about zstd, sounds like an idea, the ZstdCompressor offer a multithread option.

Regards

Schnuffle

aruhier commented 6 years ago

I'll be happy to give feedback about anything that might help improve the software, though I don't see myself as a software developper, I'm more of a integrator :)

No worries ;)

I would prefer to avoid pigz, as it would require to use tar as a subprocess and not as a python library, which gives me more control during the compression. Maybe I will see if I can just move from the tar library to the direct library of the wanted compression algorithm (xz, zlib, zstd, etc.). I preferred using tar because it support folders in the archive, but it is not really needed. It will probably allow to multithread the compression.

I rename this issue in this goal.

eayin2 commented 5 years ago

I think there is nothing wrong with using subprocceses with pigz. You can use stdout/stdin to control the subprocess.

aruhier commented 5 years ago

At first I thought I would be easier to print the progression, as I compress the images through a stream, on which I have some control on. But the progression doesn't work correctly after all, so I could.

I thought about switching to zstd, that supports multithreading on python, and keeping the actual implementation for "legacy" formats.

I also started to implement some parallelism on VM backups, but I need to finish my tests before merging it: https://github.com/Anthony25/virt-backup/tree/multithreading Tests are in this case really important to ensure that there is no deadlock between backups, and that 2 backups will not be running at the same time for the same domain.

I'm sorry, this project is not under my focus right now. I daily use it, but I can't give any roadmap for new features.

eayin2 commented 5 years ago

Didn't know about zstd, that looks like a good option too. For now I just use uncompressed backups, that worked fine for me, considering my qcow2 are compressed anyways.

aruhier commented 5 years ago

I need to do some real tests in multithreading, but it should be good if you use the multithreaded branch and add this in your config file: https://github.com/Anthony25/virt-backup/commit/2559e720c0f2586e475382a120ca8a9db6cd8250

aruhier commented 5 years ago

I plan to add zstd as a compression method, but it requires some rework/split on the way it's currently handled.

aruhier commented 5 years ago

Hi, I merged the multithread feature in dev. I have tested it on my side for a few days now, working well so far. Do you mind testing it to know if I can merge it in master and draft a new release with that?

It should improve the backup performances if you setup it for multiple threads.

aruhier commented 5 years ago

Linked to #20

aruhier commented 5 years ago

I've moved the logic of adding/extracting an image from a tar or a directory (in what I named a "packager"), which allowed me to include quite easily zstd there. I went to a quite simple way of doing one archive per image, and one folder per backup. Normally, with zstd, I could have pack everything in one archive, but if a user wanted to manually restore a disk, they would have to deal with some offsets and reverse engineering my code. This will be way easier for the user.

It's for not included yet in the backups, as I need to add a shared way of deleting a packager and change a bit the way options are given to the compression algorithm. Normally, it will be possible to configure a certain amount of threads for zstd, for example, but still keeping the thing generic.

aruhier commented 4 years ago

I added zstd in the last release, v0.4.0. The compression performances are way better now!

You can use it by using this configuration for each group:

packager: zstd
packager_opts:
    compression_lvl: 6

Be sure to remove any usage of compression and compression_lvl in your configuration file, it has been deprecated and replaced by these 2.

Edit: and to use zstd, you need to install the zstandard python package:

$ pip3 install zstandard

aruhier / virt-backup

Improve compression performances #15