QubesOS / qubes-issues

The Qubes OS Project issue tracker
https://www.qubes-os.org/doc/issue-tracking/

Multiply speed of backups by not using / storage during backups #1652

Open Rudd-O opened 8 years ago

Rudd-O commented 8 years ago

The bottleneck during backups is the write to /var/tmp. This isn't necessary.

What I learned going through the backup code is fairly simple. You tar a file chunk into a file within /var/tmp (possibly through a compressor pipe and possibly through an encryptor pipe), then you HMAC that file using openssl.

That isn't needed. You can use the tarfile Python module to produce the proper tar file (no need to execute tar into a temporary directory), push that through openssl for encryption, and also push it (with a tee) into both the HMAC process and the final file. The tarfile module lets you specify the path conversion, so you don't need to deal with the string-transformation options on the tar_sparse command line.

This would easily double the speed at which backups happen on my machine, perhaps more. Avoiding writing temporary files in /var/tmp is actually a huge win.

I can't do the code myself for a variety of reasons that are outside the scope of this issue, but I'm happy to explain the workings of the tarfile module.

The pipeline would need to look something like this:

backup_tar = (open the final backup tarfile)
...
for path in vmdir:
   for num, (chunkstart, chunklen) in enumerate(yieldoffsets(path)):
      suffix = "%03d" % num
      ...encrypt/compress the chunk to an anonymous pipe...
      ...tee that pipe into two more pipes: the hmac process stdin and a pipe P...
      ...set the tarinfo fields for the tarred chunk and for its hmac...
      backup_tar.addfile(tarinfo, fileobj=P)
      hmac_output = hmacprocess.stdout.read()
      backup_tar.addfile(tarinfo_hmac, fileobj=io.BytesIO(hmac_output))
...

This is trickier than I portrayed it, but it will provide a several-fold improvement in backup performance, and it will save the lifetime of the SSDs for everyone running Qubes on an SSD.

Rudd-O commented 8 years ago

A better way to do this would be to entirely forego the HMAC via openssl, and do the HMAC within Python. That way the HMAC can be computed as you read from the original file (or the encrypted output from openssl), stored in a memory variable, and then added to the backup output. I just don't know how to do that in Python, but I know it should be doable.

andrewdavidwong commented 8 years ago

A better way to do this would be to entirely forego the HMAC via openssl, and do the HMAC within Python.

Would this preserve the ability to do manual HMAC verification with openssl? One of the main benefits of the current backup system is that it allows non-programmer users to recover their data relatively easily using common tools like tar and openssl on any Linux system.

marmarek commented 8 years ago

The problem is that you don't know the chunk size (needed in tar header) before it's finished. Mostly because of compression, but also because of sparse files. As for HMAC calculation - it isn't such a big issue, as you can simply read from openssl stdout (pipe). Calculating HMAC in Python would also be fine, especially after solving #971

Rudd-O commented 8 years ago

@axon-qubes the only change is how the tar file is written: not with the tar binary, but with tarfile.py. Everything else stays the same. Naturally, the HMAC computed in Python should output the same as openssl's HMAC.
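As a rough sketch of that equivalence (the function name is mine; it assumes the HMAC is keyed directly with the passphrase and uses SHA-512, which is what openssl dgst -sha512 -hmac does):

import hashlib
import hmac

def hmac_while_reading(stream, passphrase, bufsize=64 * 1024):
    """Digest a stream incrementally, with nothing staged in /var/tmp.

    The hex digest should match what `openssl dgst -sha512 -hmac <passphrase>`
    prints for the same data, so manual verification stays possible.
    """
    mac = hmac.new(passphrase, digestmod=hashlib.sha512)
    for block in iter(lambda: stream.read(bufsize), b""):
        mac.update(block)
    return mac.hexdigest()

Feeding this the encrypted chunk as it flows toward the outer tar would give the digest without a temporary file.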

@marmarek could you explain that tar header issue? I don't quite understand it. Thanks.

marmarek commented 8 years ago

File list of the outer archive looks like this:

vm1/private.img.000
vm1/private.img.000.hmac
vm1/private.img.001
vm1/private.img.001.hmac
vm1/private.img.002
vm1/private.img.002.hmac
...

To write any of this to the output stream, first you need to send a file header, which contains the file size. You don't know the file size until that file is fully written, so you can't pipe it directly from the inner tar layer. Theoretically you could cache it in memory instead of /var/tmp, but IMO that isn't a good idea; it can lead to out-of-memory. It also isn't good for speed: right now the backup tool creates the chunk files and sends them to the output device at the same time (so while chunk N+1 is being prepared, chunk N is written to the output device). Piping those two stages together would mean that one would have to wait for the other.
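To make that concrete, a minimal sketch (names are illustrative, not backup.py's): tarfile needs TarInfo.size before it can emit the member header, so the chunk has to be complete somewhere first.

import io
import tarfile

def add_chunk(outer_tar, member_name, chunk_stream):
    """Append one compressed/encrypted chunk to the outer archive.

    The member header carries the size, so the chunk must be read to EOF
    (here: buffered in memory) before addfile() can emit anything.
    """
    data = chunk_stream.read()            # size is unknown until EOF
    info = tarfile.TarInfo(name=member_name)
    info.size = len(data)                 # required before the header goes out
    outer_tar.addfile(info, io.BytesIO(data))

# outer_tar = tarfile.open("backup.tar", mode="w|")   # non-seekable stream mode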

But, if that's only about SSD lifetime, and you have enough RAM, maybe using tmpfs would be enough? /tmp is mounted as tmpfs by default, so simply an option to specify temporary directory would do the job. What do you think @Rudd-O ?

Rudd-O commented 8 years ago

The big files are always max 100 MB in size. It's even hardcoded in line 632 of backup.py. This isn't going to lead anyone into an OOM situation.

If you are worried about garbage collection churn, just allocate the 100 MB as a buffer, and then read into the buffer until it is full. Right now, you're writing to chunkfile_p -- why not modify wait_backup_feedback to write to the buffer instead?

  1. Allocate a bytearray(1024*1024*1024).
  2. Pass the bytearray to wait_backup_feedback.
  3. Make wait_backup_feedback write the output of in_stream into the bytearray (length = in_stream.readinto(bytearray)).
  4. Make wait_backup_feedback return "size_limit" if the length of the read is 1024*1024*1024.
  5. Now you have the data, and you can compute the file name xxxx.XXX, as well as the tarinfo size/attributes. Construct a tarinfo structure and write it out (sketched below).
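A rough sketch of steps 1-5 (function and variable names are mine, not the ones in backup.py):

import io
import tarfile

CHUNK_LIMIT = 1024 * 1024 * 1024            # the buffer size from step 1

def fill_buffer(buf, in_stream):
    """Steps 2-3: read in_stream into the preallocated bytearray; return length."""
    view = memoryview(buf)
    filled = 0
    while filled < len(buf):
        n = in_stream.readinto(view[filled:])
        if not n:
            break
        filled += n
    return filled                            # == CHUNK_LIMIT means "size_limit" (step 4)

def emit_chunk(outer_tar, name, buf, length):
    """Step 5: build the tarinfo for the chunk and append it to the outer tar."""
    info = tarfile.TarInfo(name=name)        # e.g. "vm1/private.img.000"
    info.size = length
    outer_tar.addfile(info, io.BytesIO(memoryview(buf)[:length]))

The bytearray is reused across chunks, so there is no per-chunk allocation churn.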

To improve parallelism and therefore performance, you can push the tarinfo structure into a Queue.Queue of a certain maximum depth, and run the "write the outer tar" process into a thread, consuming tarinfos from the Queue as you go along. You would need a circular buffer, but it's doable. In fact, the hmac process can also read from that buffer, as well as produce tarinfos that get pushed into that queue, so the tarinfo gets computed and written in parallel. I will shortly post pseudocode of that.

This way, the only limits are:

  1. the read speed of the device backing the VMs
  2. the write speed of the device receiving the backup
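The basic shape of that writer thread, as a rough sketch (all names here are invented; the bounded queue depth is arbitrary):

import queue
import threading

def outer_tar_writer(outer_tar, work_queue):
    """Drain (tarinfo, fileobj) pairs from the queue and append them.

    Producers prepare chunks (compress/encrypt/hmac) concurrently; only
    this thread writes to the outer tarfile, so members stay well-formed.
    """
    while True:
        item = work_queue.get()
        if item is None:                      # sentinel: backup finished
            break
        tarinfo, fileobj = item
        outer_tar.addfile(tarinfo, fileobj)

# Bounded depth makes producers block instead of piling chunks up in RAM.
# work_queue = queue.Queue(maxsize=4)
# threading.Thread(target=outer_tar_writer, args=(outer_tar, work_queue)).start()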
marmarek commented 8 years ago

The big files are always max 100 MB in size. It's even hardcoded in line 632 of backup.py. This isn't going to lead anyone into an OOM situation.

But you can have multiple of those files. The maximum number is also hardcoded: 10. So it is 1 GB.

To improve parallelism and therefore performance, you can push the tarinfo structure into a Queue.Queue of a certain maximum depth, and run the "write the outer tar" process into a thread,

This is actually very similar to the current implementation.

Anyway, using tmpfs for those files would give the same performance benefits, and that code would also be much easier to understand and debug.

Rudd-O commented 8 years ago

As promised:

https://gist.github.com/Rudd-O/da8bc169e2cccb3a3707

This thing goes faster than qvm-backup, writes nothing to SSD or HDD, and deals with parallelism (mostly) correctly. You can plug more parallel tasks and connect them with stdin/stdout as you see fit. My choice was lambdas, but hey, what else could I choose given the circumstances?

Given that we're reading from a pipe, my hopes of using mmap() or something like it went poof. Very sad. Anyway, sorry for the time it took me to reply; I spent a few hours getting this together. Here is a more parallel, thread-safe, error-detecting, shitty, incomplete qvm-backup. Ouch.

The big feature of this chunk of code is that it bounds memory consumption to a multiple of (numtasks * Sizereadbuffer), and you can regulate those parameters as the end user (you shouldn't have to, though). With numtasks=4, you can have at most one Sizereadbuffer being read into and three Sizereadbuffers being processed and written to the tarfile at the same time (that's one buffer being filled, one or two separate buffers being written to the tarfile, and the complementary one or two buffers being HMAC'd on their way to the tarfile). The number of tasks (excluding the read task, which happens in the main thread) never exceeds 4 under the default values.

My experiments locally -- look at the time.sleep(5)'s, which simulate slow writes -- show that with Sizereadbuffer=256MB and numtasks=4, you never go above 600 MB of RAM (all swappable). If, as a user, you want the thing to take 1 GB, 256 MB, or whatever, you can tune those parameters. Assuming a decent-sized swap file, and also assuming worst-case conditions, there will be no OOM with numtasks=4 and the 1 GB read size, just a slowdown.

The right thing to do here is to allocate N threads' memory buffers, but I could not find a way to make that work with tarfile.py. That sucks, really, but it's a limitation that can be removed by writing the right code; the parallelism framework is there to make it work. The optimal values are N being the lower of the number of processors and the number of Sizereadbuffers that fit in dom0's memory divided by half the number of processors, and of course Sizereadbuffer being the largest chunk of contiguously allocatable RAM without swapping, divided by two.
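As a rough sketch of that buffer-recycling idea (the constants and names below are stand-ins for the Sizereadbuffer/numtasks knobs described above, not the actual code in the gist):

import queue

SIZE_READ_BUFFER = 256 * 1024 * 1024    # the 256 MB setting mentioned above
NUM_TASKS = 4

# A fixed pool of reusable buffers caps memory near NUM_TASKS * SIZE_READ_BUFFER:
# the reader blocks on get() until a worker hands a buffer back to the pool.
buffer_pool = queue.Queue(maxsize=NUM_TASKS)
for _ in range(NUM_TASKS):
    buffer_pool.put(bytearray(SIZE_READ_BUFFER))

# reader:  buf = buffer_pool.get(); n = in_stream.readinto(buf)
# worker:  ... hmac/write the chunk ...; buffer_pool.put(buf)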

Using tmpfs would probably help a lot. But there is no reason I, the end user, should have to shadow /var/tmp on the whole system with a tmpfs just so I can run a backup. This should work right out of the box. So let's not force hacks on users.

(Yes, there's the pesky "you don't have a free gig of RAM? you're kinda fucked now" problem with my code. That was sort of already a problem with the current code (without that gig of RAM, stuff gets swapped out of existence and into a crawling user experience anyway), but I'll admit it's a problem with this code too. It can be fixed by reducing the chunk size. That reduces the total size of what can be backed up, of course (1024*1024*1024*256, or whatever value, instead of 999), but then again that limitation was an outstanding problem in qvm-backup. Also, smaller chunks mean targeted tampering with data has less of an impact on uncompressed backups.)

marmarek commented 8 years ago

But there is no reason I, the end user, should have to shadow /var/tmp on the whole system with a tmpfs just so I can run a backup. This should work right out of the box. So let's not force hacks on users.

Yes, of course. I've written "/tmp is mounted as tmpfs by default, so simply an option to specify temporary directory would do the job. What do you think @Rudd-O ?" Maybe it should even have some automatic detection, based on available RAM?
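For illustration, that detection could be as simple as this sketch (the function name, threshold handling, and paths are placeholders, not a concrete proposal):

def pick_staging_dir(needed_bytes):
    """Use tmpfs-backed /tmp when enough RAM is available, else fall back
    to /var/tmp on disk.  MemAvailable comes from /proc/meminfo (in kB)."""
    with open("/proc/meminfo") as f:
        meminfo = dict(line.split()[:2] for line in f)
    available = int(meminfo["MemAvailable:"]) * 1024
    return "/tmp" if available > needed_bytes else "/var/tmp"

# e.g. pick_staging_dir(10 * 100 * 1024 * 1024) for ten 100 MB chunks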

deals with parallelism (mostly) correctly.

I don't see how the order of files in the output archive would be preserved - probably it isn't. But that would be a minor change - simply plugging a Queue in somewhere there (instead of the lock in SerializedTarWriter?)

Rudd-O commented 8 years ago

Yes, correct, you could just plug a queue, which would of course make the whole write block on the longest write and you'd lose some performance that way. But the question is why. The order doesn't matter, does it? If it does, I would like to know why and, of course, then it's trivial to revise the code for that purpose.

A meh workaround would be to specify the tempdir in the current script, or to make it honor TMPDIR. That might already be the case. I did not try that.

marmarek commented 8 years ago

Yes, correct, you could just plug a queue, which would of course make the whole write block on the longest write and you'd lose some performance that way. But the question is why. The order doesn't matter, does it? If it does, I would like to know why and, of course, then it's trivial to revise the code for that purpose.

Generally, to ease the restore process. If the hmac file comes just after its data file, you can verify the data file immediately after receiving it, and abort if it's invalid or missing. Otherwise it would be somewhat trickier to make sure that all the files were verified (and harder to audit that code). This way you can also extract data files as they come - in parallel with retrieving the backup stream - because you can assume that data files are in order (and abort if they aren't).
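As a rough sketch of that verify-as-you-go restore loop (check_hmac and the error handling here are placeholders, not the actual restore code):

import tarfile

def verify_in_order(backup_stream, check_hmac):
    """Walk the outer archive sequentially, requiring every chunk's .hmac
    to follow the chunk immediately, and abort on anything unexpected.

    check_hmac(name, data, hmac_blob) stands in for the real verification;
    mode "r|" reads the tar as a stream, the way a restore receives it.
    """
    with tarfile.open(fileobj=backup_stream, mode="r|") as outer:
        pending = None                         # (name, data) awaiting its .hmac
        for member in outer:
            f = outer.extractfile(member)
            blob = f.read() if f else b""
            if pending is None:
                pending = (member.name, blob)
            elif member.name == pending[0] + ".hmac":
                check_hmac(pending[0], pending[1], blob)
                pending = None
            else:
                raise RuntimeError("backup stream out of order at " + member.name)
        if pending is not None:
            raise RuntimeError("missing hmac for " + pending[0])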

A meh workaround would be to specify the tempdir in the current script, or to make it honor TMPDIR. That might already be the case. I did not try that.

No, it doesn't honor TMPDIR (https://github.com/QubesOS/qubes-core-admin/blob/master/core/backup.py#L513)...

Anyway take a look at this: https://github.com/marmarek/qubes-core-admin/pull/10

On a 3.5 GB sample VM (SSD disk, no compression, no encryption) it makes a difference, but not that big: 2:53 vs 3:19.

Rudd-O commented 8 years ago

Ah. In that case ordering would work fine.

So, any chance that my proto code could be adapted to qubes-backup?

marmarek commented 8 years ago

So, any chance that my proto code could be adapted to qubes-backup?

Maybe as part of work on #971 (together with #1213), at least parts of it. Some concerns about your approach:

Anyway, it's surely not going to happen for the final R3.1, but I hope that's obvious.

Rudd-O commented 8 years ago
  1. The tar format can be chosen to be ustar or posix. It's in the library (see the sketch after this list).
  2. Handling files concurrently when the read sizes are 100 MB (and they are) is no harder on rotating disks than on SSDs. It would be a different story if the reads were of many tiny files that make the heads seek like crazy. Reading several files at the same time also has the virtue that the I/O elevator can work more efficiently and read more bytes per second, increasing throughput. If this decreased the performance of other tasks on the machine, then what's likely needed is to set the disk priority of the backup process to idle or batch.
  3. Writing temporary files that get deleted doesn't make the code easier to debug; backtraces, debuggers and tests do. Writing temporary files during the backup also doesn't help with restoring data (though perhaps during the restore it might).
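To illustrate points 1 and 2 with a rough sketch (the file name and the exact ionice settings are arbitrary examples):

import tarfile

# Point 1: the archive format is just an argument to tarfile.open() --
# USTAR_FORMAT, GNU_FORMAT or PAX_FORMAT (POSIX.1-2001).
outer = tarfile.open("backup.tar", mode="w|", format=tarfile.USTAR_FORMAT)

# Point 2: the backup's disk priority can be dropped from the shell, e.g.
#   ionice -c3 qvm-backup ...        (idle class)
#   ionice -c2 -n7 qvm-backup ...    (lowest best-effort priority)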

I understand it can't be done for 3.1 tho. No worries.

The general issue here is not just to make the backup faster, but to save the machine from doing unnecessary work. Writing to /tmp saves work but -- because a copy must be written and then read, adding more context switches -- not as much as loading the data into buffers and writing it directly.