Closed andrewdavidwong closed 7 years ago
Just to summarize everything so far, it sounds like there are three main options on the table:
Pros:
Cons:
enc
function (poorly-written, little-used). If, by contrast, the HMAC-related functions are better-written and more widely-used, then this may not be a problem.
Pros:
Cons:
-md sha256
to openssl enc
Pros:
Cons:
sha512(passphrase)
to openssl dgst
.enc
isn't totally neglected.
(As I mentioned above, when it comes to something like backup encryption, I'd prefer to see Qubes stay on the conservative side, so, FWIW, I'm leaning toward option 1. Option 3 might be better for now, if it can be done immediately at little cost yet provide a Pareto improvement over the current system.)
- Still relies on OpenSSL
- The problems with OpenSSL are with the
enc
function (poorly-written, little-used). If, by contrast, the HMAC-related functions are better-written and more widely-used, then this may not be a problem.
If I understand correctly, using openssl for verification still requires some KDF...
One more scrypt cons:
Best Regards, Marek Marczykowski-Górecki Invisible Things Lab A: Because it messes up the order in which people normally read text. Q: Why is top-posting such a bad thing?
If I understand correctly, using openssl for verification still requires some KDF...
I think that depends on what we consider the "requirements" to be. Currently, we use OpenSSL for verification without any kind of KDF. We just feed the user's passphrase directly to openssl dgst -hmac
.
One option would be to keep doing this. Since GPG applies its own KDF (S2K) to the passphrase, the resultant verification and encryption keys would be significantly different. However, the passphrases fed to openssl
and gpg
would still be the same, and that's undesirable. (E.g., exploiting a flaw in OpenSSL's implementation of HMAC-SHA512 which allows the attacker to recover the passphrase would allow her to then immediately decrypt the GPG-encrypted data. Unlikely, perhaps, but easily avoided simply by using two different passphrases.)
OpenSSL is really just there to protect GPG from unverified input. So, if all we really want is for the two passphrases to be different, a simple solution could be something like: feed sha512(passphrase)
to openssl
and feed passphrase
to GPG (on which GPG will then apply its own KDF).
IMHO, our goal isn't to apply our own key-stretching (we trust the user to supply a reasonable passphrase). Rather, it's just to avoid "key-shortening" (which the openssl enc
's single round of md5 currently does in many cases) and to avoid the "same passphrase" problem mentioned above.
Either way, it would still be better than what we do right now. Plus, a solution like this ensures that disaster recovery remains possible without yet another external tool (i.e., the user doesn't need a copy of exotic-kdf-tool
, just the ability to compute sha512(their_passphrase)
).
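For illustration, a minimal Python sketch of the "feed sha512(passphrase) to openssl" idea. It assumes the hex form of the digest is what would be passed to `openssl dgst -hmac` (the thread does not say hex versus raw bytes), and `verification_passphrase` and `header_hmac` are hypothetical helper names, not anything in the Qubes code:

```python
import hashlib
import hmac


def verification_passphrase(passphrase: str) -> str:
    """Return sha512(passphrase) as a hex string (assumed encoding)."""
    return hashlib.sha512(passphrase.encode("utf-8")).hexdigest()


def header_hmac(passphrase: str, header: bytes) -> str:
    """HMAC-SHA512 over the header, keyed with the derived string.

    Assumed to correspond to:
        openssl dgst -sha512 -hmac "<derived>" backup-header
    """
    derived = verification_passphrase(passphrase)
    return hmac.new(derived.encode("ascii"), header, hashlib.sha512).hexdigest()
```

With this, GPG would still receive the raw passphrase, while OpenSSL only ever sees the derived string. For disaster recovery the user only needs some way to compute SHA-512 of their passphrase.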
It may not be "ideal," but it seems like it would fix the main problems we currently have, which are: (1) the issues with openssl enc
(entropy-reducing "KDF," code neglect) and (2) using the same passphrase for verification and encryption. I worry that any more "ideal" solution would require much more code and complexity (and therefore not really be ideal).
One more scrypt cons:
- it isn't easy for scripted usage - reads password from /dev/tty. I can workaround this, but it will not be nice code...
Added.
OpenSSL changes between 1.0.2g and 1.1.0:
*) Changed default digest for the dgst and enc commands from MD5 to
sha256
[Rich Salz]
This reminds me that there's a third option: simply start passing -md sha256
. I'll edit this back into the previous post, along with pros and cons.
I think that depends on what we consider the "requirements" to be. Currently, we use OpenSSL for verification without any kind of KDF. We just feed the user's passphrase directly to
openssl dgst -hmac
.
Isn't that exactly what we're trying to solve in this ticket? Weak/no KDF usage for (any of) authentication or encryption. If I understand correctly, this makes attack easier because someone may launch attack on password by just attacking hmac, then use guessed password for decryption. So if any of those will be weak (cheap to launch dictionary/bruteforce attack), it will help with attacking the other part, even if decent KDF is used in that other part.
As for -md sha256
idea - we already pass -sha512
to openssl dgst
. But not openssl enc
- here it will indeed somehow improve the situation. At least it will not reduce passphrase to 128 bits.
Also, indeed one idea may be using SHA512(passphrase) for both operations. Or even SHA512(passphrase + "hmac") and SHA512(passphrase + "enc"). This will produce different keys. This looks like a solution for the original problem, but as I'm not a cryptographer, I don't know if it's generally a good idea... A similar idea was raised in https://github.com/QubesOS/qubes-issues/issues/971#issuecomment-151125927
Isn't that exactly what we're trying to solve in this ticket? Weak/no KDF usage for (any of) authentication or encryption.
Well, it's one of the issues. The current three problems are:
- md5(passphrase) is capping entropy at 128 bits.
- Same passphrase is being fed to dgst and enc.
- openssl enc seems shoddy (admittedly this one is debatable).
If I understand correctly, this makes attack easier because someone may launch attack on password by just attacking hmac, then use guessed password for decryption. So if any of those will be weak (cheap to launch dictionary/bruteforce attack), it will help with attacking the other part, even if decent KDF is used in that other part.
Yes, this is almost exactly the same situation I mentioned above.
But, again, there is a big middle-ground between (a) passing the same raw passphrase to dgst
and enc
, and (b) adding some full-blown KDF to the process. One example of such a middle-ground solution is using, e.g., sha512(passphrase)
(discussed below).
As for -md sha256 idea - we already pass -sha512 to openssl dgst. But not openssl enc - here it will indeed somehow improve the situation. At least it will not reduce passphrase to 128 bits.
Yes, and it seems like there's no reason not to do this immediately. After all, it will happen by default if/when we upgrade to OpenSSL 1.1.0. Might as well start now and reap some benefit in the meantime.
Also, indeed one idea may be using SHA512(passphrase) for both operations. Or even SHA512(passphrase + "hmac") and SHA512(passphrase + "enc"). This will produce different keys. This looks like a solution for the original problem, but as I'm not a cryptographer, I don't know if it's generally a good idea... A similar idea was raised in #971 (comment)
It may not be optimal from a cryptography standpoint, but surely it is better than what we do now (passing the same bare passphrase to both dgst
and enc
). If it would be trivial to implement, then it seems like we have nothing to lose and a fair amount to gain. Why not pick the low-hanging fruit?
Any reason not to just use a 3rd party OSS tool for this like duplicity? Alternatively, I believe rsync and LUKS containers would fit the bill as well.
If I may for a moment explain -- I do use qvm-backup but I am unhappy with it, because there is no support for encrypted incremental backups. This means I have to back up close to a gig of stuff EVERY TIME. This is slow, it prevents me from using the machine being backed up, and it's so fucking tedious, I rarely do it.
We need a better solution that will (a) support running the backup concurrently with the underlying VMs running (b) support scripting and sending off to external storage beyond just USB drives (c) support incremental backup in the right way, so that backups can be finished faster. I believe what this means is embracing some sort of WAFL / COW file system, and backing it up by replicating (send/recv) to a remote, encrypted disk or disk image. qvm-backup just doesn't cut the mustard.
I agree that the requirement to shut down all the VMs for the backup is inconvenient, at least...
(c) support incremental backup in the right way, so that backups can be finished faster.
There was a discussion about it linked to #858
(a) support running the backup concurrently with the underlying VMs running
This is also considered as part of #858
(b) support scripting and sending off to external storage beyond just USB drives
This is already possible - you can enter some command instead of directory path and the backup will be streamed to its stdin. For example ssh somehost dd of=/path/on/remote/machine
.
It may not be optimal from a cryptography standpoint, but surely it is better than what we do now (passing the same bare passphrase to both dgst and enc). If it would be trivial to implement, then it seems like we have nothing to lose and a fair amount to gain. Why not pick the low-hanging fruit?
One reason: to not change the backup format too frequently (compatibility, number of combinations to test). There will be new backup format (version += 1) in Qubes 4.0, because qubes.xml format is changed. So we can bundle this change with fixing problem discussed in this ticket.
For now I'll ignore the first option from your list (gpg+openssl), as it looks to only introduce complexity while the same gains can be achieved by much simpler option "3" (openssl enc -md sha512
+ sha512(passphrase + 'hmac')
for openssl dgst
).
So, we have two options:
scrypt
(option "2") - probably the best of those considered from cryptography POV, but with some practical drawbacks as described in https://github.com/QubesOS/qubes-issues/issues/971#issuecomment-197790115
The question is, whether option "3" is good enough? It is surely better than the current implementation...
@Rudd-O: Thank you for sharing your experience! I agree with you and @marmarek about the inconvenience of the current system. I also think that an incremental backup system is desirable, but there would still be a need for the ability to create immediate full backups (e.g., for system migration).
One reason: to not change the backup format too frequently (compatibility, number of combinations to test). There will be new backup format (version += 1) in Qubes 4.0, because qubes.xml format is changed. So we can bundle this change with fixing problem discussed in this ticket.
Ah, that's a good point. Changing the format too frequently would be bad. Ok, so bundling with 4.0 sounds good.
The question is, whether option "3" is good enough? It is surely better than the current implementation...
IMHO, we need to be careful not to let the perfect be the enemy of the good. Option 3 makes the current system better without making anything worse. So, if option 3 is not good enough, then the current system falls even shorter of being good enough. In light of this, I can only think of a few reasons not to go with option 3 (but maybe there are more):
All can be legitimate reasons, of course, but IIUC, the development time would be pretty minimal, and we don't expect a better solution to fall into our laps anytime soon. The last two reasons are likely to be related. (If we thought a better solution was about to fall into our laps, we'd want to hold off so that we don't have to change the format again so quickly afterward.) But this attitude can also be taken to an extreme. It can paralyze us with the fear that something better is always over the horizon, so we never make the easy improvements that we can make.
Ok, I thought of one more potential concern about option 3:
Even though option 3 seems simple, nothing in cryptography is ever really simple. Our proposal is to compute sha512(passphrase + 'hmac')
and feed it to openssl dgst
as a passphrase. There are various problems associated with using a short, static salt (e.g., the birthday attack) and using a standard digest algorithm to hash passwords (instead of a true KDF). Many of these problems are more relevant to protecting online databases than to file encryption at rest, but the point is that we don't know enough about cryptography to know whether a seemingly innocuous move (like adding 'hmac'
as a salt) could somehow decrease security. For example, there are even tricky problems that arise with how you concatenate strings.
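To make the concatenation point concrete, here is a small Python sketch. It is only an illustration of the general class of problem, not a claim that the proposed scheme is exploitable this way; the function names and the length-prefix encoding are assumptions chosen for the example:

```python
import hashlib


def naive(passphrase: bytes, label: bytes) -> str:
    # Plain concatenation: the boundary between the two inputs is lost.
    return hashlib.sha512(passphrase + label).hexdigest()


def length_prefixed(passphrase: bytes, label: bytes) -> str:
    # Prefix each field with its 8-byte big-endian length so that
    # different (passphrase, label) pairs never encode to the same bytes.
    data = b"".join(len(p).to_bytes(8, "big") + p for p in (passphrase, label))
    return hashlib.sha512(data).hexdigest()


# Two different (passphrase, label) pairs that concatenate to the same bytes.
collide_naive = naive(b"secret", b"hmac") == naive(b"secreth", b"mac")
collide_prefixed = length_prefixed(b"secret", b"hmac") == length_prefixed(b"secreth", b"mac")
```

Here `collide_naive` is true and `collide_prefixed` is false: with plain concatenation the hash cannot tell where the passphrase ends and the label begins.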
In other words, maybe even the simplest option (3) requires more crypto expertise than we have available to us.
The standard advice is to use an actual KDF rather than trying to roll your own. Our problem is that we need something commonly available that can be used from the command-line, and apparently nothing like that is available.
Except that's not entirely true. OpenSSL includes the command passwd
. From the man page:
The passwd command computes the hash of a password typed at run-time or the hash of each password in a list. The password list is taken from the named file for option -in file, from stdin for option -stdin, or from the command line, or from the terminal otherwise. The Unix standard algorithm crypt and the MD5-based BSD password algorithm 1 and its Apache variant apr1 are available.
The problem is that it uses only very old and weak algorithms (md5 and crypt), which would defeat the purpose of passing -md sha256
to enc
, since the initial step would cap user passphrase entropy at 128 bits (and use a weak set of algorithms, to boot).
openssl passwd by default uses a random salt, so it isn't good for a KDF. There is an option for using a static salt (provided on the command line), but then again - we'd need somehow to carry this salt (in the backup header?). This is the same problem which is solved internally by scrypt (it uses its own header in encrypted files), or in the case of just the scrypt KDF (connected with openssl enc) - by storing its parameters in the backup header.
@ag4ve:
Any reason not to just use a 3rd party OSS tool for this like duplicity? Alternatively, I believe rsync and LUKS containers would fit the bill as well.
IIUC, the main problems with those options are:
In other words, maybe even the simplest option (3) requires more crypto expertise than we have available to us.
Clarification: This applies only to the passphrase hashing part. The part about passing -md sha256
to enc
should still be done, if nothing else.
Hey! I use Qubes backups heavily (I'm restoring from a backup as I type this), so I'm willing to provide some free consulting on this issue. I just skimmed this thread, but I understand:
openssl enc
's password-to-key KDF.
The main concern I see being brought up is OpenSSL's use of MD5 and that this might cap security at 128 bits. This isn't the right thing to be worrying about. I'm looking at the command line we use for encryption and the source code for openssl enc
which calls EVP_BytesToKey
. According to the algorithm described in this documentation, no entropy would actually be lost assuming MD5 is secure. I am fairly confident that the known weaknesses in MD5 do not significantly decrease security here either. In other words, option 3 of adding -md sha256
to the openssl enc
command gains us negligible security benefits. In either case, a good 256-bit passphrase is still 256 bits of security, not 128. (But using -md sha256
is a good thing to do anyway, just in case.)
I'm concerned about the following things:
openssl enc
should spit out a salt unless you pass the nosalt
option, and then that salt should be provided upon decryption. Could you point me to the code that does this? This is probably the most important thing Qubes needs to be doing. (Edit: Actually I may be misreading the OpenSSL code.)
I strongly recommend switching to something with decent key stretching (e.g. at least PBKDF2, or whatever the thing in GnuPG is). I'm willing to provide advice on this. Could someone point me to where the relevant code is (for creating a backup, and then restoring it)? Thanks!
To answer my own concern (1) above, the passphrase effectively isn't protected by a salt. According to these instructions for manual restoration of the backup, there's a file called backup-header
whose contents are the same for all users having the same Qubes version. Inside a backup-header.hmac
file there's the result of running openssl dgst -sha512 -hmac "your_passphrase" backup-header
.
This enables precomputation attacks: An attacker could take the contents of backup-header
for the most common Qubes version and build up a database (or rainbow tables) of what the backup-header.hmac
contents would be under different passphrases. They could then use that database to quickly map the victim's backup-header.hmac
contents back to their password, if it's in the database (or rainbow table).
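A rough Python sketch of why the missing salt matters. The header contents and candidate list below are made up for the example (the real backup-header differs), and the HMAC call is assumed to match `openssl dgst -sha512 -hmac "<passphrase>" backup-header`:

```python
import hashlib
import hmac

# Hypothetical header contents, identical for every user of a given version.
HEADER = b"version=3\nhmac-algorithm=SHA512\n"


def header_hmac(passphrase: str, header: bytes = HEADER) -> str:
    # Unsalted: the only secret input is the passphrase itself.
    return hmac.new(passphrase.encode("utf-8"), header, hashlib.sha512).hexdigest()


# The attacker precomputes the table once, before seeing any victim...
candidates = ["123456", "password", "correct horse battery staple"]
table = {header_hmac(p): p for p in candidates}

# ...and then maps any victim's backup-header.hmac back with a lookup.
victim_hmac = header_hmac("password")
recovered = table.get(victim_hmac)
```

Because nothing per-backup goes into the computation, the same table works against every backup made with that Qubes version. A random per-backup salt would force the attacker to redo the work for each target.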
Edit: use better wording
Strong +1 to option 2 of using scrypt. Speaking as a user, I'm very willing to download and build the scrypt tarball to get the increased security. It's even in the debian repositories.
I use Qubes backups heavily (I'm restoring from a backup as I type this), so I'm willing to provide some free consulting on this issue.
Thank you!
The main concern I see being brought up is OpenSSL's use of MD5 and that this might cap security at 128 bits. This isn't the right thing to be worrying about. [...] In either case, a good 256-bit passphrase is still 256 bits of security, not 128.
Can you clarify that? Suppose my passphrase is a truly random 256-bit string. openssl enc
takes md5(passphrase)
, resulting in a 128-bit string. How have I not lost entropy here?
Here's the backup code: https://github.com/QubesOS/qubes-core-admin/blob/master/core/backup.py
How have I not lost entropy here?
See:
This isn't the right thing to be worrying about.
How have I not lost entropy here?
See:
This isn't the right thing to be worrying about.
These are two logically distinct claims:
1) md5(passphrase) doesn't cap entropy at 128 bits.
2) md5(passphrase) caps entropy at 128 bits, but that's not the right thing to be worrying about.
@andrewdavidwong: I'm making claim number (2). md5(passphrase)
certainly caps entropy at 128 bits. openssl enc
is doing something different. What openssl enc
uses as the key is...
md5(passphrase || salt) || md5(md5(passphrase || salt) || passphrase || salt)
...according to the fact it uses EVP_BytesToKey
and these EVP_BytesToKey
docs. (The source code is here but it's pretty terrible to read.) There are probably other problems with this KDF, but it's not obvious that entropy is capped at 128 bits, since the passphrase is used once to make the first 128-bit half and then again in a different way to make the second 128-bit half. The first and second halves are both capped at 128 bits but the end result probably isn't.
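The derivation described above can be sketched in Python as follows. This is a re-implementation for reading purposes, assuming an iteration count of 1 and MD5 as the digest; `evp_bytes_to_key` is a name chosen here, not an OpenSSL binding:

```python
import hashlib
from typing import Tuple


def evp_bytes_to_key(passphrase: bytes, salt: bytes,
                     key_len: int = 32, iv_len: int = 16,
                     digest: str = "md5") -> Tuple[bytes, bytes]:
    """Derive (key, iv) by chaining digests, one round per block.

    D_1 = H(passphrase || salt)
    D_i = H(D_{i-1} || passphrase || salt)
    The key and IV are taken from D_1 || D_2 || ...
    """
    derived = b""
    block = b""
    while len(derived) < key_len + iv_len:
        block = hashlib.new(digest, block + passphrase + salt).digest()
        derived += block
    return derived[:key_len], derived[key_len:key_len + iv_len]
```

For a 256-bit key with MD5, the key is the first two 16-byte blocks, which matches the formula quoted above. Note that each block is a single unstretched hash, so guessing passphrases stays cheap regardless of which digest is used.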
Ok, that helps clarify things for me. Thanks!
Hi all. I was also pointed at this thread. I run NCC Group's Cryptography Services practice. (And actually see a reference to NCC earlier - but we never got connected apparently. That's a shame, I'd have been happy to help.)
@defuse is pretty smart, and his statements all seem reasonable to me, but I might reinforce/add to them:
1) Why are you using CBC+HMAC separately? Why not use GCM mode? Even if there's a reason to have the authenticity check separate, I would recommend encrypting with GCM mode anyway to avoid some sort of time-of-check-time-of-use bitflipping attack.
2) One passphrase is used to feed into both the HMAC and the encryption? If the key derivation steps are the same for both, you'd be using the same key for both. That's not ideal. The correct way to derive separate keys from identical input entropy is with HKDF (https://tools.ietf.org/html/rfc5869) which, haha, is not trivial to implement. At the bare minimum, one should do something like HMAC(entropy, "encryption key") and HMAC(entropy, "hmac key") and use the output of those as the respective keys.
That said - this type of academic analysis (why exactly HKDF is good, and HMAC(entropy, "encryption key") or SHA(entropy | "encryption key") are bad) has never been my strong suit. I don't believe there's practical attacks here but... well that's why we go with the safe stuff in crypto.
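The bare-minimum approach from point 2 might look like this in Python. `master` stands for whatever secret comes out of the passphrase-processing step (ideally a real KDF; the SHA-512 call below is only a placeholder), the label strings are the ones suggested above, and `subkey` is a hypothetical helper. This is the simple labeled-HMAC construction, not HKDF:

```python
import hashlib
import hmac


def subkey(master: bytes, label: bytes) -> bytes:
    # HMAC keyed with the master secret, with the purpose label as the message.
    return hmac.new(master, label, hashlib.sha512).digest()


# Placeholder master secret for the example only; not a recommended KDF.
master = hashlib.sha512(b"example passphrase").digest()

enc_key = subkey(master, b"encryption key")
mac_key = subkey(master, b"hmac key")
```

The two outputs are unrelated-looking 64-byte values derived from the same input, so the encryption step and the HMAC step never share a key.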
3) I agree that Key Stretching, and the use of a unique salt when using a passphrase, is crucial for security. Salt to avoid precomputation and stretching to make brute forcing much more painful. PGP's stretching algorithm (S2K Iterations) is an odd standard, but it's a standard. It's about on par with PBKDF2 in terms of security. Scrypt is better - you get (some) memory hardness. I'm not familiar enough with scrypt to recommend parameters though, but I bet you could find someone to help you tweak things to your needs. Maybe a judge or participant in https://password-hashing.net/
Omitting key stretching removes the safety net for users. With key stretching, a user with an 'okay' passphrase might be safe; without it, there's a much higher chance of getting cracked. So perhaps you decide that operating without a safety net is okay for the time being, and you come back to it later - but I wouldn't choose to forgo it forever.
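As a rough illustration of salt plus stretching, here is a Python sketch using PBKDF2 from the standard library (scrypt would be the stronger choice per the comment above; PBKDF2 is used here only because it is always available in `hashlib`). The iteration count, salt size and the `stretch` helper name are assumptions for the example, not recommended parameters:

```python
import hashlib
import os
from typing import Optional, Tuple


def stretch(passphrase: str, salt: Optional[bytes] = None,
            iterations: int = 200_000) -> Tuple[bytes, bytes]:
    """Return (salt, key). A fresh random salt is generated when none is given."""
    if salt is None:
        salt = os.urandom(16)  # must be stored alongside the backup
    key = hashlib.pbkdf2_hmac("sha512", passphrase.encode("utf-8"),
                              salt, iterations, dklen=32)
    return salt, key
```

On backup, call `stretch(passphrase)` and store the returned salt (and iteration count) in the clear next to the data. On restore, pass the stored salt back in to get the same key. The salt defeats precomputed tables and the iteration count multiplies the cost of every guess.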
Thanks, @tomrittervg and @defuse! We fully agree that the system is flawed in the ways you two have helpfully pointed out. Our current problem is that we don't have anyone with the relevant expertise who's willing to help us implement a better system. Is that something that either of you, or anyone you know, would be willing to do?
I'm super busy at the moment, so I can't commit to being able to do anything right now. Using Colin Percival's scrypt tool should be pretty straightforward and hard to mess up, and I'd be happy to review any implementation that comes into existence.
Using Colin Percival's scrypt tool should be pretty straightforward and hard to mess up, and I'd be happy to review any implementation that comes into existence.
@marmarek, what do you think?
Last time I checked, it wasn't easy to provide the passphrase from anything but a terminal (it was reading it from /dev/tty
). So not trivial to integrate. But probably doable.
Duplicity does not need to have network access to work. It just needs to have a backend specific to inter-VM backup:
SRC RPM for tarsnap (containing scrypt) https://koji.rpmfusion.org/koji/buildinfo?buildID=688
SRC RPM for tarsnap (containing scrypt) https://koji.rpmfusion.org/koji/buildinfo?buildID=688
There is also package for fc23, tagged with "f23-nonfree". I wonder what this means in practice... Will it be in some separate repository?
Nonfree means things built outside Amerika because of intellectual monopoly bullshit.
To the extent that I know, that may just be a miscategorization, as tarsnap is bona fide open source which you can download from the tarsnap site.
If going with scrypt
(which looks like the best option according to above comments), I see it:
.hmac
files, as scrypt
handle this already
backup-header.hmac
will make slightly harder to handle both old and new backups with the same tool, but this is still doable.
This change makes security of scrypt
tool very critical to security of Qubes backups. Both in terms of used encryption, and correctly handling (potentially malicious) data during decryption.
At the same time we can simplify backup format even more: get rid of inner tar layer. Currently it serves two purposes:
pigz
instead of just gzip
it isn't that bad (but of course still much slower than simply not storing it at all)
scrypt
for given file) by expected file name (and some separator, like \x01
). scrypt
authenticate file content using this password, so swapping files should be mitigated here.
Attack not mitigated by any of those, is replacing whole VMs with the same VM from older/newer backup (created with the same passphrase). This is not much different than replacing the whole backup archive. Can be mitigated by user by using different passphrases for different backups (like append a number, or a date).
Another reason to drop inner tar layer - it is no longer effective if source isn't sparse file, but LVM thin volume (which will be the case in Qubes 4.0). Actually I haven't found out yet how to create tar archive from block device content, without dumping it to a file first (using python tarfile module isn't any better, as it doesn't support sparse files).
Some benchmark about tar/gzip/pigz:
[user@testvm ~]$ truncate -s 1G sparse.file
[user@testvm ~]$ time tar cS sparse.file |wc -c
10240
real 0m0.041s
user 0m0.002s
sys 0m0.030s
[user@testvm ~]$ time gzip < sparse.file |wc -c
1042069
real 0m15.087s
user 0m14.211s
sys 0m0.881s
[user@testvm ~]$ time pigz < sparse.file |wc -c
1171473
real 0m9.444s
user 0m31.015s
sys 0m2.699s
Attack not mitigated by any of those, is replacing whole VMs with the same VM from older/newer backup (created with the same passphrase). This is not much different than replacing the whole backup archive. Can be mitigated by user by using different passphrases for different backups (like append a number, or a date).
That's a good point (added to documentation).
IMHO, neither variety should be considered a critical attack, since in any case a backup authenticated by the user's passphrase is trusted insofar as the user has chosen to create the backup using that passphrase.
One thing to note is that it's possible to "DoS" someone who uses a different passphrase for each backup (e.g., date appended) by changing around the file names of their backups. This is a different kind of DoS from simply deleting all of their backups, since the victim won't realize what's happened unless/until they try to restore from one of the backups.
Some benchmark about tar/gzip/pigz: [...]
How should these results be read? For example, what's the difference between "real" and "user"?
One thing to note is that it's possible to "DoS" someone who uses a different passphrase for each backup (e.g., date appended) by changing around the file names of their backups. This is a different kind of DoS from simply deleting all of their backups, since the victim won't realize what's happened unless/until they try to restore from one of the backups.
Mostly the same as filling the backup file with junk data...
Some benchmark about tar/gzip/pigz: [...]
How should these results be read? For example, what's the difference between "real" and "user"?
"real" is actual time spent on the operation (end_time-start_time), "user" is CPU time used (including multiple cores etc - 1s of fully utilizing 4 cores is 4s "user" time and 1s "real" time). "sys" is time spent on system calls (kernel code).
@marmarek the gains from sparse file storage aren't so much on the read (backup) side (though, unlike your short benchmark, they are quite big if you have rotational disks like I do). They are mostly on the write side. When your system has to write an enormous file to disk, and allocate those zeroes that should be unallocated, you end up spending a MONSTROUS amount of disk space and disk activity just to store those zeroes. On a system like Qubes, which relies on thin provisioned storage, getting rid of sparse file storage is a bad idea.
Maybe a purpose-built, short C or Go program, that reads from device files and writes tar format to its output, is the right thing to use here. It avoids using tar directly, it can detect rows of zeroes and output them as sparse blocks, and it isn't needed during restore (as you can use tar directly in that case). Those are my thoughts. What do you think?
@marmarek the gains from sparse file storage aren't so much on the read (backup) side (though, unlike your short benchmark, they are quite big if you have rotational disks like I do). They are mostly on the write side. When your system has to write an enormous file to disk, and allocate those zeroes that should be unallocated, you end up spending a MONSTROUS amount of disk space and disk activity just to store those zeroes. On a system like Qubes, which relies on thin provisioned storage, getting rid of sparse file storage is a bad idea.
This isn't a problem. dd conv=sparse
in the middle does the trick.
Maybe a purpose-built, short C or Go program, that reads from device files and writes tar format to its output, is the right thing to use here. It avoids using tar directly, it can detect rows of zeroes and output them as sparse blocks, and it isn't needed during restore (as you can use tar directly in that case). Those are my thoughts. What do you think?
This is what I'm currently exploring, as I've failed to find any other method (tried many tar implementations, other archive formats etc).
Maybe take a look at Go for that custom program. It's batteries-built-in, it's very efficient, it's a safe language. It's got what you need.
Look at what I wrote in it during the past few days: https://github.com/Rudd-O/curvetls .
Come to think of it, the same crypto primitives I am using in the program above (Go's implementation of NaCL secretbox) can be used to seal disk image contents in tamper-resistant containers. You really should check it out — it doesn't have the cache / timing leaks that AES has, it's Salsa and Poly, very good stuff that has a number of implementations and is not known to be weak.
Presumably, the key you pass to secretbox.Seal
would be the output of scrypt's hash function.
Nice, eh?
I'm actually writing two demo programs to explain what I mean. Super simple, for you to read. Gimme 15 more minutes.
There you go: brain dead simple:
https://github.com/Rudd-O/simple-nacl-crypto
The only remaining thing to do, is to write io.Reader and io.Writer that will "packetize" rows of zeroes (as sparse files are wont to contain) and package that data into the secret boxes. It's fairly easy to do, and the Go implementation of files allows seeking, thus it allows constructing sparse files on disk.
Great news:
Though I have to go to sleep and I still need to put a few finishing touches on it, the encryptor and decryptor programs have evolved to pack zeroes (512-byte blocks to be accurate) in a run-length format (that should not be vulnerable to malicious manipulation, because verifiable encryption goes around it).
A file 1 GB in size reduces itself to about 21 bytes. And it's those 21 bytes that get encrypted. No need to do gzip or anything of the sort. Of course, this packing format can in principle be piped to gzip for compression as well.
Naturally, the decoder will use disk seeks to skip writing zeroes as it decodes. This will give us sparse files on decryption for free.
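To make the packing idea concrete, here is a small Python sketch of a run-length scheme for 512-byte zero blocks, with a decoder that seeks instead of writing zeroes. The record layout (a tag byte `Z` plus a 64-bit run count, or `D` plus a 32-bit length and the data) is invented for this example and is not the format used by the demo program, so the byte counts differ from the figure quoted above. In a real design the packed stream would be authenticated before any record is parsed.

```python
import io
import os
import struct

BLOCK = 512  # zero-detection granularity, matching the 512-byte blocks above


def pack(src, dst):
    """Encode src as records: b'Z' + u64 run of zero blocks, or b'D' + u32 + data."""
    zero = bytes(BLOCK)
    run = 0
    while True:
        chunk = src.read(BLOCK)
        if len(chunk) == BLOCK and chunk == zero:
            run += 1  # extend the current run of zero blocks
            continue
        if run:
            dst.write(b"Z" + struct.pack(">Q", run))
            run = 0
        if not chunk:
            break
        dst.write(b"D" + struct.pack(">I", len(chunk)) + chunk)


def unpack(src, dst):
    """Decode records into dst, seeking over zero runs so they become holes."""
    size = 0
    while True:
        tag = src.read(1)
        if not tag:
            break
        if tag == b"Z":
            (run,) = struct.unpack(">Q", src.read(8))
            dst.seek(run * BLOCK, os.SEEK_CUR)
            size += run * BLOCK
        elif tag == b"D":
            (n,) = struct.unpack(">I", src.read(4))
            dst.write(src.read(n))
            size += n
        else:
            raise ValueError("unknown record tag")
    dst.truncate(size)  # preserve a trailing run of zeroes
```

With this hypothetical layout, a file consisting only of zero blocks packs to a single 9-byte record regardless of its size.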
I will finish the decoder for this packing format later today. Right now I must sleep.
@Rudd-O please don't. Writing our own program to handle crypto is the last thing we want to do, somewhere near "inventing our own crypto algorithm". Actually this very ticket is the result of "being smart" and using openssl enc/openssl dgst directly, instead of some higher-layer application designed by an experienced cryptographer.
@marmarek I'm not "writing my own crypto". I'm merely writing a program that wraps well-tested cryptography. NaCL's secretbox is that higher layer crypto (a layer above enc + dgst, with proper authentication and handling of streams), designed by experienced cryptographers. My program only uses it.
Anyway, you should know that the point of these programs isn't to be used as full-fledged backup programs; they are meant as demos within a memory-safe language of (a) crypto box (b) sparse file handling. Much like scrypt enc and scrypt dec are meant to be demos of the scrypt hashing algorithm, and they are not meant to be fully-fledged encryption programs. I'm not going to expand them into backup programs.
@Rudd-O But you are inventing your own file format. The other problem is introducing a new language into the code base. While I see why you propose Go, we should stick to those currently used (Python, C). Otherwise maintaining and auditing it would be a nightmare (good luck finding a skilled developer fluent in Go, Python, C, OCaml, Ruby and whatnot). Please don't go off-topic here advocating why Go is better (even if technically you may be right); use the mailing list for this.
I understand. There seems to be a misunderstanding here.
I'm not saying "use this code as part of the backup for Qubes".
I'm saying that you should look at this as a demo (demo, key word, I used it repeatedly) of how:
secretbox can be used to safely store files
Anyone is 100% free to look at how the code solves these three problems, and write a Python implementation of the same concepts. That can then be used in Qubes.
Note that you are free to use the code directly, if you later change your mind.
Edit: the demo project is done. It encrypts and decrypts files, packing and unpacking sparse files. The code won't let tampered files screw with the computer — security-critical properties are validated before data is parsed. I'm happy with how it turned out. See for yourself:
[user@machine simple-nacl-crypto]$ dd if=/dev/urandom bs=30M of=original count=1 seek=2
1+0 records in
1+0 records out
31457280 bytes (31 MB, 30 MiB) copied, 2.19849 s, 14.3 MB/s
[user@ring2-projects simple-nacl-crypto]$ encbufsize=1048576 ; decbufsize=1048576 ; make && (bin/nacl-crypt -s -b $encbufsize enc original encrypted abc ; (echo ---------- ; bin/nacl-crypt -b $decbufsize dec encrypted new abc ) ; echo ; (md5sum original new ; ls -la original ; ls -la encrypted ; ls -la new ; du original ; du encrypted ; du new ))
GOPATH=/home/user/Projects/simple-nacl-crypto go install github.com/Rudd-O/simple-nacl-crypto/cmd/`echo bin/nacl-crypt | sed 's|bin/||'`
----------
ae76a32ca5b75f4c4f276a2f08750bc7 original
ae76a32ca5b75f4c4f276a2f08750bc7 new
-rw-rw-r-- 1 user user 94371840 Sep 29 02:02 original
-rw-rw-r-- 1 user user 31458320 Sep 29 02:02 encrypted
-rw-rw-r-- 1 user user 94371840 Sep 29 02:02 new
30720 original
30724 encrypted
30724 new
Here is a draft of emergency backup restore v4, which is an informal backup format specification. It uses tar for storing sparse files, but encrypted and integrity-protected using the scrypt utility.
I dislike the use of tar, to be honest. tar takes forever when reading a file that has sparse sectors, because it has to read the entire file before actually beginning to spit out the data to the calling process. A utility that was written for the purpose, which doesn't have this problem, and made available on Github for emergency restore purposes, should be much better.
I see. As for reading the entire file - bsdtar does it better (for the same file format). But it works only for sparse files, not LVM. Not sure how it works on btrfs. In fact I think it is impossible to efficiently get LVM thin volume content (without reading it all). But if possible, it can be implemented in our tool.
And also - as the tar tool can't get block device content, I've written a simple Python script for it: https://github.com/marmarek/qubes-core-admin/blob/core3-backup/qubes/tarwriter.py
Extraction (either normal, or emergency) is still handled by a standard tool. As discussed in this year+ long thread, the best compromise for encryption + integrity protection is to use the scrypt tool, I don't want to reinvent anything here.
On 10/11/2016 10:32 PM, Marek Marczykowski-Górecki wrote:
I see. As for reading entire file - |bsdtar| does it better (for the same file format). But it works only for sparse /files/, not LVM. Not sure how it works on btrfs. In fact I think it is impossible to effectively get LVM thin volume content (without reading it all). But if possible, it can be implemented in our tool. And also - as |tar| /tool/ can't get block device content, I've written simply python script for it: https://github.com/marmarek/qubes-core-admin/blob/core3-backup/qubes/tarwriter.py Extraction (either normal, or emergency) is still handled by standard tool. As discussed in this year+ long thread, the best compromise for encryption + integrity protection is to use |scrypt| tool, I don't want to reinvent anything here.
Qubes OS should also not be using LVM AT ALL. There are no data integrity guarantees with it.
If Qubes OS used btrfs, for example, efficient clones of VMs would be trivial, cp --reflink would work, and FIEMAP (discovery of holes in VM images) would also be implementable.
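For context on what hole discovery looks like from user space, here is a minimal Python sketch. It does not use the FIEMAP ioctl itself; it swaps in the simpler `lseek` interface with `SEEK_DATA` and `SEEK_HOLE`, which Linux exposes for regular files. The function name `data_extents` is hypothetical. On filesystems that do not track holes, the kernel reports the whole file as one data extent, so the result is still correct, only less useful.

```python
import errno
import os


def data_extents(path):
    """Yield (offset, length) for each region the filesystem reports as data."""
    fd = os.open(path, os.O_RDONLY)
    try:
        end = os.fstat(fd).st_size
        pos = 0
        while pos < end:
            try:
                start = os.lseek(fd, pos, os.SEEK_DATA)
            except OSError as e:
                if e.errno == errno.ENXIO:  # only a hole remains after pos
                    break
                raise
            # The end of the file counts as a hole, so this always succeeds.
            stop = os.lseek(fd, start, os.SEEK_HOLE)
            yield start, stop - start
            pos = stop
    finally:
        os.close(fd)
```

A backup tool could read only the reported extents and record the gaps as sparse regions, instead of reading and comparing every block against zero.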
tar still sucks. The file needs to be read whole because the format requires it upfront.
Scrypt is fine. It's effectively the same thing I am doing with the demo program that I posted above, except it doesn't handle sparse files.
Rudd-O
http://rudd-o.com/
See: https://groups.google.com/d/msg/qubes-devel/CZ7WRwLXcnk/u_rZPoVxL5IJ