Please mark this as a toy meant for learning

LoupVaillant commented 2 years ago

The README starts with the following:

Loket is a very simple yet safe encryptor that only requires a key to work.

I can vouch for the simplicity, but it is unfortunately not safe at all. Having reviewed the source code, I can see it falls prey to the classic problem of repeating keys (you explicitly repeat the key when you construct the fullkey), and your shenanigans with the checksum won't prevent the statistical analysis we can do with a simple repeating key. Even worse, I suspect the checksum may help attackers narrow down the space of possible keys before they even started their analysis. You could of course have a key that's as long as the file itself, but that's neither practical nor (from what I gathered) the initial intent.

So as a quick fix, you should replace the sentence above by something like the following:

Loket is a very simple first attempt at file encryption. It has known vulnerabilities, so don't use it for production.

To actually fix your tool, unfortunately, will requires much more work. You have to address at least two problems:

The repeating key problem must be fixed somehow, so the key stream looks like string of independent random bytes.
You must have a proper authentication tag, that can only be reproduced by people who have the right key.

The only way this will work is by using an official trusted, authenticated encryption mechanism. I personally recommend ChaPoly (Chacha20 + Poly1305), as described in RFC 8439. And to get that, you either need to take on a dependency (given your simplicity goals I'd recommend Monocypher, I wrote it in portable C with zero further dependencies)… or write it from spec, which is doable, though it will require quite a bit of work to get it working properly.

In the mean time, if you're interested in cryptography (and not just C), I highly recommend you take a look at Dan Boneh's cryptography course, or Crypto 101.

Maiori44 commented 2 years ago

I'll mark this program as a toy until I can get it to be more safe. However, I am wondering how the checksum will help an attacker, while yes they could use it to narrow down the number of keys to try, the checksum that is stored in the file is not the correct checksum, so wouldn't they get the wrong keys?

LoupVaillant commented 2 years ago

Presumably, the file the attacker is trying to decrypt has the correct checksum. So they could try a key, and it will only work once every 2^16 tries. Come to think of it however that probably won't help the statistical analysis at the beginning. It's more towards the end, when the key is almost or fully decrypted that you could use the checksum to deduce the missing parts or verify that you found the correct key.

Right now the most devastating attack is a known plaintext attack that leads to full key recovery. Say your file is just a stream of zeroes. Then apart from the checksum, your encrypted file is the fullkey. Take the beginning of the fullkey, and voilà, you've recovered the key.

A slightly more sophisticated attack can work if only parts of the plaintext is known. It is very common for instance for files to begin with the same (or very similar) headers. From that known beginning you can easily deduce parts of the key, or even the whole key, then the fullkey, then decrypt everything.

Note that in real systems, it is okay to guess the key stream (the official name for the fullkey in stream ciphers). What's not okay is deducing the key itself from that key stream. One must be able to generate a key stream from a key, but attackers aren't supposed to be able to take that key stream and recover the key. It must be a one way operation, similar to hashes. Overall key streams must have a number of important properties:

The key stream must look random. like uniformly random: no statistical anomaly, no repetition…
Knowing the key stream must not help recover the key.
Knowing parts of the key stream must not help recover the rest.

Maiori44 commented 2 years ago

I see, in that case I'll think of a better algorithm (maybe even use multiple algorithms?), and then use it for the checksum instead of the original key. Thank you for the explenation.

LoupVaillant commented 2 years ago

I'll think of a better algorithm

Note that in practice, coming up with an algorithm on your own is impossible (and the whole reason why everyone is chanting "don't roll your own crypto" all the time). More precisely, there's no way someone isolated can even hope to approach the performance and security of the state of the art. Not even world class researchers. To come up with a new cipher or hash, you pretty much have to study a gazillion things about cryptanalysis so you know how to find flaws in your own design, then you present it to peers… and eventually a number of years later you may end up with something worth using.

Now there are simple algorithms even beginners may come up with, but they're rarely practical (the one time pad for instance involves keys that are used only once, and are as long as the message). What you really want is a short key that can safely encrypt & authenticate many messages. The crux is generating an unlimited number of random-looking key streams from a single key, and that's bloody hard.

It will be obvious from a couple hours of reading introductions to cryptography that it's easier to implement an existing algorithm instead. One way to keep things minimal is using a hash function to generate the key streams. That same hash can be used to authenticate your messages as well. For all this, Blake2 is a good start. (In fact, any hash that is deemed "secure" is good enough". Just be careful about length extension attacks.)

Maiori44 commented 2 years ago

I guess that makes sense, although it's kinda disappointing.

LoupVaillant commented 2 years ago

Well… the depressing part mostly goes for symmetric ciphers and hashes. Some other stuff we can actually invent ourselves (though it's very rarely needed):

Polynomial hashes are one-time-hashes you can use for authentication. They're faster than regular hashes, and their security can be proven. Those proofs are the reason why you may invent your own. Coming up with a security reduction isn't trivial, but at least you can achieve certainty. They do have a big usability problem however. Use them wrong, and all your security goes to shit. Then again, wrapping them up in a nice authenticated encryption API solves the usability problem.

The depressing part with them however is that GHASH and Poly1305 already covers most of the world's needs. GHASH is based on binary fields (a type of Galois field), and is very hardware friendly. On the other hand regular CPUs often lack the instructions needed to make it fast. Poly1305 on the other hand is based on prime fields, and is very fast on a wide range of CPUs without needing special instructions.

Elliptic curves can be found depending on what you want. Their security is more subject to debate, but if you follow the safe curves criteria you can tweak the parameters to what you want. Again, Curve25519 (and to a lesser extent Curve448) cover most of the world's needs, but I believe there's a bit more wiggle room for improvement for some special applications (and finding such applications is an area of active research, there's more than key exchange and signatures out there).

Constructions such as authenticated encryption, authenticated key exchange etc. builds upon the depressingly hard to design primitives, and you can prove that if the primitives are secure, then so is the construction. This is delicate stuff, but definitely not out of reach. Also, there's even more room for improvement here, in some areas I believe we can still improve the state of the art (make things marginally simpler or faster).

If your goal is just to learn, and you don't care about improving the state of the art, polynomial hashes and elliptic curves are certainly things you can find/design safely on your own, provided you come up with rigorous mathematical proofs that you did the right thing. You'll still need some peer review just to be sure, but unlike ciphers and hashes you won't require world class expertise.

Maiori44 commented 2 years ago

Let me see if I got this right, I can create a polynomial hash and use it for the fullkey, am I correct?

LoupVaillant commented 2 years ago

Actually you should authenticate the ciphertext. Meaning the combination of your plaintext and your fullkey. And there are other conditions to watch out for, such as never using the same key twice, which forces you to generate a fullkey that’s as long as the message plus the length of the authentication key.

Maiori44 commented 2 years ago

Ah, so the hash is to replace the checksum?

LoupVaillant commented 2 years ago

Yes, exactly.

Maiori44 commented 2 years ago

Alright, great, now what should I use as an algorithm for encrypting though?

LoupVaillant commented 2 years ago

I have two recommendations:

Chacha20. It’s one of the simplest secure stream cipher out there, if not the simplest. And in many circumstances it’s even faster than AES. It’s main problem is that it’s only a stream cipher, so you’ll need to implement Poly1305 or a more generic hash function like Blake2b to generate the authentication tag.
Blake2b. It’s just a hash, but there is a way to turn hashes into a key expansion mechanism (and therefore a stream cipher): chop up your plaintext message in blocs that are as long as your hash (64 bytes with Blake2b), and for each block, hash together your encryption key, some unique message number (the nonce), and a block number (the position of the message block). Then just XOR the result with your plaintext message. Because of the properties of hash functions, this is a secure way to expand your key into a good fullkey. In fact, this is how Daniel J. Bernstein demonstrated how you could turn a hash function into a cipher, thus destroying US legislation on cipher exportation (block ciphers were restricted, hashes were not). And once you do that, use Blake2b again (this time on the encrypted message & key) to generate the authentication tag.

Method (1) is more complicated, and Poly1305 specifically is not trivial to get right (depending on your method, subtle, hard to test errors may slip in). Great learning tool though, and RFC 8439 is basically the state of the art in authenticated encryption. That’s what I’d recommend for most production systems.

Method (2) will work slower than method (1). If you care about performance this might be a problem. However you only need to implement one easy to test primitive, which should shrink your code quite a bit. If simplicity is what you want to optimise for, this is the way to go.

Maiori44 commented 2 years ago

I will probably use the simpler method, but I'll have to think about it.

Maiori44 / Loket

Please mark this as a toy meant for learning #1