Document security design

infinity0 commented 4 years ago

Hi, in well-typed/cborg#230 I'm comparing various high-level characteristics of Haskell serialisation libraries. One aspect is security - for example I've seen several libraries reject optimisations for security reasons - 1, 2 - what tradeoffs did you make for this library?

tittoassini commented 4 years ago

So far, I haven't looked specifically to security issues.

If you have some specific attack/test case to suggest, I would be very interested to look into it further.

Generally speaking, I suspect that Flat should be fairly resilient to attacks for two reasons:

A) Every input is a valid value.

Consider: data TertiumDatur = False | True | Maybe

This type has only 3 valid values but in all byte-oriented serialisation protocols, it would be serialised as a byte, leaving 253 potential invalid values.

In Flat, that has an optimal bit-oriented encoding, False would be 0, True would be 10 and Maybe would be 11. There are NO invalid values that can be read in.

B) When it comes to data structures, Flat will encode them either as lists, one item at the time, or as arrays that are lists of chunks of up to 256 elements at most, making it harder to play tricks with memory.

infinity0 commented 4 years ago

There are two broad classes I can think of right now - invalid data attacks and resource usage attacks.

The example you gave is a short-range invalid data attack which is easy to detect - although I see no difference between having 253 invalid values that you give a DecodeFailure for, vs 1 invalid value - what do you do with b"11" in your example, for example? OTOH there are also long-range invalid attacks involving assumptions the library might make across more of the data. For example in the links above the issue was, when encoding a set as a sorted stream, during deserialisation one must check that the stream of elements is actually sorted.

Then there are also resource usage attacks, generally it shouldn't be possible to represent an extremely huge data structure in a small amount of serialised bytes, e.g. https://brhfl.com/2017/11/svg-bomb/ - the processing time and space complexity should be proportional to the size of the input. Another example is exploiting a predictable hash function of a hash table to purposefully make it run in O(n) worst-case time for every operation on every single element; one defense here would be to serialise into some predictable canonical order unrelated to the hash function, but choose the hash function non-deterministically at runtime, and deserialise the canonical ordering as if the user had called the main insertion API.

tittoassini commented 4 years ago

Regarding the absence of invalid values, I agree that is a minor point but it is a valid one.

In the case of TertiumDatur the decoding logic would be: if nextBit==0 then False else if nextBit==0 then True else Maybe If nextBit then error "TooMuchData"

So there is really no way to create an invalid value, by design.

Though obviously invalid values are easy to check for also in a byte-oriented system.

Regarding the larger problem of violated assumptions in decoded data, such as sorting, I think that is useful to remind that in 'flat', as in any data formats, data transfer is, at least conceptually if not always in practice, a two steps process: A) The original data type is mapped to a data type supported by the exchange format (e.g. a map might be mapped to a list of tuples, or an ordered set to a simple list) B) The exchange format type is encoded and decoded

Tipically, the exchange format type system is less expressive than the programming language where the data originates from so it cannot enforce property like sorting or hashing properties, this is entirely the responsibility of the upper language layer and, once defined at that level, could be easily ported across exchange/storage systems.

It's a bit like asking if a file system can enforce the property that haskell source files should be correctly formatted. It can't, to a file system a source file it's just a chunk of bytes.

In practice naturally, as the two steps of mapping and exchanging are conflated in the exchange format code it makes sense to ask if the conversion is dealt with correctly.

I will check to see if there are any obvious vulnerabilities in this area.

Regarding the resource usage attack based on some kind of combinatorial explosion, I guess that would require a way to create a loop by referring back to a previous point. I don't see how that could work in the context of a simple ADT encoding system as flat.

I am afraid however that flat might be vulnerable to a simpler attack: just a very long input stream that causes an out of memory error.

This is all quite interesting, and I am quite willing to investigate it further.

It would be useful to have some specific test cases, though.

If you have any concrete example of potentially vulnerable data structure in mind, that would be a great starting point.

Is your interest academic or are you selecting a safe data exchange library for some real-world application?

infinity0 commented 4 years ago

Great, thanks for the clarifications.

this is entirely the responsibility of the upper language layer and, once defined at that level, could be easily ported across exchange/storage systems.

It's a bit like asking if a file system can enforce the property that haskell source files should be correctly formatted. It can't, to a file system a source file it's just a chunk of bytes.

I agree with this for the low-level serialisation framework itself, though these libraries typically also contain instances for common data structures like container - e.g. flat seems to do this OK for Set, and an example of a rejected / insecure approach would be this one.

I am afraid however that flat might be vulnerable to a simpler attack: just a very long input stream that causes an out of memory error.

I think it's not the responsibility of the serialisation framework to protect against this - if the input stream is long, a higher layer can detect this and enforce a limit, e.g. by dividing the input into chunks that can be deserialised individually. However if the stream is say 4KB yet the serialisation framework expands this to use 16GB of memory, then this is not something that higher layers can easily protect against. In general I'd say a linear expansion relative to the input, is acceptable.

I guess that would require a way to create a loop by referring back to a previous point. I don't see how that could work in the context of a simple ADT encoding system as flat.

This sounds safe yes. If there is no way to refer to other parts of the ADT, then the structure must be a tree and therefore acyclic.

Is your interest academic or are you selecting a safe data exchange library for some real-world application?

I'm selecting something for real-world use cases. I'll have to look at the format detail in more depth, to come up with specific examples - I don't think it's easy to construct general concrete examples across all serialisation frameworks, although I attempted to describe some general abstract issues. But it sounds like you would treat those cases as bugs to be fixed, which is at least encouraging - e.g. the store library makes an explicit design choice not to cover such issues.

tittoassini commented 4 years ago

Sure, I would be glad to make 'flat' safer.

Let me know when you come up with some examples we can work on.

tittoassini commented 2 years ago

Feel free to reopen this issue if you want to discuss it further.

Quid2 / flat

Document security design #16