Closed infinity0 closed 2 years ago
So far, I haven't looked specifically to security issues.
If you have some specific attack/test case to suggest, I would be very interested to look into it further.
Generally speaking, I suspect that Flat should be fairly resilient to attacks for two reasons:
A) Every input is a valid value.
Consider: data TertiumDatur = False | True | Maybe
This type has only 3 valid values but in all byte-oriented serialisation protocols, it would be serialised as a byte, leaving 253 potential invalid values.
In Flat, that has an optimal bit-oriented encoding, False would be 0, True would be 10 and Maybe would be 11. There are NO invalid values that can be read in.
B) When it comes to data structures, Flat will encode them either as lists, one item at the time, or as arrays that are lists of chunks of up to 256 elements at most, making it harder to play tricks with memory.
There are two broad classes I can think of right now - invalid data attacks and resource usage attacks.
The example you gave is a short-range invalid data attack which is easy to detect - although I see no difference between having 253 invalid values that you give a DecodeFailure for, vs 1 invalid value - what do you do with b"11"
in your example, for example? OTOH there are also long-range invalid attacks involving assumptions the library might make across more of the data. For example in the links above the issue was, when encoding a set as a sorted stream, during deserialisation one must check that the stream of elements is actually sorted.
Then there are also resource usage attacks, generally it shouldn't be possible to represent an extremely huge data structure in a small amount of serialised bytes, e.g. https://brhfl.com/2017/11/svg-bomb/ - the processing time and space complexity should be proportional to the size of the input. Another example is exploiting a predictable hash function of a hash table to purposefully make it run in O(n) worst-case time for every operation on every single element; one defense here would be to serialise into some predictable canonical order unrelated to the hash function, but choose the hash function non-deterministically at runtime, and deserialise the canonical ordering as if the user had called the main insertion API.
In the case of TertiumDatur the decoding logic would be: if nextBit==0 then False else if nextBit==0 then True else Maybe If nextBit then error "TooMuchData"
So there is really no way to create an invalid value, by design.
Though obviously invalid values are easy to check for also in a byte-oriented system.
Tipically, the exchange format type system is less expressive than the programming language where the data originates from so it cannot enforce property like sorting or hashing properties, this is entirely the responsibility of the upper language layer and, once defined at that level, could be easily ported across exchange/storage systems.
It's a bit like asking if a file system can enforce the property that haskell source files should be correctly formatted. It can't, to a file system a source file it's just a chunk of bytes.
In practice naturally, as the two steps of mapping and exchanging are conflated in the exchange format code it makes sense to ask if the conversion is dealt with correctly.
I will check to see if there are any obvious vulnerabilities in this area.
I am afraid however that flat might be vulnerable to a simpler attack: just a very long input stream that causes an out of memory error.
It would be useful to have some specific test cases, though.
If you have any concrete example of potentially vulnerable data structure in mind, that would be a great starting point.
Is your interest academic or are you selecting a safe data exchange library for some real-world application?
Great, thanks for the clarifications.
this is entirely the responsibility of the upper language layer and, once defined at that level, could be easily ported across exchange/storage systems.
It's a bit like asking if a file system can enforce the property that haskell source files should be correctly formatted. It can't, to a file system a source file it's just a chunk of bytes.
I agree with this for the low-level serialisation framework itself, though these libraries typically also contain instances for common data structures like container - e.g. flat seems to do this OK for Set, and an example of a rejected / insecure approach would be this one.
I am afraid however that flat might be vulnerable to a simpler attack: just a very long input stream that causes an out of memory error.
I think it's not the responsibility of the serialisation framework to protect against this - if the input stream is long, a higher layer can detect this and enforce a limit, e.g. by dividing the input into chunks that can be deserialised individually. However if the stream is say 4KB yet the serialisation framework expands this to use 16GB of memory, then this is not something that higher layers can easily protect against. In general I'd say a linear expansion relative to the input, is acceptable.
I guess that would require a way to create a loop by referring back to a previous point. I don't see how that could work in the context of a simple ADT encoding system as flat.
This sounds safe yes. If there is no way to refer to other parts of the ADT, then the structure must be a tree and therefore acyclic.
Is your interest academic or are you selecting a safe data exchange library for some real-world application?
I'm selecting something for real-world use cases. I'll have to look at the format detail in more depth, to come up with specific examples - I don't think it's easy to construct general concrete examples across all serialisation frameworks, although I attempted to describe some general abstract issues. But it sounds like you would treat those cases as bugs to be fixed, which is at least encouraging - e.g. the store library makes an explicit design choice not to cover such issues.
Sure, I would be glad to make 'flat' safer.
Let me know when you come up with some examples we can work on.
Feel free to reopen this issue if you want to discuss it further.
Hi, in well-typed/cborg#230 I'm comparing various high-level characteristics of Haskell serialisation libraries. One aspect is security - for example I've seen several libraries reject optimisations for security reasons - 1, 2 - what tradeoffs did you make for this library?