IreneKnapp / modern-data

A self-describing binary data format for dependently-typed object graphs.
MIT License
13 stars 0 forks source link

Hash function #4

Open IreneKnapp opened 9 years ago

IreneKnapp commented 9 years ago

The original design choice was to use a fast function rather than a cryptographically secure one. The creator of the Murmur 3 function put a considerable amount of effort into making accidental collisions quite unlikely, without compromising performance, and I think of that category it's the best choice.

It's worth revisiting the question though. At the very least, there needs to be thought about how severely the system breaks down in the event of a collision, and whether it's feasible to migrate to a new function in future if the choice turns out to have been a bad one.

One future-proofing question is: If it becomes necessary to put a hard version number on the schema format, is it possible to migrate old ones forward? When a schema can be completely loaded, both new and old hashes are always known so they can just be substituted. Are there any important use-cases where a schema file might be incomplete, perhaps because it depends on something else? Obviously the schema format itself explicitly does not have any notion of external resources, but it's meant to allow systems wrapped around it to handle that. It's probably the case that those (completely hypothetical) systems will always have all the dependencies they need, by design... right?

I think this is in the category of worrying too much, but I want to at least document the decision.

tinyplasticgreyknight commented 9 years ago

If there's at all a possibility of the format having multiple versions some day, should there be some sort of node for expressing the version directly? Or should that be out-of-band information?

IreneKnapp commented 9 years ago

Yes, not a node but a stream event - the magic number will convey this information. The stream is supposed to start with it, but it has no meaning in the term language so it doesn't become a node. It's indicated via the magic_number callback in struct modern_stream, although right now that doesn't pass a version or anything, and it probably should.

IreneKnapp commented 9 years ago

As a further thought, the use of hashes to refer to values that have previously been brought into scope is redundant, since everything can be done with let-nodes, and it violates referential transparency.

Removing the name_value and name_type nodes would mean that there's no mechanism for referring to anything that isn't lexically visible via a parent scope, ... and that's perfectly fine. It could naively result in needing to repeat identical node-structures a lot, but this could be addressed by passing around first-class module objects, like records but the fields are bits of schema structure rather than terms.

Having first-class modules means having first-class schema structure, which perhaps means adding a "quote" node? cf. #5 for issues with first-class schema structure, including a discussion of the tradeoff and how there's already a need to find a way to make it work that lets parametricity be relied on when it's needed, even with the status quo, which is not quite first-class.

There is one other use of hashing, not in the design at this moment but potentially quite nice, which is as an abbreviated way to have a file that omits the schema under the assumption that it's already known and there's no need to repeat it, or to refer to bits of schema in a network protocol to ask "do you know this, or should I tell you". So, having a specified hash function might still be nice. Thinking of that as the only application does make the security impact clearer, though...

Imagine having user-provided data that asserts it conforms to "the schema matching this hash". You still wouldn't trust the data to be valid, because of the security implications; you'd run the type checker before accessing it (actually as part of accessing it, that's how the stream processors are meant to work). So if an attacker somehow has a malicious schema that has a hash collision with one you trust, you haven't lost a security property just by reading data.

There is, however, an issue if you're caching bits of schema keyed by their hashes, and could be tricked into using a wrong one for your own data, via cache-poisoning. This is not something Modern Data directly implements, but it's a use-case that it's always meant to support. Obviously, hash functions don't last forever, so it's very clear that that mechanism should be out-of-band, not part of the schema language.

So: No hashes! Except for looking up symbol names for the explicatory processor, of course, which never exposes the properties of the hash function and should therefore be totally fine. Resolved to delete all the name/hash features. Leaving the issue open until that's done. :)