RustCrypto / hashes

Collection of cryptographic hash functions written in pure Rust

There should be a standard way to recursively hash structs #29

Closed pedrocr closed 7 years ago

pedrocr commented 7 years ago

std::hash::Hash can be derived for structs and std::hash::Hasher is a standard hashing interface in Rust. The standard interface only allows 64-bit outputs, but there's nothing stopping extra outputs tailored to specific hashes. So for ergonomic purposes wouldn't it make sense to have an adapter to allow using the Hasher API?

burdges commented 7 years ago

I'd consider this unwise based on previous discussions.

Right now, std::hash::Hasher does not address endianness, making it non-portable. It's okay for HashMap because you can only serialize HashMap as a series of tuples, not as the raw table. It's already not okay for more lossy probabilistic data structures, like a Cuckoo table or Bloom filter, that you likely serialize as the table itself. It's flat wrong for cryptographic hash functions that one uses almost exclusively in protocols or files.

Conversely, all these cryptographic hash functions would destroy the performance of HashMap if used there, so avoid doing that too. SipHasher is already designed to be a sweet spot between being fast and providing the sort of cryptographic assurances HashMap needs when under DoS attacks.

That said, one could provide a version using roughly the endian-hasher crate, unless it's horribly broken.

use std::hash::{Hash, Hasher};
use endian_hasher::*;

struct DigestHasher<'a, T: 'a + ?Sized>(&'a mut T);

impl<'a, T: ?Sized + Input> Hasher for DigestHasher<'a, T> {
    fn finish(&self) -> u64 {  panic!();  }
    fn write(&mut self, bytes: &[u8]) {  self.0.process(bytes);  }
}

pub trait Input {
    /// Digest input data. This method can be called repeatedly
    /// for use with streaming messages.
    fn process(&mut self, input: &[u8]);

    /// Provide data from anything that implements `Hash`, attempting to convert embedded primitive numeric types to little endian.
    fn input_le<H: Hash>(&mut self, hashable: &H) {
        let mut s = HasherToLE(DigestHasher(self));
        hashable.hash(&mut s);
    }

    /// Provide data from anything that implements `Hash`, attempting to convert embedded primitive numeric types to big endian.
    fn input_be<H: Hash>(&mut self, hashable: &H) {
        let mut s = HasherToBE(DigestHasher(self));
        hashable.hash(&mut s);
    }

    /// Provide data from anything that implements `Hash`, leaving primitive numeric types in native endianness.
    ///
    /// Warning: Non-portable, do not use without first correcting the endianness of the input.
    fn input_native<H: Hash>(&mut self, hashable: &H) {
        let mut s = DigestHasher(self);
        hashable.hash(&mut s);
    }
}

As discussed previously, we'd keep DigestHasher private to the digest crate so that it cannot be used with HashMap. And we'd make the finish method panic for good measure.

There is still an issue here in that we trust H: Hash to handle primitive numeric types by calling Hash instances recursively, which it might fail to do. We could lay the blame for such portability bugs on whoever wrote H: Hash, perhaps, but they sound tricky to find.

I could hack up a pull request for this if @newpavlov thinks it wise. I think the first question is : Should the endian-hasher crate live or die?

burdges commented 7 years ago

I'm tired right now, but I think the argument against my alternative goes roughly : You should document your protocols and file formats better by using a more explicit serialization. That could just be a warning somewhere.

tarcieri commented 7 years ago

Sorry to be this blunt, but this is a terrible idea.

The standard interface only allows 64-bit outputs, but there's nothing stopping extra outputs tailored to specific hashes

What? The trait definition stops this, because types. Nor is there any standard notion of how to map a cryptographic hash function's digest onto a 64-bit integer.

But beyond that, cryptographic hash functions utilized by std::hash::Hasher should be PRFs (i.e. keyed by a secret unguessable by an attacker), because hashDoS. There is no point in using a hash function which is both slow (because crypto) and not a PRF (because security).

The fundamental tradeoff is:

1) A hash function that is as fast as possible but doesn't hash attacker-controlled data. This omits every cryptographic hash function, because they are slow because they act as random oracles.
2) A cryptographic PRF keyed by unguessable data to prevent hashDoS.

To me, if your goal is #2, anything slower than SipHash is unacceptable, because you still want a std::hash::Hasher to be as fast as possible given other constraints.

Do you disagree? Can you explain the use case where you want something that's simultaneously slow and insecure?

pedrocr commented 7 years ago

@burdges the endianness issues are indeed a problem. If there's no way to fix those it's unfortunate. I'm not sure I understand why this happens. Are primitive types hashed by feeding the hasher one u8 at a time? Because the Hasher interface includes everything from u8 to u128, so it would seem that as long as Hash implementations use the full type it should be fine. Why doesn't that work?

I'll have a look if your suggestion works for my use case.

@tarcieri What? The trait definition stops this, because types. Nor is there any standard notion of how to map a cryptographic hash function's digest onto a 64-bit integer.

Nothing stops me from doing a .finish_256bit() and having that give me the SHA-256 of my struct. It's not part of the trait, but I don't care; I just want to be able to use the Hash API as the plumbing to get all the values to the Hasher.

You're assuming I want to use the normal Hasher API with SHA-256 in a HashMap or something like that. That would indeed be a bad idea; SipHash is strictly better in that application. What I need is a longer hash that uniquely identifies a certain struct over time, for caching and serialization. Think of how git uses hashes (content addresses), not how hash tables use hashes. Here's my current code:

https://github.com/pedrocr/rawloader/blob/c0a017590cc94d21416cbe4f4fe1fce84f4daaf6/src/imageops/mod.rs#L206-L225

I generate hashes that represent the state of the image pipeline up to that point as a function of its settings. I'm currently using MetroHash but I want at least 128 bits and a crypto hash for this. I'm only using the Hash trait for the convenience of #[derive(Hash)] and friends. But maybe I just need a #[derive(CryptoHash)].
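For concreteness, a hypothetical stand-in for the linked rawloader code (the struct, field names, and use of DefaultHasher here are invented for illustration, not the actual code):

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical stand-in: the settings struct derives Hash and its 64-bit hash
// serves as a cache key for a pipeline stage. The ask in this thread is the
// same pattern, but with a wider cryptographic hash instead of a 64-bit Hasher.
#[derive(Hash)]
struct OpSettings {
    width: usize,
    height: usize,
    gamma_x1000: u32, // integer stand-in, since f32 doesn't implement Hash
}

fn cache_key(settings: &OpSettings) -> u64 {
    let mut hasher = DefaultHasher::new();
    settings.hash(&mut hasher);
    hasher.finish()
}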

burdges commented 7 years ago

It works so long as people use Hasher recursively, but if any impl of Hash casts a &[u64] to a &[u8] then it'll break. And someone might wind up doing this for performance if they mistakenly examine performance in unoptimized builds.

You can always locally employ a version of DigestHasher wrapper that makes a cryptographic hash function look like a Hasher, either with or without my endianness fix. I think the question is if my input_le and input_be functions make sense.

In fact, I doubt #[derive(Hash)] has stabilized field order, which provides a much better argument against adding my input_le and input_be functions to the Input trait. You could warn people that they should implement Hash themselves, not derive it.

I think those are the arguments against input_le and input_be : Bad impls for Hash break them. #[derive(Hash)] might not have stabilized field order. And you can document your protocol better by using a more explicit serialization. Iffy but not necessarily fatal. I donno.

pedrocr commented 7 years ago

@burdges It seems the standard implementation does just that:

https://github.com/rust-lang/rust/blob/c6afde6c46d167c9c75389b887f1d2498aeef3e4/src/libcore/hash/mod.rs#L506-L510

So it seems Hash is just too fundamentally broken for the serialization use case. Is there any derivable trait somewhere that can help with that? Something that just recursively hashes the underlying bytes of all types with a fixed endianness would be enough.
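A minimal illustration of the native-endianness issue, using only std (the byte-collecting hasher is made up for the demonstration):

use std::hash::{Hash, Hasher};

// Made-up hasher that just records the bytes it is fed, to show what a
// digest wrapped in a Hasher would actually see.
struct ByteCollector(Vec<u8>);

impl Hasher for ByteCollector {
    fn finish(&self) -> u64 { 0 }
    fn write(&mut self, bytes: &[u8]) { self.0.extend_from_slice(bytes); }
}

fn main() {
    let mut h = ByteCollector(Vec::new());
    0xAABBCCDDu32.hash(&mut h);
    // Hasher's default write_u32 forwards native-endian bytes, so this prints
    // [dd, cc, bb, aa] on little-endian targets and [aa, bb, cc, dd] on
    // big-endian ones: the same value, two different hash inputs.
    println!("{:02x?}", h.0);
}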

burdges commented 7 years ago

Yikes, I completely missed that! I'm tempted to raise that as a libs issue. In any case, I suppose serde should be considered the right way to do this for now.

pedrocr commented 7 years ago

@burdges That's very interesting. How would you use serde for this? All my structs already implement Serialize and Deserialize. Is there a standard way to just feed that to SHA-256 without first going through YAML or something like that?

tarcieri commented 7 years ago

@pedrocr perhaps something like this is what you want:

https://github.com/cryptosphere/objecthash-rs

Note that I never wrote a procedural macro for it. I also intend to do something slightly different and a bit more fleshed out soon.

pedrocr commented 7 years ago

@tarcieri yes, this is precisely my use case, thanks. Wouldn't it be easier to just write a serde Serializer that feeds a Digest? That way #[derive(Serialize)] is all that's needed and any crypto hash can be used?

tarcieri commented 7 years ago

@pedrocr it may be possible to finagle it through serde visitors.

I would advise against using it for now though. I'll have a better library out soon.

pedrocr commented 7 years ago

@tarcieri it seems objecthash solves a bigger problem of having redactable structs. For me that's not an issue so maybe just writing a Serializer is simpler.

burdges commented 7 years ago

Yes, even using the bincode crate still involves an allocation, but one could write a Serializer for digests to avoid allocation, maybe even one that agreed with the results of bincode.

As an aside, I'm pretty happy doing all my cryptographic hashing manually so far, because frequently if I need a cryptographic hash then it actually needs data from multiple sources, or maybe should not cover all the data present in the struct provided. YMMV

tarcieri commented 7 years ago

@pedrocr as is, the Rust implementation doesn't support redaction, but yes, redaction/Merkle inclusion proofs are one of the nice features you get out of a hashing scheme like that.

pedrocr commented 7 years ago

@burdges cool, I'll have a look at this then. Would a PR with something like an adapter between Serializer and Digest make sense?

As for doing it manually I was hoping to do it automatically as these structs are specifically designed to be all the settings for the image operation and nothing more. We'll see :)

burdges commented 7 years ago

I found an easy way : You impl io::Write for a local wrapper struct Inputer<'a, I: 'a + ?Sized>(&'a mut I) where I: Input; and provide a method with a Serialize argument that calls bincode::serialize_into with the Inputer wrapper.

It'll look vaguely like what I wrote above, but using io::Write instead of Hasher and serde::Serialize instead of Hash. Now serialize_into should avoid the Vec allocation in bincode::serialize.
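A rough sketch of that wrapper, assuming bincode 1.x's serialize_into(writer, value) and the Input trait from the earlier snippet (repeated here so the sketch is self-contained):

use std::io;

use serde::Serialize;

/// Same trait as in the earlier snippet, repeated so this compiles standalone.
pub trait Input {
    fn process(&mut self, input: &[u8]);
}

// Local wrapper: anything implementing `Input` can be driven through
// `io::Write`, so bincode can stream straight into the digest without an
// intermediate Vec.
struct Inputer<'a, I: 'a + ?Sized>(&'a mut I);

impl<'a, I: 'a + Input + ?Sized> io::Write for Inputer<'a, I> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.0.process(buf);
        Ok(buf.len())
    }
    fn flush(&mut self) -> io::Result<()> { Ok(()) }
}

/// Feed anything `Serialize` into the digest via bincode's wire format.
fn input_serialized<I: Input + ?Sized, S: Serialize>(digest: &mut I, value: &S) -> bincode::Result<()> {
    bincode::serialize_into(Inputer(digest), value)
}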

Also, your original question could now be : Why not implement io::Write? Or could Input be replaced with io::Write even?

pedrocr commented 7 years ago

@burdges sounds good, avoids having to write a full Serializer. It does add more indirection though.

pedrocr commented 7 years ago

@burdges also is bincode guaranteed to always send the same bytes? Or can it do things like reorder fields between versions? It may be safer to just implement the Serializer to guarantee no funny business happens that invalidates the hash.

tarcieri commented 7 years ago

@pedrocr you seem to be running afoul of the canonicalization problem. One of the nice things about objecthash-like schemes is you don't need to canonicalize prior to hashing. Structurally similar objects will always have the same hash.

pedrocr commented 7 years ago

@tarcieri see this for my conclusions on the topic:

https://internals.rust-lang.org/t/f32-f64-should-implement-hash/5436/33

My worry in this case is only that the same exact struct will hash differently because bincode changes. Making sure that reordering fields doesn't change the hash for example isn't as important. Since my structs already implement Serialize using that would mean one less derive to worry about.

Having a crate that does derive and allows me to use any hash would be nice though so please do ping me once you have something to test. The other features can't hurt either.

burdges commented 7 years ago

Yes, serde explicitly supports non-self-describing formats, naming bincode as the example.

There are various instructions for impls of Serialize and Deserialize to support non-self-describing formats, so those can break bincode of course.

Also bincode never emits anything about the fields besides lengths and contents. I'd imagine bincode would use a feature or a major version update if they broke binary compatibility, but who knows.

pedrocr commented 7 years ago

@burdges so hashing bincode does seem fairly reasonable. Will have a look at that then.

newpavlov commented 7 years ago

@pedrocr Sorry I couldn't join the discussion earlier. As I understand it, your problem can be solved by a separate crate (bincode + some glue code, or tarcieri's future crate) and you do not currently require any changes in the digest crate? In that case I think this issue can be closed.

burdges commented 7 years ago

There is a lingering question as to whether the Input trait should be replaced by std::io::Write or something like that. I haven't thought about it really myself.

newpavlov commented 7 years ago

@burdges There are two problems with using std::io::Write. First: there is no core::io, so it conflicts with the goal to support no_std whenever possible. Second: its methods return Result while hashes can process any data without an error, so using Write will result in a lot of useless unwraps.

I think the current digest_reader is more than enough for convenient interfacing with the Read implementors.

pedrocr commented 7 years ago

It would also be nice if a Serializer was written so that one could hash straight from serde without using bincode. But that would just be a convenience that should probably be behind a feature flag.

tarcieri commented 7 years ago

Should this issue be closed? I think it's semantically incorrect to implement Hasher for this purpose, and there are plenty of other APIs more appropriate for hashing structured data in a cryptographically meaningful way.

pedrocr commented 7 years ago

@tarcieri the core of the issue is still unresolved. I've changed the title to reflect that. Right now the best way to use crypto hashes on structs is serde->bincode->Digest which is far from ideal.

newpavlov commented 7 years ago

@pedrocr Am I correct that you want to add the following method to the Digest trait, or something similar?

fn digest_serialized<S: Serialize, F: Serializer>(s: S, f: F) -> Result<Output<Self::OutputSize>, F::Error>;

UPD: edited code a bit, removed bincode method

pedrocr commented 7 years ago

@newpavlov Don't know why you'd want the bincode one; the idea would be to not need bincode at all to be able to hash something that implements Serialize. I also don't think the first one is the correct one. From what I've seen, what's needed is some sort of DigestSerializer that implements the Serializer trait from serde by feeding all the inputs into a Digest.

pedrocr commented 7 years ago

Also I'm not fully convinced the correct solution for this problem is with serde at all. There are very likely cases where you want to serialize a field that you don't want to include in the hash, and vice versa.

newpavlov commented 7 years ago

@pedrocr To hash data you first need to convert it to a stream of bytes, i.e. serialize it. I don't think it would be right to choose or to develop a serializer inside the digest crate.

So I think it's better to start with a separate crate which will utilize the Digest trait (or just one hash) and some serialization strategy (be it bincode or something hand-written) to implement a digest version of Serializer.

I don't mind having this crate as part of the RustCrypto project, but I won't be able to work on it, as I currently have other priorities for the project.

P.S.: After reading some docs it seems that the code I wrote in the previous comment will not work, as it's impossible to extract a Writer from a serializer and there are currently no traits describing what we need.

tarcieri commented 7 years ago

As someone working specifically on structured hashing schemes, I'd also suggest trying to collaborate on ones which are reusable across multiple languages instead of making yet another one which is proprietary to Rust.

pedrocr commented 7 years ago

@newpavlov To hash data you first need to convert it to a stream of bytes, i.e. serialize it

Not exactly: a serialization requires the possibility of deserializing, and thus includes more bytes than strictly needed. A Serializer that just feeds a Digest is by definition one-way, so the only thing it needs to do is convert all fields to bytes in order and feed the hash; it doesn't have to do much serialization. But I agree that Serialize is the wrong abstraction for this, as you will not always want to serialize the same fields as you want to hash and vice versa.

So I think it's better to start with a separate crate which will utilize Digest (or just one hash) and some serialization strategy (be it bincode or something hand-written) to implement a digest version of Serializer.

Hashing the bincode serialization is trivial to do, that's what I've already done in my project. Publishing that as a mini-crate would be easy if there's interest.
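As a sketch of what that mini-crate's core might look like, assuming bincode 1.x and the current sha2 crate API:

use serde::Serialize;
use sha2::{Digest, Sha256};

// Sketch of the serde -> bincode -> Digest route: serialize into a Vec with
// bincode's default options, then hash the resulting bytes with SHA-256.
fn hash_struct<T: Serialize>(value: &T) -> bincode::Result<Vec<u8>> {
    let bytes = bincode::serialize(value)?;
    Ok(Sha256::digest(&bytes).to_vec())
}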

@tarcieri As someone working specifically on structured hashing schemes, I'd also suggest trying to collaborate on ones which are reusable across multiple languages instead of making yet another one which is proprietary to Rust.

Structured hashing schemes are not the same thing as hashing structs. It would be nice if Hash was usable for hashing structs independently of structured hashing schemes. Since that is broken in multiple ways it would probably be nice if there was a usable #[derive(CryptoHash)] somewhere. But since serde->bincode->Digest works well enough it's probably what we'll end up with.

tarcieri commented 7 years ago

Structured hashing schemes are not the same thing as hashing structs.

"Recursively hashing structs" is pretty much the definition of a structured hashing scheme. However, you seem to be using it to mean "a content hash for serialized bincode" which, to me, is something very different from "recursively hashing structs"

It would be nice if Hash was usable for hashing structs independently of structured hashing schemes.

As covered earlier, this is simply the wrong tool for the job.

Since that is broken in multiple ways it would probably be nice if there was a usable #[derive(CryptoHash)] somewhere. But since serde->bincode->Digest works well enough it's probably what we'll end up with.

So you want to invent your own scheme, which is what I was recommending against. At least think about taking one of these two paths as you invent your own scheme:

1) Canonicalization: how do you deal with hashing str and String? (i.e. do you apply Unicode canonicalization first?) Someone will want to hash types like HashMap eventually. Make sure there's a canonical representation.
2) Structured hashing: nicely sidesteps the canonicalization problem while providing a bunch of added value at the cost of relatively little complexity. These schemes can be as fast or faster than canonicalization schemes, because they avoid serializing the data as an intermediate step.
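To make the second option concrete, here is a toy sketch of type-tagged, recursive hashing in the spirit of objecthash (the tags and layout are invented here, not objecthash's actual wire format), using the sha2 crate:

use sha2::{Digest, Sha256};

// Toy structured hashing: every value is hashed under a type tag, and
// composite values hash the hashes of their parts, so differently shaped
// data cannot collide by flattening to the same byte stream.
fn hash_tagged(tag: u8, data: &[u8]) -> [u8; 32] {
    let mut h = Sha256::new();
    h.update([tag]);
    h.update(data);
    h.finalize().into()
}

fn hash_str(s: &str) -> [u8; 32] {
    hash_tagged(b'u', s.as_bytes()) // 'u' = unicode string, tag chosen arbitrarily
}

fn hash_list(items: &[[u8; 32]]) -> [u8; 32] {
    let mut h = Sha256::new();
    h.update([b'l']); // 'l' = list
    for item in items {
        h.update(item);
    }
    h.finalize().into()
}

fn main() {
    let a = hash_list(&[hash_str("foo"), hash_str("bar")]);
    let b = hash_list(&[hash_str("foob"), hash_str("ar")]);
    assert_ne!(a, b); // structure is part of the hash
}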

Done correctly you can solve all of the problems I just mentioned and have a format that works across languages. Or you can invent your own thing that's specific to Rust, ala bincode, but please think through the issues that have been addressed in all of the content hashing schemes that have been created in the past.

You are walking well-worn ground here, where many mistakes have been made by past naive attempts and much hard-won knowledge has been accrued. Before you go invent something, please make sure you study the past.

pedrocr commented 7 years ago

@tarcieri "Recursively hashing structs" is pretty much the definition of a structured hashing scheme. However, you seem to be using it to mean "a content hash for serialized bincode" which, to me, is something very different from "recursively hashing structs"

I've stated multiple times I don't want to use bincode for anything and it's just a hack to get it to work right now. So no, I don't want to hash bincode, all I want is a recursive crypto hash of arbitrary rust structs. I don't want to make any canonical representations of data. I don't want to serialize and then hash. I don't want to invent a hashing scheme, just have a hashing trait that isn't broken like the default Hash.

tarcieri commented 7 years ago

all I want is a recursive crypto hash of arbitrary rust structs

I don't want to serialize and then hash

Then, by definition, you need an algorithm that knows how to compute cryptographic hashes of structured data.

Better make sure it isn't vulnerable to second-preimage attacks.

pedrocr commented 7 years ago

@tarcieri as far as I can tell that only applies if you are hashing individual structs and then hashing the concatenated hashes. I don't want that at all. I just want to initialize the hash, pass in all the values recursively, and finalize the hash. This has the same properties as serializing and then hashing without the performance penalty.

tarcieri commented 7 years ago

You need to domain separate the values somehow, either by length or by structure, and in either case you'll want to encode the types. The former pretty much involves a bincode-like scheme, and the latter an objecthash one.

Otherwise, you'll have ambiguities where different structures compute the same hash.
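A minimal example of the kind of ambiguity meant here, assuming the sha2 crate and no domain separation at all:

use sha2::{Digest, Sha256};

// If fields are fed to the digest as raw bytes with no length prefixes or
// separators, differently structured values produce the same input stream
// and therefore the same hash.
fn naive_hash(fields: &[&str]) -> [u8; 32] {
    let mut h = Sha256::new();
    for f in fields {
        h.update(f.as_bytes()); // no length prefix, no separator
    }
    h.finalize().into()
}

fn main() {
    assert_eq!(naive_hash(&["foo", "bar"]), naive_hash(&["foob", "ar"]));
}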

pedrocr commented 7 years ago

You need to domain separate the values somehow, either by length or by structure, and in either case you'll want to encode the types.

None of these are an issue for me.

Otherwise, you'll have ambiguities where different structures compute the same hash.

That's how Hash already works and is fine for me.

tarcieri commented 7 years ago

Ok, so it sounds like you want to abandon security... in which case why are you using a cryptographic hash?

pedrocr commented 7 years ago

@tarcieri none of these abandon security, but even if they did, I'm using hashes for caching. All I need is a statistical guarantee of non-collision absent an attacker.

tarcieri commented 7 years ago

If you're using hashes like this in any sort of security context and have ambiguities in a structured hashing format, an attacker can exploit them to trick you into accepting documents containing e.g. nested content and have it verify as if it was the same as a non-nested message. Second preimage attacks on naive Merkle Trees are an example of this, and it's similar in concept to other attacks regarding confusion of message structure such as packet injection attacks.

If you want to build a naive hashing scheme using cryptographically secure hashes for your own purposes, that's fine, but I would be strongly opposed to the inclusion of naive schemes in cryptography libraries, because people generally expect cryptography libraries to be built for security purposes.

pedrocr commented 7 years ago

@tarcieri I suggest you bring this up with the Rust lib developers then. All I want is to get a hash calculated the exact same way Hash already works. I only want a crypto hash because 64 bits are not enough collision protection for the application. If there's a way to create collisions easily with that kind of interface, Rust in general is vulnerable to them, and that can be exploited to DoS systems at the very least. You haven't given an example of how that would happen though. All the cases you've pointed out are hashes-of-hashes situations.

tarcieri commented 7 years ago

I suggest you bring this up with the rust lib developers then.

Hash isn't intended to be a cryptographic hash for identifying content. It's for use in hash-like data structures, where collisions are easily tolerated by the underlying algorithms. The main threat to those is hashDoS, where an attacker can deliberately exploit repeated collisions to the same value. The Rust core developers addressed those by using SipHash, a PRF with a random key.

You're talking about creating a CryptoHash trait, which to me would ostensibly be for the purposes of secure content authentication. If you don't want it to be about security, I would suggest taking "crypto" out of the name.

You haven't given an example of how that would happen though.

These same attacks can happen without hashes-of-hashes. As I pointed out the general attack is one of failure to disambiguate message structures. Packet injection is the same basic idea, without any hashing involved whatsoever. That's a parsing ambiguity, but the same concept applies: the security system is confused about the message structure, allowing an attacker to trick it into doing something it shouldn't.

In my opinion if any messages with different structures can realistically produce the same content hash, the scheme is broken from a security perspective.

pedrocr commented 7 years ago

The Rust core developers addressed those by using SipHash, a PRF with a random key.

From what I've seen the current implementation doesn't actually randomize the key. But even if it did that's not relevant if it's trivial to generate a collision by just manipulating the structure.

These same attacks can happen without hashes-of-hashes. (...) In my opinion if any messages with different structures can realistically produce the same content hash, the scheme is broken from a security perspective.

I'd agree; I just haven't found a case where that can happen with a setup like that of Hash. I'm curious to see one, as I believe it would be a bug in Hash as well (or in some implementations of the trait).

tarcieri commented 7 years ago

From what I've seen the current implementation doesn't actually randomize the key.

Hash is just a trait. Some implementations defend against hashDoS and some don't.

According to the documentation, the implementation used by HashMap specifically implements the randomly-keyed cryptographic hash strategy (and uses std::hash::SipHasher I believe):

https://doc.rust-lang.org/std/collections/struct.HashMap.html

By default, HashMap uses a hashing algorithm selected to provide resistance against HashDoS attacks. The algorithm is randomly seeded, and a reasonable best-effort is made to generate this seed from a high quality, secure source of randomness provided by the host without blocking the program
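A quick way to see the per-instance seeding, using only std:

use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

// Each RandomState (the default BuildHasher behind HashMap) is created with
// its own random keys, so the same value generally hashes to different
// results under two fresh states, unlike a bare DefaultHasher::new(), which
// always starts from fixed keys.
fn hash_with(state: &RandomState, value: impl Hash) -> u64 {
    let mut hasher = state.build_hasher();
    value.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let (a, b) = (RandomState::new(), RandomState::new());
    println!("{} vs {}", hash_with(&a, "foo"), hash_with(&b, "foo"));
}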

pedrocr commented 7 years ago

At least on my machine the hash is not seeded. This always returns the same value:

https://gist.github.com/pedrocr/8d0283bed3f56e5cb6fb9fe0a785f947

But that's not relevant to the point, as I specifically mentioned. If you can generate structs that have the same hash because you manipulate the structure, it doesn't matter that your hash function is randomly keyed. Within the same execution of the program those structs will collide, just on different values. And that's what would be the bug in Hash.

tarcieri commented 7 years ago

At least on my machine the hash is not seeded.

Use a HashMap and I expect you'll see different results. Hash itself is NOT designed to be a security primitive.

And to reiterate yet again, in the context where Hash security matters (i.e. HashMap and SipHasher), we're operating under a very different threat model where collisions are tolerated. Hash-like data structures naturally accommodate collisions, but are susceptible to an algorithmic attack if they rebalance to the same value repeatedly.

You're conflating threat models here: Hash and CryptoHash operate under different threat models.

In the one I would find acceptable for CryptoHash, collisions for attacker-controlled documents are not acceptable.

I'm not sure this is continuing to be a productive conversation: I would like the threat of ambiguous messages to be solved strategically for anything named CryptoHash. This is a well-studied problem in cryptography with a limited number of solutions.

I would find a hand-rolled naive and ambiguous scheme unacceptable. You haven't made a concrete proposal, but you're downplaying everything you need to do to solve these problems from a security perspective, and that's why I'm worried.

pedrocr commented 7 years ago

Hash-like data structures naturally accommodate collisions, but are susceptible to an algorithmic attack if they rebalance to the same value repeatedly.

And that's the attack that, if it works with CryptoHash, will also break Hash. HashMap tolerates collisions if they are fully random. After all, it only has 64 bits of hash output. But if it's possible to generate a collision by just manipulating structure, Hash and CryptoHash are equally broken.

I would find a hand-rolled naive and ambiguous scheme unacceptable. You haven't made a concrete proposal, but you're downplaying everything you need to do to solve these problems from a security perspective, and that's why I'm worried.

My proposal is very simple. Do exactly what Hash does but feed it to a crypto hash. I've yet to see an attack that breaks that, but it may exist. If it does, Hash should probably be fixed as well. So far the only case I've seen that's strange is that hashing a zero-sized struct is a no-op. Not even that is exploitable in Rust as far as I can tell. Other than that, things like ("foo","bar") and ("foobar","") hash to different values already. Do you have an example where this breaks down?