HowardHinnant / hash_append

Implementation of hash_append proposal
67 stars 12 forks source link

uhash should take a seed #6

Open pdimov opened 7 years ago

pdimov commented 7 years ago

uhash should have a constructor that takes a seed. One possible (performance-oriented) implementation could then xor the seed with h before returning it from operator(). Combined with initializing the seed to 0 in the default constructor, this preserves existing behavior without introducing a branch and an additional call to hash_append.

It's of course possible to seed in a hash adapter, as explained, but this makes seeded hashing second-class, something that the user needs to work for, and it needs to be easy.

Seeding hash functions is all the rage nowadays among the security-savvy, and without explicit support, the standard library may well decide to do it on the container level, which makes it impossible for the user to influence it or supply a seed. It would be better for those standard library implementations to have the option to process-wide seed in the default constructor of uhash instead, in which case the user would be able to override it.

HowardHinnant commented 7 years ago

Here's the rationale for why seeding isn't explicitly handled: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3980.html#seeding

tl;dr: There is more than one good way to do this (per-process or per-hash-functor), and both are easy to do. So this proposal declined to specify what is the best way for everyone.

Also, the various seedable hash algorithms typically use the seed in an algorithm-specific way to initialize the algorithm's state. Blindly taking a seed and xor-ing that seed to any algorithm's result sounds dangerous to me. Hash algorithms are a black art that I am nearly completely ignorant of. The point of this proposal is to not mess with hash algorithms, just enable clients to use them as their author's designed, without modifications or compromises. I.e. to not do, require, or even enable anything like boost::hash_combine.

pdimov commented 7 years ago

Yes, I did read this section; it would be much simplified by my suggested addition.

Blindly taking a seed and xor-ing that seed to any algorithm's result sounds dangerous to me.

For any reasonable "good hash" criterion you give me, I'll give you a mathematical proof that it also holds for h^const.

That said, my initial impulse was to hash_append the seed, instead of xor'ing it, but Vinnie Falco objected that this would impose performance costs.

Also, the various seedable hash algorithms typically use the seed in an algorithm-specific way to initialize the algorithm's state.

Yes, that's true, and to take advantage of algorithm-specific ways you have to write an algorithm-specific hasher as per the aforequoted section. We're here talking about the universal hasher, which by its universal nature is not algorithm-specific.

Putting the responsibility to support seeding on the hash algorithm may be more principled, but from user's point of view, in a seeded scenario it's much more convenient if all hash algorithms worked, so that the user can easily try them and choose the best one.

HowardHinnant commented 7 years ago

I think one of my big mistakes with N3980 was too much rambling, and not a clear presentation of what was actually being proposed. It implied more complication than there is. If I ever get around to a revision, I will try to greatly simplify the presentation.

Most of the hash algorithms that are seedable use (or can be made to use) a generic syntax for seeding (construct the algorithm with a seed). So it seems logical to take advantage of that. The proposal could include both a per-process seed hasher, and a per-instance seed hasher, but that adds complication where I need to take it away. But maybe it is worth it.

The per-instance seed hasher turns a state-less hasher into a stateful one, adding space overhead to every container. Plus gcc-5.2 has a bug that doesn't handle stateful hashers. I'm not sure if it has been fixed yet.

The per-process seed hasher requires a singleton in the source to retrieve the seed from.

Your suggestion makes uhash stateful, whether or not the seed is used. So if the lib is going to supply a stateful seeded hasher, I think it has to be a different type. I don't want to give up the option of the stateless hasher.

pdimov commented 7 years ago

There are alternative approaches to this, but I don't like them. Such as for instance uhash<hasher, int> having a constructor taking int and constructing hasher with it in op() instead of always default-constructing.

It'd be better if authors of hash algorithms weren't required to have a constructor taking a seed. Although I suppose that we could do that for the standard algorithms we define.

pdimov commented 7 years ago

Something like

template <class Hasher, class... Args> class uhash
{
private:

    std::tuple<Args...> args_;

public:

    using result_type = typename Hasher::result_type;

    explicit uhash( Args... args ): args_( args... )
    {
    }

    template <class T> result_type operator()(T const& t) const noexcept
    {
        Hasher h = std::make_from_tuple<Hasher>( args_ );
        hash_append(h, t);
        return static_cast<result_type>(h);
    }
};
pdimov commented 7 years ago

Or, after looking at your examples more closely, make that

template<class... U> explicit uhash(U&&... u): args_(std::forward<U>(u)...) {}

and then

template <class HashAlgorithm = acme::siphash>
struct process_seeded_hash: uhash<HashAlgorithm, size_t, size_t>
{
    process_seeded_hash(): uhash(get_process_seed()) {}
};

and

template <class HashAlgorithm = acme::siphash>
struct randomly_seeded_hash: uhash<HashAlgorithm, size_t, size_t>
{
    randomly_seeded_hash(): uhash(get_random_seed()) {}
};
HowardHinnant commented 7 years ago

Not bad. And it looks like I can derive from tuple<Args...> and uhash<Hasher> would still be an empty class.

pdimov commented 7 years ago

I didn't think of that, was assuming a specialization for the no-args case.

On third thought though, I take back the forwarding constructor, don't want default-constructability in the argumentful case. Better to have two constructors taking Args... and tuple<Args...>, not one with U&&....