demerphq / smhasher

75 stars 12 forks source link

Trying to test some hashes but need a bit of information on the "state" #17

Open MatthewCushing opened 5 years ago

MatthewCushing commented 5 years ago

So I was able to get a variation of the FNV1a tests working by adding it in which was great. But honestly, I'm not 100% sure I've put it in correctly since I'm not sure what is getting passed in. If we look at one of your implementations:

FNV32a_with_state_test(const void *key, int len, const void *state, void *out)
{
    uint32_t h = *((uint32_t *)state);
    const uint8_t *data = (const uint8_t *)key;

    h ^= BIG_CONSTANT(2166136261);

    for(int I = 0; I < len; I++) {
        h ^= data[I];
        h *= 16777619;
    }

    *(uint32_t *) out = h;
}

The big thing that I'm not understanding is what is going on here:

h ^= BIG_CONSTANT(2166136261);

Am I correct that this is initially a value of 0? Thus XOR'ing it is doing nothing but keeping its original state that we have casted and giving it the constant value here?

This is a version of it we have implemented:

static std::uint32_t GenerateFnv1aHash(const std::string inputString)
        {
            std::uint32_t hash = 2166136261;

            for (char i : inputString)
            {
                hash = 16777619 * (hash ^ i);
            }

            return hash ^ (hash >> 16);
                }

The way I put it into SMHasher is:

FNV32a_My_Implementation_Test(const void *key, int len, const void *state, void *out)
{
    uint32_t h = *((uint32_t *)state);
    const uint8_t *data = (const uint8_t *)key;

    h ^= BIG_CONSTANT(2166136261);

    for(int I = 0; I < len; I++) {
        h ^= data[I];
        h *= 16777619;
    }

    *(uint32_t *) out = h ^ (h >> 16);
}

Or another has we've implemented is Djb2, here's our original implementation:

        static std::uint32_t GenerateDjb2Hash(const std::string inputString)
        {
            std::uint32_t hash = 5381;

            for (char i : inputString)
            {
                hash = 33 * hash ^ static_cast<unsigned char>(i);
            }

            return hash;
        }

And here's how I was implementing it into SMHasher:

Djb2_My_Implementation_Test(const void *key, int len, const void *state, void *out)
{
    uint32_t hash = *((uintew_t *)state);
    const uint8_t *data = (const uint8_t *)key;

    hash ^= 5381;

    for(int i = 0; i < len; ++i)
    {
        hash *= 33;
        hash ^= static_cast<unsigned char>(data[i]);
    }
    *(uint32_t *) out = hash;
}

Am I getting this right or am I getting this completely wrong? I'd apprecitate any help, I'm still in the process of getting familiar with hash functions, the terminology, and how it all comes together.

Thanks!

EDIT: I have a feeling I'm not putting these in right as I am getting some pretty big failures via the SMHasher tests.

For the version of the FNV1a that we used I got - Failed 107 of 195 tests For the version of Djb2 that we used I got - Failed 160 of 195 tests

I'm running your version of FNV1a to see if it's similar but I have a feeling it won't be.

EDIT2: Welp, I accidentally exited out of test running for your version of FNV1a but the amount of time it was taking to implement it vs our version was vastly different. I will run it again tomorrow when I can but I don't have time before I get off work. In the end, I feel as if I am not understanding the arguments correctly that I need to implement into the hash function. It would be great if you could explain what each argument is doing.

From what I understand: key = is the string that is going to be hashed len = the length of the key state = (not sure exactly) out = essentially the return value?

EDIT3: Well it seems the tests are similar but I'd love some confirmation that I'm understanding this correctly.

demerphq commented 5 years ago

Most hashes are seeded, and that seed is transformed into an internal state before hashing. So for instance a hash might take a 128 bit seed, and then transform it into a 256 bit state, and hashing proceeds against the state. Sometimes this operation is expensive. My version of smhasher separates this into two steps. The first takes a seed and sets up the state, and the second uses that state to hash. Consider djb2 hash, it does not take a seed, and initializes its 32-bit state to 5381. If it were seeded then it would allow you to set the state explicitly via a seed. So for instance it might do:

state = seed ^ 5381;

or something similar. Maybe it might take a 16 bit seed and then do something like:

state = ((seed ^ 123) << 16 | (seed ^ 5381));

so that it ensures the state does not start as zero.

out is the return value.