flier / pyfasthash

Python Non-cryptographic Hash Library
Apache License 2.0

[question] Different values on other implementations (murmur3 32 bits) #24

Closed thor27 closed 6 years ago

thor27 commented 6 years ago

Hi!

I'm seeing different values when using other murmur libs in python, take a look:

In [1]: import pyhash

In [2]: import mmh3 # https://github.com/hajimes/mmh3

In [3]: import pymmh3 # https://github.com/wc-duck/pymmh3

In [4]: pyhash.murmur3_32()('foo')
Out[4]: 2085578581

In [5]: mmh3.hash('foo', signed=False)
Out[5]: 4138058784

In [10]: pymmh3.hash('foo') + 2**32
Out[10]: 4138058784

I've also tried this online tool: http://murmurhash.shorelabs.com/ and it gives the same hash value, 4138058784. Why do the values from pyhash differ from the other implementations? Is it possible to get the same result?

Thanks!
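The `+ 2**32` in the pymmh3 call above converts a negative signed 32-bit result into its unsigned form. A minimal sketch of that conversion follows; the helper names `to_unsigned32` and `to_signed32` are hypothetical and belong to none of the libraries mentioned. Masking is used instead of adding `2**32` because it also leaves non-negative values unchanged.

```python
def to_unsigned32(value: int) -> int:
    """Reinterpret a signed 32-bit integer as unsigned (hypothetical helper)."""
    return value & 0xFFFFFFFF


def to_signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit integer as signed (hypothetical helper)."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


# -156908512 is the signed value implied by the thread: 4138058784 - 2**32.
print(to_unsigned32(-156908512))   # 4138058784
print(to_signed32(4138058784))     # -156908512
```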

flier commented 6 years ago

It seems you are using Python 3.x?

I have reproduced the issue in Python 2.7: the unicode string is hashed to 2085578581L.

In [1]: import pyhash

In [2]: pyhash.murmur3_32()('foo')
Out[2]: 4138058784L

In [3]: pyhash.murmur3_32()(u'foo')
Out[3]: 2085578581L

flier commented 6 years ago

Please try the latest git commit; the results should now be consistent between Python 2.x and 3.x.

thor27 commented 6 years ago

Yes, I was using Python 3. I will update and test here. Thanks.

thor27 commented 6 years ago

Hi, the number syntax with a trailing L does not work on Python 3:

In [1]: import pyhash
Traceback (most recent call last):

  File "/home/thomaz/projetos/thumbor/vpython-pyfasthash/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2963, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)

  File "<ipython-input-1-6937f29f5d7b>", line 1, in <module>
    import pyhash

  File "/home/thomaz/projetos/thumbor/pyfasthash/pyhash.py", line 89
    bytes_hash=3698262380L,
                         ^
SyntaxError: invalid syntax

thor27 commented 6 years ago

I tried simply removing every "L" suffix. It now imports fine, but the results still differ:

In [1]: import pyhash

In [2]: pyhash.murmur3_32()('foo'.encode('ascii'))
Out[2]: 4138058784

In [3]: pyhash.murmur3_32()('foo')
Out[3]: 2085578581

flier commented 6 years ago

Sure. Once you call 'foo'.encode('ascii') or use b'foo', the value is a bytes string in Python 3.x (or a str in Python 2.x), so its hash value (4138058784) should differ from that of a unicode string (2085578581).

    # https://github.com/flier/pyfasthash/issues/24
    def testDefaultStringType(self):
        hasher = murmur3_32()

        self.assertEqual(hasher('foo'), hasher(u'foo'))
        self.assertNotEqual(hasher('foo'), hasher(b'foo'))

Besides, the L suffix is removed automatically by the 2to3 conversion tool during the setup steps.
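For reference, the bytes-string value quoted in this thread can be reproduced without any of the libraries involved. The following is a minimal pure-Python sketch of the public MurmurHash3 x86 32-bit algorithm, not pyfasthash's own code; the function name `murmur3_32_reference` and the choice of seed 0 are assumptions made for illustration.

```python
def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF


def murmur3_32_reference(data: bytes, seed: int = 0) -> int:
    """Unsigned MurmurHash3 x86 32-bit of a bytes string (illustrative sketch)."""
    c1, c2, mask = 0xCC9E2D51, 0x1B873593, 0xFFFFFFFF
    h = seed & mask
    nblocks = len(data) // 4

    # Body: mix each complete 4-byte little-endian block into the state.
    for i in range(nblocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & mask
        k = _rotl32(k, 15)
        k = (k * c2) & mask
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & mask

    # Tail: mix the remaining 1 to 3 bytes, if any.
    tail = data[4 * nblocks:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & mask
        k = _rotl32(k, 15)
        k = (k * c2) & mask
        h ^= k

    # Finalization: fold in the length and avalanche the bits.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & mask
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & mask
    h ^= h >> 16
    return h


print(murmur3_32_reference(b"foo"))                 # 4138058784
print(murmur3_32_reference("foo".encode("utf-8")))  # 4138058784
```

Because this sketch accepts only bytes, a text string has to be encoded explicitly before hashing, which is the step that makes the result match the bytes-string value reported above.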

thor27 commented 6 years ago

Hi! That's OK, I understand now, but is having different values for strings and bytes the desired behaviour? It is at least very error-prone. It would be good to have at least a note in the README about this. For example, because of this I indexed a 35 TB data structure (file system based) incorrectly and had to reindex everything (a 3-day process). Anyway, it's working as desired in my codebase now that I understand the issue. Thanks a lot for the support!
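One way to guard an index against this kind of mix-up is to normalise every key to bytes before it reaches the hash function. The sketch below assumes UTF-8 is the agreed encoding; `as_hash_input` and `stable_hash` are hypothetical helpers, and `zlib.crc32` is only a standard-library stand-in so that the example runs without pyhash installed.

```python
import zlib


def as_hash_input(key, encoding="utf-8"):
    """Return key as bytes, encoding text explicitly (hypothetical helper)."""
    if isinstance(key, bytes):
        return key
    if isinstance(key, str):
        return key.encode(encoding)
    raise TypeError("expected str or bytes, got %s" % type(key).__name__)


def stable_hash(key, hasher=zlib.crc32):
    """Hash a key so that 'foo' and b'foo' always produce the same value.

    `hasher` is any callable that takes bytes and returns an int;
    zlib.crc32 is a stand-in used here for illustration only.
    """
    return hasher(as_hash_input(key))


print(stable_hash("foo") == stable_hash(b"foo"))  # True
```

Any bytes-to-int callable can be passed as `hasher`, so the same wrapper works with whichever hash function the index is built on.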

flier commented 6 years ago

In Python, str and unicode are totally different types, so the hash values definitely differ. Please check the discussion for more details :)

Sorry about the wasted time, I will update the README soon, thanks.