Closed thor27 closed 6 years ago
It seems you are using Python 3.x?
I have reproduced the issue in Python 2.7: the unicode string is hashed to 2085578581L.
In [1]: import pyhash
In [2]: pyhash.murmur3_32()('foo')
Out[2]: 4138058784L
In [3]: pyhash.murmur3_32()(u'foo')
Out[3]: 2085578581L
Please try the latest git commit; the results should now be aligned between Python 2.x and 3.x.
Yes, I was using Python 3. I will update here and test. Thanks.
Hi, number literals with an L at the end are not valid syntax on Python 3:
In [1]: import pyhash
Traceback (most recent call last):
  File "/home/thomaz/projetos/thumbor/vpython-pyfasthash/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2963, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-1-6937f29f5d7b>", line 1, in <module>
    import pyhash
  File "/home/thomaz/projetos/thumbor/pyfasthash/pyhash.py", line 89
    bytes_hash=3698262380L,
                         ^
SyntaxError: invalid syntax
I've tried just removing all the "L" suffixes and it seems to import fine, but it still doesn't work:
In [1]: import pyhash
In [2]: pyhash.murmur3_32()('foo'.encode('ascii'))
Out[2]: 4138058784
In [3]: pyhash.murmur3_32()('foo')
Out[3]: 2085578581
Sure. After you call 'foo'.encode('ascii') or use b'foo', the value is a bytes string in Python 3.x (a str in Python 2.x), so its hash value (4138058784) should be different from that of a unicode string (2085578581).
# https://github.com/flier/pyfasthash/issues/24
def testDefaultStringType(self):
    hasher = murmur3_32()
    self.assertEqual(hasher('foo'), hasher(u'foo'))
    self.assertNotEqual(hasher('foo'), hasher(b'foo'))
Besides, the L suffix will be removed automatically by the 2to3 conversion tool during the setup steps.
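For what it's worth, the suffix is not needed on either interpreter: Python 2 promotes an integer literal that is too large for a plain int to long automatically, and Python 3 has a single int type. A minimal sketch, reusing the value from the traceback above (the constant name is made up for illustration and is not pyhash's):

```python
# Hypothetical constant name; the value is the one shown in the traceback.
# Written without the L suffix, this line parses on Python 2 and Python 3.
EXPECTED_BYTES_HASH = 3698262380

# The value does not fit in a signed 32-bit integer, which is why the
# Python 2 source carried an explicit long suffix in the first place.
needs_more_than_31_bits = EXPECTED_BYTES_HASH > 2**31 - 1
print(needs_more_than_31_bits)
```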
Hi! That is OK, I understand, but is having different values for string and bytes the desired behaviour? It is at least very error prone. It would be good to have at least a note in the README about this. For example, because of this I indexed a 35 TB data structure (file system based) incorrectly and had to reindex everything (a 3-day process). Anyway, it's working as desired in my codebase now that I understand the issue. Thanks a lot for the support!
In Python, str and unicode are totally different types, so the hash values are definitely different. Please check the discussion for more details :)
Sorry about the wasted time, I will update the README soon, thanks.
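One way to guard against this mix-up on the caller's side is to convert every key to bytes explicitly before it reaches the hasher. The sketch below assumes Python 3 and a UTF-8 default; `to_bytes` and `hash_key` are hypothetical helper names, not part of pyhash, and `zlib.crc32` is only a stand-in hasher so the snippet runs without pyhash installed.

```python
import zlib


def to_bytes(value, encoding="utf-8"):
    """Return value as bytes, encoding text explicitly (hypothetical helper)."""
    if isinstance(value, bytes):
        return value
    if isinstance(value, str):
        return value.encode(encoding)
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)


def hash_key(hasher, value, encoding="utf-8"):
    """Hash value with hasher after normalising it to bytes."""
    return hasher(to_bytes(value, encoding))


# Stand-in hasher for the demo; in real code this would be something like
# pyhash.murmur3_32(), as used earlier in this thread.
stand_in = zlib.crc32

# Text and bytes forms of the same key now hash identically.
print(hash_key(stand_in, "foo") == hash_key(stand_in, b"foo"))
```

With a wrapper like this, the choice of encoding is made once and in one place, so an index built from text keys and one built from bytes keys cannot silently diverge.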
Hi!
I'm seeing different values when using other murmur libs in Python, take a look:
I've also tried this online tool: http://murmurhash.shorelabs.com/ which gives the same 4138058784 hash value. Why do the values from pyhash differ from other implementations? Is it possible to get the same result?
Thanks!
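On the 4138058784 figure: that is what MurmurHash3 (x86, 32-bit variant, seed 0) gives for the three bytes b'foo', which matches Out[2] earlier in this thread. Below is a plain-Python sketch of that public algorithm for cross-checking. It is written from the reference algorithm, not taken from pyhash, the function names are illustrative, and it assumes Python 3.

```python
MASK32 = 0xFFFFFFFF


def _rotl32(x, r):
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_32(data, seed=0):
    """MurmurHash3 x86_32 of a bytes object, returned as an unsigned int."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    nblocks = len(data) // 4

    # Body: mix in each complete 4-byte little-endian block.
    for i in range(nblocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32

    # Tail: mix in the remaining 1 to 3 bytes, if any.
    tail = data[4 * nblocks:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k

    # Finalisation: fold in the length, then avalanche the bits.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


print(murmur3_32(b"foo"))                  # 4138058784
print(murmur3_32("foo".encode("utf-8")))   # 4138058784
```

So to match other implementations, pass bytes (b'foo' or 'foo'.encode('utf-8')) rather than a text string. Which byte representation of a text string leads to 2085578581 is not established in this thread; the sketch only shows that the hash depends on the exact bytes fed in.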