Closed: gpip closed this issue 6 years ago.
Oh, I just noticed that in your tests you're using `bitstring.Bits`, which would produce -9223372036854775808 for that descriptor. I guess I should avoid using `explicitSignCast` in that case?
But then I still get a Hamming distance of 0, whereas 1 should be returned:

```
>>> hamming_dist(0, -9223372036854775808L)
0
```
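For what it's worth, the mismatch disappears if both operands are reduced to the same 64-bit view before XORing. A minimal sketch (`hamming_dist` here is a stand-in for whatever the binding exposes, not the library's actual function):

```python
MASK64 = (1 << 64) - 1  # view both operands as raw 64-bit patterns

def hamming_dist(a, b):
    # Masking makes -9223372036854775808 (signed view) and
    # 9223372036854775808 (unsigned view) compare as the same
    # bit pattern: just the MSB set.
    return bin((a & MASK64) ^ (b & MASK64)).count("1")

print(hamming_dist(0, -9223372036854775808))  # 1, as expected
```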
Ahhh sign conversion issues, my old friend.
I'm kind of stuck with `int64_t` as my base datatype, as it's the biggest integral type I can stuff into postgres.
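If the storage column is a signed BIGINT, one way to keep unsigned semantics in Python while still handing postgres an `int64_t` is to reinterpret the bit pattern at the storage boundary. A sketch (helper names are mine, assuming hashes fit in 64 bits):

```python
import struct

def to_signed64(u):
    # Reinterpret an unsigned 64-bit value as int64_t for storage.
    return struct.unpack("<q", struct.pack("<Q", u))[0]

def to_unsigned64(s):
    # Inverse: recover the unsigned view after reading it back.
    return struct.unpack("<Q", struct.pack("<q", s))[0]

h = 0xFFFFFFFFFFFFFFFF
assert to_signed64(h) == -1
assert to_unsigned64(to_signed64(h)) == h
```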
I'm not too sure what's going on with the hamming dist call. I do have some tests for numbers around zero. Lemme put some more test cases together this evening.
Yes, why not :)
I did another quick test here: outside Vagrant that `hamming_dist` call returns 1 as expected, but inside Vagrant it returns 0. Uh..
Whooo! Platform specific issues! Huzzah! And not even different CPUs, just virtualized vs not virtualized.
I'd be interested in whether `python3 -m nose Tests.Test_BKTree` and `python3 -m nose Tests.Test_BKTree_2` pass. There are some unit tests in there that try to expose any sign-conversion issues.
On my non-virtual machine all the tests pass after I change the bitstring usage to `hamDb.explicitSignCast(int(binaryStringIn, 2))` (otherwise all the tests fail with `ValueError: Hashes must be an integer! Passed value '0', type '<type 'long'>'`).
Under Vagrant several tests fail, e.g.:

```
self = <test_bktree_2.TestSequenceFunctions_TallTree testMethod=test_3>

    def test_3(self):
        tgtHash = "1000000000000000000000000000000000000000000000000000000000000000"
        tgtHash = b2i(tgtHash)
        ret = self.tree.getWithinDistance(tgtHash, 0)
>       self.assertEqual(ret, set((64, )))
E       AssertionError: Items in the first set but not the second:
E       0

test/test_bktree_2.py:232: AssertionError
```
```
self = <test_bktree.TestSequenceFunctions testMethod=test_distance_f>

    def test_distance_f(self):
        v1 = b2i("0000000000000000000000000000000000000000000000000000000000000000")
        v2 = b2i("1111111111111111111111111111111111111111111111111111111111111111")
        v3 = b2i("0000000000000000000000000000000000000001111111111111111000000000")
        v4 = b2i("1000000000000000000000000000000000000000000000000000000000000000")
        v5 = b2i("0000000000000000000000000000000000000000000000000000000000000001")
        v6 = b2i("1100000000000000000000000000000000000000000000000000000000000000")
>       self.assertEqual(hamDb.f_hamming_dist(v1, v2), 64)
E       AssertionError: 45 != 64

test/test_bktree.py:156: AssertionError
```
(it's very likely the lines don't match exactly, I copied your tests to https://github.com/gpip/cBKTree/tree/master/test)
EDIT: in the broken box:

```python
x = int('1111111111111111111111111111111111111111111111111111111111111111', 2)
assert x == 18446744073709551615  # OK
res = hamDb.explicitSignCast(x)
assert res == -1  # FAILS, res is 70368207306751
```
EDIT 2: still in the broken box:

```python
numpy.array([18446744073709551615L], numpy.uint64).astype(numpy.int64) == array([-1])  # OK
```
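The numpy cast above can be reproduced in pure Python, which might help pin down where the Vagrant box diverges. A hypothetical equivalent (`sign_cast64` is my name, not the library's `explicitSignCast`):

```python
def sign_cast64(x):
    # Two's-complement reinterpretation of the low 64 bits,
    # matching numpy's uint64 -> int64 astype.
    x &= (1 << 64) - 1
    return x - (1 << 64) if x >= (1 << 63) else x

assert sign_cast64(18446744073709551615) == -1
assert sign_cast64(1 << 63) == -(1 << 63)
```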
Is it possible the vagrant install is 32 bit?
I don't think this would work at all on a 32-bit system. I don't see anything that assumes a machine word size, but I'm having a go at converting everything to unsigned integers anyway. That removes some of the possible sign-conversion issues.
It's using the `wheezy64` image; `uname -m` returns "x86_64", and `uname -a` returns "Linux wheezy64 3.2.0-4-amd64 #1 SMP Debian 3.2.73-2+deb7u3 x86_64 GNU/Linux". I didn't see any warnings during the build, tried removing all the flags, and also built with debug flags, but got the same issue.
Well WTH.
Arrrgh, converting to `uint`s is going to be a prohibitive PITA, since I'm kinda locked into signed storage by my database backend.
Installing vagrant now.
Oh, what CPU are you running?
One thing I can think of is that I'm using `__builtin_popcountll(x)` for counting bits in the distance bitfield. I'm not sure what its support is on AMD processors (I only have Intel servers; AMD servers are too expensive in terms of power).
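As far as I know `__builtin_popcountll` is compiler-provided and falls back to a software sequence when the POPCNT instruction isn't available, so it shouldn't be CPU-specific. Either way, a build-independent cross-check is easy to add on the Python side (a sketch; `popcount64` is my name):

```python
def popcount64(x):
    # Portable reference implementation to cross-check
    # __builtin_popcountll results coming out of the extension.
    return bin(x & ((1 << 64) - 1)).count("1")

assert popcount64(0) == 0
assert popcount64((1 << 64) - 1) == 64
assert popcount64(1 << 63) == 1
```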
Inside Vagrant it reports the same CPU I have, a typical Intel(R) Core(TM) i7-4770HQ in a MacBook (which has no issue running this outside the VM). Maybe I'm tired; I can't see what's causing this. FWIW, I ran this on an AWS box and got no issues, so maybe it's not worth worrying about.
The issue is not with `__builtin_popcountll`; the very simple `explicitSignCast` is causing these issues.
I think the way to go about it is to 1) not rely on `explicitSignCast`, and use `bitstring.Bits(bin=...).int` instead, and 2) accept `(int, long)` for `nodeHash`, for Python 2 compatibility.
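If the check that raises `ValueError: Hashes must be an integer!` were relaxed, something along these lines would accept any integer hash while still rejecting out-of-range values (a hypothetical helper, not the project's actual validation code; on Python 2 the `int` check would become `(int, long)`):

```python
INT64_MIN, INT64_MAX = -(1 << 63), (1 << 63) - 1

def check_node_hash(node_hash):
    # bool is an int subclass in Python; exclude it explicitly.
    if isinstance(node_hash, bool) or not isinstance(node_hash, int):
        raise ValueError("Hashes must be an integer! Passed %r" % (node_hash,))
    if not (INT64_MIN <= node_hash <= INT64_MAX):
        raise ValueError("Hash out of int64_t range: %r" % (node_hash,))
    return node_hash

assert check_node_hash(-(1 << 63)) == -(1 << 63)
```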
I assume that this codebase is reliable. If the sign-casting mechanic is behaving oddly, I want to know why. I'll keep poking around.
If the problem is limited to explicitSignCast
, I think that shouldn't be too big of an issue, as it only exists in the first place for testing purposes (the only use of it in the IntraArchiveDeduplicator code-base, at least is in the tests). I do assume that the underlying signed storage mechanism uses 2's complement, but I think that's kind of unavoidable.
Python 2 compatibility is something I don't care about even a little bit. It's been 8 years, it's time to upgrade.
Ok, I kind of suspect this is a python 3.2 issue. My testing targets are 3.4 and 3.5, and I've more or less moved away from systems with 3.4 entirely.
Why are you using a system with such an ancient version of python?
Ok, so wheezy is ridiculous and doesn't ship a sane version of python. I built python 3.5 on wheezy, and it passes most of the unit tests. It only failed on one of the archive scan tests, because the version of `libmagick` on wheezy is out-of-date enough that it doesn't support the `mime` argument. I patched that one case, and they all pass now.
I'm legitimately curious how you even got things to install on wheezy. I can't even install `pip` on it because apparently it doesn't support python 3.2 anymore.
It passes with 3.4 on wheezy too.
I ran these using Python 2.7, but let me test the latest patch release of 2.7 (because the box ships with a very old one).
It works with Python 2.7.10 and 2.7.13 (the only ones I tested), it fails with Python 2.7.3 inside Vagrant but works with Python 2.7.3 outside Vagrant. It's weird to me because numpy is able to do that conversion, so maybe it's a cython bug.
That sounds reasonable to me.
How are you installing cython? Can you install a more recent version with a non-platform-sourced `pip`?
(Sorry, I'm basically unfamiliar with debian, and I don't know if they do something similar to how ubuntu has a custom version of pip which fetches packages from ubuntu-specific repositories).
@gpip - If you're using this as a library, I'd strongly suggest you update if you need any concurrency support.
I discovered an issue where I had misunderstood the relationship between python threads and OS threads, leading to a potential issue where there could be lock-loss or lock-wedging.
Basically, I was assuming that there was a direct mapping between python threads and OS threads (supported by the fact that python seems to create an additional OS thread for each python thread you instantiate).
Apparently this is not true, and python actually swaps execution between separate threads within the same OS pid. This means that my prior use of the pthread `rwlock` primitive could lead to situations where a lock can be repeatedly acquired or released from a single OS thread.
In any event, I reverted to an earlier, albeit less theoretically performant python rwlock implementation that handles the python threading behaviour correctly.
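For reference, this is roughly the shape of a pure-Python reader/writer lock that behaves correctly regardless of how the interpreter schedules its threads, since all state changes happen under one mutex (a minimal sketch, not the implementation the project actually ships):

```python
import threading

class RWLock:
    """Reader/writer lock: many concurrent readers, one exclusive writer."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0      # number of active readers
        self._writer = False   # whether a writer holds the lock

    def acquire_read(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```

Unlike `pthread_rwlock_t`, nothing here assumes that acquire and release happen on the same OS thread.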
Hi,
Given a 64 bit descriptor like the following, it seems there's no way to insert it correctly:
That is, both `000...000` and `100...000` would be inserted as the number 0, `000...001` and `100...001` as the number 1, and so on. Does this mean the tree as it stands right now should be used with 63-bit descriptors instead?

Thank you
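If only the low 63 bits survive insertion, the string-to-integer conversion is likely dropping the top bit. A `b2i` that maps the full 64-bit pattern onto an `int64_t`-sized value keeps `100...0` distinct from `000...0` (a sketch of what such a helper could look like; `b2i` is the name assumed by the tests above):

```python
def b2i(bits):
    # Interpret a 64-char binary string as a signed 64-bit integer,
    # so the top bit becomes the sign bit instead of being lost.
    assert len(bits) == 64
    v = int(bits, 2)
    return v - (1 << 64) if v & (1 << 63) else v

assert b2i("0" * 64) == 0
assert b2i("1" + "0" * 63) == -(1 << 63)  # distinct from 0
assert b2i("0" * 63 + "1") == 1
```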