fake-name / IntraArchiveDeduplicator

Tool for managing data-deduplication within extant compressed archive files, along with a relatively performant BK tree implementation for fuzzy image searching.
BSD 3-Clause "New" or "Revised" License

Tree of uint64 instead of int64 #3

Closed: gpip closed this issue 6 years ago

gpip commented 7 years ago

Hi,

Given a 64-bit descriptor like the following:

descr = "1000000000000000000000000000000000000000000000000000000000000000"
assert len(descr) == 64

it seems there's no way to insert it correctly:

>>> int(descr, 2)
9223372036854775808L
>>> explicitSignCast(int(descr, 2))
0

That is, both 000...000 and 100...000 would be inserted as the number 0, 000...001 and 100...001 as the number 1, and so on. Does this mean the tree as it stands right now should be used with 63-bit descriptors instead?
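
A plain two's-complement reinterpretation of the full 64-bit pattern would keep those two inputs distinct. A minimal sketch of what I mean, not the library's explicitSignCast:

def sign_cast_u64(value):
    # Reinterpret a 64-bit unsigned pattern as a two's-complement int64,
    # so 0b100...0 maps to -2**63 instead of colliding with 0b000...0.
    value &= (1 << 64) - 1
    return value - (1 << 64) if value & (1 << 63) else value

assert sign_cast_u64(int("1" + "0" * 63, 2)) == -(2 ** 63)
assert sign_cast_u64(int("0" * 64, 2)) == 0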

Thank you

gpip commented 7 years ago

Oh, I just noticed that in your tests you're using bitstring.Bits, which would produce -9223372036854775808 for that descr. I guess I should avoid using explicitSignCast in that case?

But then I still get a 0 hamming dist, whereas 1 should be returned:

>>> hamming_dist(0, -9223372036854775808L)
0
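
What I'd expect, sketched with assumed semantics rather than the actual hamDb implementation, is an XOR of the raw 64-bit patterns followed by a popcount:

def hamming_dist_64(a, b):
    # Compare the raw 64-bit bit patterns, regardless of sign.
    mask = (1 << 64) - 1
    return bin((a & mask) ^ (b & mask)).count("1")

assert hamming_dist_64(0, -9223372036854775808) == 1   # only the sign bit differs
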
fake-name commented 7 years ago

Ahhh sign conversion issues, my old friend.

I'm kind of stuck with int64_t as my base datatype, as it's the biggest integral type I can stuff into postgres.

I'm not too sure what's going on with the hamming dist call. I do have some tests for numbers around zero. Lemme put some more test cases together this evening.

gpip commented 7 years ago

Yes, why not :)

I did another quick test here: outside Vagrant, that hamming_dist call returns 1 as expected, but inside Vagrant it returns 0. Uh..

fake-name commented 7 years ago

Whooo! Platform-specific issues! Huzzah! And not even different CPUs, just virtualized vs. not virtualized.

I'd be interested in whether python3 -m nose Tests.Test_BKTree and python3 -m nose Tests.Test_BKTree_2 pass. There are some unit tests in there that try to expose any sign-conversion issues.

gpip commented 7 years ago

On my non-virtual machine all the tests pass after I change the bitstring usage to hamDb.explicitSignCast(int(binaryStringIn, 2)) (otherwise all the tests fail with ValueError: Hashes must be an integer! Passed value '0', type '<type 'long'>').
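
For clarity, the b2i helper appearing in the failing tests below is assumed to be exactly that substitution:

def b2i(binary_string_in):
    # Parse the 64-character binary string and sign-cast it to an int64.
    return hamDb.explicitSignCast(int(binary_string_in, 2))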

Under Vagrant several tests fail, e.g.:

self = <test_bktree_2.TestSequenceFunctions_TallTree testMethod=test_3>

    def test_3(self):
        tgtHash = "1000000000000000000000000000000000000000000000000000000000000000"
        tgtHash = b2i(tgtHash)
        ret = self.tree.getWithinDistance(tgtHash, 0)
>       self.assertEqual(ret, set((64, )))
E       AssertionError: Items in the first set but not the second:
E       0

test/test_bktree_2.py:232: AssertionError
self = <test_bktree.TestSequenceFunctions testMethod=test_distance_f>

    def test_distance_f(self):

        v1 = b2i("0000000000000000000000000000000000000000000000000000000000000000")
        v2 = b2i("1111111111111111111111111111111111111111111111111111111111111111")
        v3 = b2i("0000000000000000000000000000000000000001111111111111111000000000")
        v4 = b2i("1000000000000000000000000000000000000000000000000000000000000000")
        v5 = b2i("0000000000000000000000000000000000000000000000000000000000000001")
        v6 = b2i("1100000000000000000000000000000000000000000000000000000000000000")
>       self.assertEqual(hamDb.f_hamming_dist(v1, v2), 64)
E       AssertionError: 45 != 64

test/test_bktree.py:156: AssertionError

(it's very likely the line numbers don't match exactly; I copied your tests to https://github.com/gpip/cBKTree/tree/master/test)

EDIT: in the broken box

x = int('1111111111111111111111111111111111111111111111111111111111111111', 2)
assert x == 18446744073709551615   # OK
res = explicitSignCast(x)
assert res == -1   # FAILS, res is 70368207306751

EDIT 2: still in the broken box: numpy.array([18446744073709551615L], numpy.uint64).astype(numpy.int64) == array([-1]) # OK
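
Expanding that cross-check (an assumed comparison of numpy's reinterpretation against a pure-Python one, not code from the repo):

import numpy

x = int("1" * 64, 2)   # 18446744073709551615
via_numpy = int(numpy.array([x], numpy.uint64).astype(numpy.int64)[0])
via_python = x - (1 << 64) if x & (1 << 63) else x
assert via_numpy == via_python == -1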

fake-name commented 7 years ago

Is it possible the vagrant install is 32-bit?

I don't think this would work at all on a 32-bit system. I don't see anything that assumes a machine word size, but I'm having a go at converting everything to unsigned integers anyway. That removes some of the possible sign-conversion issues.
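
A quick way to check the interpreter's word size from inside the VM (a hypothetical check, not a command from this thread):

import struct
import sys

print(struct.calcsize("P") * 8)   # pointer size in bits: 32 or 64
print(sys.maxsize > 2 ** 32)      # True on a 64-bit Python build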

gpip commented 7 years ago

It's using the wheezy64 image, uname -m returns "x86_64", uname -a returns "Linux wheezy64 3.2.0-4-amd64 #1 SMP Debian 3.2.73-2+deb7u3 x86_64 GNU/Linux". Didn't see any warnings during the build, tried removing all the flags, and also built with debug flags, but got the same issue.

fake-name commented 7 years ago

Well WTH.

Arrrgh, converting to uints is going to be a prohibitive PITA, since I'm kinda locked into signed ints by my database backend.

Installing vagrant now.

fake-name commented 7 years ago

Oh, what CPU are you running?

One thing I can think of is that I'm using __builtin_popcountll(x) for counting bits in the distance bitfield. I'm not sure what its support is like on AMD processors (I only have Intel servers; AMD servers are too expensive in terms of power).

gpip commented 7 years ago

Inside vagrant it reports the same CPU I have, a typical Intel(R) Core(TM) i7-4770HQ in a MacBook (which has no issue running this outside the VM). Maybe I'm tired, but I can't see what's causing this. FWIW, I ran this on an AWS box and got no issues, so maybe it's not worth worrying about.

The issue is not with __builtin_popcountll; the very simple explicitSignCast is what's causing these issues.

gpip commented 7 years ago

I think the way to go about it is to 1) not rely on explicitSignCast and use bitstring.Bits(bin=...).int instead, and 2) accept (int, long) for nodeHash for Python 2 compatibility.
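
Roughly like this (a sketch of the proposed substitution, assuming the bitstring package):

import bitstring

descr = "1000000000000000000000000000000000000000000000000000000000000000"
node_hash = bitstring.Bits(bin=descr).int   # two's-complement read: -9223372036854775808
assert node_hash == -9223372036854775808
# ...and for 2), the nodeHash type check would accept both int and long under Python 2.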

fake-name commented 7 years ago

I assume that this codebase is reliable. If the sign-casting mechanic is behaving oddly, I want to know why. I'll keep poking around.

If the problem is limited to explicitSignCast, I think that shouldn't be too big of an issue, as it only exists in the first place for testing purposes (the only use of it in the IntraArchiveDeduplicator code-base, at least, is in the tests). I do assume that the underlying signed storage mechanism uses 2's complement, but I think that's kind of unavoidable.


Python 2 compatibility is something I don't care about even a little bit. It's been 8 years, it's time to upgrade.

fake-name commented 7 years ago

Ok, I kind of suspect this is a python 3.2 issue. My testing targets are 3.4 and 3.5, and I've more or less moved away from systems with 3.4 entirely.

Why are you using a system with such an ancient version of python?


Ok, so wheezy doesn't ship a sane version of python. I built python 3.5 on wheezy, and it passes most of the unit tests. It only fails on one of the archive scan tests, because the version of libmagic on wheezy is out-of-date enough that it doesn't support the mime argument. I patched that one case, and they all pass now.


I'm legitimately curious how you even got things to install on wheezy. I can't even install pip on it because apparently it doesn't support python 3.2 anymore.


It passes with 3.4 on wheezy too.

gpip commented 7 years ago

I ran these using Python 2.7, but let me test the latest patch release of 2.7 (because the box ships with a very old one).

gpip commented 7 years ago

It works with Python 2.7.10 and 2.7.13 (the only ones I tested); it fails with Python 2.7.3 inside Vagrant but works with Python 2.7.3 outside Vagrant. It's weird to me, because numpy is able to do that conversion, so maybe it's a cython bug.

fake-name commented 7 years ago

That sounds reasonable to me.

How are you installing cython? Can you install a more recent version with a non-platform-sourced pip?

(Sorry, I'm basically unfamiliar with debian, and I don't know if they do something similar to how ubuntu has a custom version of pip which fetches packages from ubuntu-specific repositories).

fake-name commented 7 years ago

@gpip - If you're using this as a library, I'd strongly suggest you update if you need any concurrency support.

I discovered an issue where I had misunderstood the relationship between python threads and OS threads, which could potentially lead to lock-loss or lock-wedging.

Basically, I was assuming that there was a direct mapping between python threads and OS threads (supported by the fact that python seems to create an additional OS thread for each python thread you instantiate).

Apparently this is not true, and python actually swaps execution between separate threads within the same OS pid. This means that my prior use of the pthread rwlock primitive could lead to situations where a lock can be repeatedly acquired or released from a single OS thread.

In any event, I reverted to an earlier, albeit theoretically less performant, python rwlock implementation that handles the python threading behaviour correctly.
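
For reference, the general shape of a pure-python reader-writer lock built on ordinary threading locks is something like the following (an illustrative sketch, not the exact implementation used here). threading.Lock has no owner check, so a release doesn't have to happen on the thread that acquired it:

import threading

class ReadWriteLock(object):
    def __init__(self):
        self._readers = 0
        self._counter_lock = threading.Lock()   # guards the reader count
        self._write_lock = threading.Lock()     # held while any readers or a writer are active

    def acquire_read(self):
        with self._counter_lock:
            self._readers += 1
            if self._readers == 1:
                self._write_lock.acquire()

    def release_read(self):
        with self._counter_lock:
            self._readers -= 1
            if self._readers == 0:
                self._write_lock.release()

    def acquire_write(self):
        self._write_lock.acquire()

    def release_write(self):
        self._write_lock.release()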