fake-name / IntraArchiveDeduplicator

Tool for managing data-deduplication within extant compressed archive files, along with a relatively performant BK tree implementation for fuzzy image searching.
BSD 3-Clause "New" or "Revised" License
96 stars 12 forks source link

getWithinDistance returns [0] in some cases #2

Closed gpip closed 6 years ago

gpip commented 7 years ago

Hello, I've been using your c++ bktree implementation and just repackaged it as a python package (https://github.com/gpip/cBKTree). One thing I noticed is that in some cases getWithinDistance returns [0] as the result, which seems to be some sort of sentinel node. This happens both when there are no matches within a distance or when there are, but it doesn't happen on every situation.

Here's an example where that happens:

from cbktree import BkHammingTree, explicitSignCast

DATA = {
    # Format: id -> bitstring
    1: '1011010010010110110111111000001000001000100011110001010110111011',
    2: '1011010010010110110111111000001000000001100011110001010110111011',
    3: '1101011110100100001011001101001110010011100010011101001000110101',
}

SEARCH_DIST = 2  # 2 out of 64 bits

int_bits = lambda b: explicitSignCast(int(b, 2))

tree = BkHammingTree()
descriptor = {}
for node_id, bits in DATA.items():
    ib = int_bits(bits)
    descriptor[node_id] = ib
    tree.insert(ib, node_id)

# Find near matches for each node that was inserted.
for node_id, ib in descriptor.items():
    res = tree.getWithinDistance(ib, SEARCH_DIST)
    print("{}: {}".format(node_id, res))

# Find near matches for items that were not inserted.

new = '1101011110100100001011001101001110010011100010011101001000110101'
print("new: {}".format(tree.getWithinDistance(int_bits(new), SEARCH_DIST)))

ones = '1' * 64
print("111..: {}".format(tree.getWithinDistance(int_bits(ones), SEARCH_DIST)))

# XXX Should return empty, returns [0] instead.
zeroes = '0' * 64
print("000..: {}".format(tree.getWithinDistance(int_bits(zeroes), SEARCH_DIST)))

My understanding is that the last call should return an empty set, just like the call before that one does. Have you hit this before?

fake-name commented 7 years ago

Huh. I haven't seen that particular issue, but it definitely sounds like a real problem.

I'll poke about and see if I can figure out what's going on.

fake-name commented 7 years ago

Well, a bunch of the unit tests are failing, but that looks like it's because the file library decided to start decoding their return values (they were originally bytestrings, at one point, now they're unicode).

Not the problem, but it's definitely something. I'm writing a test case for the problem you described.

fake-name commented 7 years ago

Alright, I think this should be fixed. The issue was how I was stubbing an empty tree (I was adding a key with id 0 and phash 0).

I've changed it to actually not create the tree root until you insert something. I also added a test-case for this particular issue.

Thanks! It's awesome someone is getting use out of my silly project.

fake-name commented 7 years ago

Did this resolve the issue you were seeing?

gpip commented 7 years ago

Hi, I was traveling and didn't have the chance to test it. Doing it now, thank you for patching it so fast.

gpip commented 7 years ago

Looks perfect 👍

fake-name commented 7 years ago

Sweet!

If I can ask, what are you using the BKTree for? Are you doing fuzzy image searching too?

gpip commented 7 years ago

That's right, I'm using a simple perceptual hash as the descriptor (pretty much the same as the phash you have here, except you're using the mean whereas I'm using median) which is stored in the tree and then it's used as an initial attempt for finding near duplicates. Before this I was doing some tests using simple pairwise distance to check whether that descriptor could be useful or not for the situation I have.

Since I don't start with a "clean" dataset, i.e. there's no guarantee that duplicates aren't already there, after inserting the initial data I run getWithinDistance for each descriptor that was inserted which sort of mimics the pairwise distance but is much faster using this metric tree. After that initial step, as new data comes in it gets checked against a set of nodes to check for duplicates.

Are you actively using this project?

fake-name commented 7 years ago

@gpip - Yep. I use it as part of https://github.com/fake-name/MangaCMS. In particular, the rest of this repository functions as an overlay layer on top of Postgres, maintaining a in-memory copy of the PHash values from the database in a tree structure, and intercepting DML and applying them to the tree as well.

It is then used for scanning incoming manga archives, and doing set deduplication to prevent the introduction of duplicates.

I'm not actively developing it (as it seems reasonably stable), but it's definitely in use (and currently occupying ~10.9 GB of ram on my server, for approximately 22798781 data-rows).

I should note that the tree has no problems with duplicate phash-entries, assuming that the entry id are unique. A single leaf node for a particular hash can contain an arbitrarily large number of row ids.

As it is, my data-set certainly has duplicates. My focus is deduplicating files stored in archives, so for an archive to be regarded as a dupe all the contents of the archive have to be duplicated elsewhere. Having a single non-duplicate file results in the archive being marked as unique. As such, the tree needs to be at least capable of storing references to each instance of a common file.

In this use case, the id is simply the primary key for the database row the entry corresponds to, so it's required to be unique. While I don't think duplicate id values would crash anything, I'm not sure what other effects it would have. Allowing them to be duplicated would probably not be a great idea.

gpip commented 7 years ago

That's cool, so far you're using it for a much larger dataset than I am.

Sorry if I confused you, the ids in use are unique for my case as well (by coincidence they are also coming from a postgresql database, but phashes are stored on redis). I meant to say that after inserting, let's say, nodes with ids 1, 2, and 3 I run getWithinDistance against the descriptor collected for node 1, node 2, and node 3 to check for duplicates within these descriptors. I store the dupe information on redis, and then when a new descriptors comes in I first store it in the tree, then run getWithinDistance (which will always return at least the node that was just inserted), and then update the dupe info on redis for that new node.

fake-name commented 7 years ago

Ok, that makes more sense.

If you see any other issues, please, please let me know. The tree code hasn't given me enough issues that I feel confident in it (if it hasn't exploded everywhere, it means it just hasn't been set off yet!), so I have vague suspicions I've been unable to validate.

I'm not that great of a coder, so the fact that I haven't seen my normal level of bugs means I start getting uncomfortable that I'm missing something spectacular.

You might have seen, I have a pile of unit-tests for the tree so I'm always open to adding more test-cases.

gpip commented 7 years ago

One of the reasons I'm using your lib is because of the tests, I tried some other ones but most lacked tests and others didn't work at all.

BTW, do you think it's a bad idea for me to maintain the bktree code at https://github.com/gpip/cbktree? I could add you as a maintainer there, and as you said it's stable so probably nothing needs to be done for now (except for adding the tests to that repo). I don't mind at all transferring the pypi project to you and you can maintain it under a different account (i.e. your github account). I only created that repo as it was easier for me to use only the bktree, and nothing else.

fake-name commented 7 years ago

Honestly, feel free to do whatever's easiest. I realize I haven't added a LICENSE.txt, but I'll stick a bsd license in there later this evening, now that I thought of it.

Realistically, the actual tree code has been basically untouched for the last year (excepting this one recent patch), so it's probably stable enough to just package and forget.

fake-name commented 7 years ago

Never mind - Did it with the web client.

fake-name commented 7 years ago

It's worth noting that I've seen a possible null-pointer dereference somewhere in the tree code since the changes to fix the "return zero" issue.

I'm not sure what's causing it (or if it even is a null pointer dereference), but it somehow causes my server application to crash hard and lock up (without any output, and in a manner that ignores ctrl+c.

So far, I've seen it twice in the last week, the first time I didn't realize what it was, and the second time I didn't have python debug symbols.

When/If it happens next, I should be able to track down the cause since I have the required symbols now. Unfortunately, I don't think I could attach the required symbols to a running, stuck process.

Attaching to the wedged process with gdb (and no symbols) produced the following:

durr@mainnas:~⟫ sudo gdb python3 -p 6737
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python3...(no debugging symbols found)...done.
Attaching to program: /usr/bin/python3, process 6737
Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/ld-2.19.so...done.
done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
0x00007fee28ac27be in ?? ()
(gdb) bt
#0  0x00007fee28ac27be in ?? ()
#1  0x0000000000000000 in ?? ()
(gdb)

The second entry in the stack-trace being null looks extremely suspicious, but the fact that the entire stack is only 2 frames deep is also odd, so it could very well be an overrun as well.

I'll have more the next crash.

gpip commented 7 years ago

I haven't hit that yet. The differences are that I'm using a lot smaller dataset and Python 2. Have you reverted the changes so it might return the sentinel node 0? If you have never hit the issue when that was in place maybe it's ok to revert it.

fake-name commented 7 years ago

Nah, the changes were a completely valid fix, and it does pass the unit-tests consistently.

It also has subsequently been running for the last 4 days without issue, so.... ¯\_(ツ)_/¯

I'll leave the issue open for a while, until it either reoccurs or I haven't seen it in a few weeks. It's possible it was a machine issue.

fake-name commented 7 years ago
durr@mainnas:~⟫ sudo gdb python3 -p 15253
[sudo] password for durr:
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python3...Reading symbols from /usr/lib/debug//usr/bin/python3.4...done.
done.
Attaching to program: /usr/bin/python3, process 15253
Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/ld-2.19.so...done.
done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
0x00007fd8be0ba7be in ?? ()
(gdb) bt
#0  0x00007fd8be0ba7be in ?? ()
#1  0x0000000000000000 in ?? ()
(gdb) py-bt
Traceback (most recent call first):
(gdb)

Well.... huh.

Time to add a bunch of asserts, I guess.