antlarr / bard

Bard Music Manager - A database to manage your music, find duplicates and fix tags

Fingerprint matching #1

Open lnicola opened 6 years ago

lnicola commented 6 years ago

Hi, I've stumbled upon your blog post and skimmed the code a little. If I'm not mistaken, you've been optimizing the comparison between two fingerprints. In practice, however, you'll be comparing on the order of N^2 pairs of fingerprints, not just two, so the total work still grows with the square of the number of songs (10 times more songs will take 100 times longer).

So I propose another approach, in the form of a memory-time-complexity trade-off: insert all the fingerprints into a single data structure that supports approximate matching, then run one query per song against it instead of comparing every pair.

Choosing and implementing a data structure is the hard part, of course. One that might work is a Levenshtein automaton. These have the advantage of naturally supporting insertions and deletions, so you wouldn't have to insert 200 fingerprints for each song. I never tried to implement one, but this post gives a nice introduction.

One limitation, however, is that the way these are usually built, they allow insertions both at the ends and in the middle. For fingerprint comparisons you'd probably want to allow insertions and deletions only at the ends, and only changes in the middle. But you'd probably have to implement it yourself in that case. An alternative is to build a DFA that only allows substitutions, and also insert the shifted fingerprints into it.
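
For illustration, here is a minimal sketch of that last variant, with every name made up by me rather than taken from bard: a trie over the fingerprint's int sequence in which each int is one symbol, searched depth-first with a substitution budget. The shifted copies of each fingerprint would be inserted too, as described above, to cover alignment at the ends.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <vector>

struct TrieNode {
    std::map<uint32_t, std::unique_ptr<TrieNode>> children;
    int songId = -1;  // >= 0 on nodes that end a stored fingerprint
};

struct FingerprintTrie {
    TrieNode root;

    void insert(const std::vector<uint32_t>& fp, int songId) {
        TrieNode* node = &root;
        for (uint32_t v : fp) {
            auto& child = node->children[v];
            if (!child)
                child = std::make_unique<TrieNode>();
            node = child.get();
        }
        node->songId = songId;
    }

    // Depth-first search that only allows substitutions: at each
    // position we may follow a non-matching edge by spending budget.
    void search(const TrieNode* node, const std::vector<uint32_t>& fp,
                size_t pos, int budget, std::vector<int>& hits) const {
        if (pos == fp.size()) {
            if (node->songId >= 0)
                hits.push_back(node->songId);
            return;
        }
        for (const auto& [value, child] : node->children) {
            int cost = (value == fp[pos]) ? 0 : 1;
            if (budget >= cost)
                search(child.get(), fp, pos + 1, budget - cost, hits);
        }
    }

    std::vector<int> query(const std::vector<uint32_t>& fp, int maxSubst) const {
        std::vector<int> hits;
        search(&root, fp, 0, maxSubst, hits);
        return hits;
    }
};
```

With a small budget the search prunes most of the trie, but a budget large enough to express something like a 55% match makes it degenerate toward visiting everything, which is the blow-up mentioned in the EDIT below.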

Anyway, I don't know if you'll ever get to try it, but it might be worth keeping in mind.

EDIT: After thinking a little more about it, you'd want this to accept large edit distances (you mention a 55% match in the blog post). That would probably blow up the size of the automaton, so this approach might not work for you.

antlarr commented 6 years ago

That's a very interesting approach. I'll for sure read more about Levenshtein automatons and see if they can help improve the comparisons.

About the 55% match level I mention in the blog post: yes, I've set 55% as a useful cut-off threshold because I can be sure anything below that is not a match. But not everything above it is a match either. What I'm doing is storing the values of all matches above 55% in the database, so I can run fast queries afterwards for "songs having a >90% match with song X" or "songs having a >70% match with song Y" using plain SQL queries.

Thanks for the suggestion!

lnicola commented 6 years ago

> That's a very interesting approach. I'll for sure read more about Levenshtein automatons and see if they can help improve the comparisons.

Yeah, I don't think they work out too well with edit distances larger than e.g. 5 or so.

> What I'm doing is storing the values of all matches above 55% in the database

Ah, okay, then you still need the pairwise comparison.

lnicola commented 6 years ago

Another idea: on x64, use int64_t, as it should be faster. There's also int_fast32_t, which is usually machine-word-sized and would let you avoid conditional compilation, but I don't know how well that plays with __builtin_popcount / __builtin_popcountl.
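
As a sketch of that idea (illustrative only, not code from bard): combine two adjacent 32-bit values into one uint64_t (unsigned, to keep the shifts well-defined) and count differing bits with __builtin_popcountll, the unsigned long long variant. This is also where int_fast32_t gets awkward: its width is implementation-defined, so you'd have to pick the matching __builtin_popcount variant anyway.

```cpp
#include <cstddef>
#include <cstdint>

// Bit-level Hamming distance between two fingerprints of n 32-bit
// values each, processed two values at a time as one 64-bit word.
size_t hammingDistance64(const uint32_t* a, const uint32_t* b, size_t n) {
    size_t dist = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        uint64_t wa = (uint64_t(a[i]) << 32) | a[i + 1];
        uint64_t wb = (uint64_t(b[i]) << 32) | b[i + 1];
        dist += __builtin_popcountll(wa ^ wb);
    }
    if (i < n)  // odd count: one 32-bit value left over
        dist += __builtin_popcount(a[i] ^ b[i]);
    return dist;
}
```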

Just curious, how long is one of those fingerprints? I noticed that you seem to offset the fingerprints by entire ints instead of bits.

antlarr commented 5 years ago

Usually each fingerprint has 948 integer values. It can have fewer if the song is shorter than 2 minutes, and it shouldn't be longer than that, since I only calculate the fingerprint for the first two minutes of the song (though it could be longer if that limit is changed in the future). And yes, the fingerprints are offset by entire ints because of the way they are calculated. Put another way, you can interpret each int as the characteristics of a small section of the song, so it's not actually possible to offset the fingerprints by just some bits.
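
To illustrate the int-granularity alignment (a sketch with hypothetical names, not bard's code): slide one fingerprint over the other one whole int at a time and keep the best bit-level similarity over the overlapping region.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Best similarity between a and b over whole-int offsets in
// [-maxOffset, maxOffset]; a[i] is matched against b[i - off].
double bestOffsetSimilarity(const std::vector<uint32_t>& a,
                            const std::vector<uint32_t>& b,
                            int maxOffset) {
    double best = 0.0;
    for (int off = -maxOffset; off <= maxOffset; ++off) {
        int lo = std::max(0, off);
        int hi = std::min(int(a.size()), int(b.size()) + off);
        if (hi <= lo)
            continue;  // no overlap at this offset
        size_t diffBits = 0;
        for (int i = lo; i < hi; ++i)
            diffBits += __builtin_popcount(a[i] ^ b[i - off]);
        // 32 bits per int in the overlapping region.
        double similarity = 1.0 - double(diffBits) / (double(hi - lo) * 32.0);
        best = std::max(best, similarity);
    }
    return best;
}
```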

When I find some time, I'll try to join each pair of ints and pass the result as an int64_t to __builtin_popcountll, which should be a bit faster. Thanks for the idea.