dpwe / audfprint

Landmark-based audio fingerprinting
MIT License

use hash bits more effectively #69

Open db48x opened 5 years ago

db48x commented 5 years ago

Audfprint uses most of the possible hash values, but leaves out a few. Fixing this improves recognition rates while storing the same number of hashes, or allows you to reduce the number of hashes stored while keeping acceptable recognition rates.

Of course, this is not backwards-compatible; queries generated with these fixes won't match very well against databases generated without them. I've not made any changes to the hash table code here, so these get saved with the current version number.

dpwe commented 5 years ago

Any data on how much this helps? It seems like this is going to be single-digit percent in hash utilization, and that doesn't directly translate into improved recognition or discrimination.

Using a non-power-of-2 FFT size might make the FFT much, much slower. I think the default FFT is 2^9 = 512. That plus 2 is 514, whose prime factors are 2 and 257. So it's essentially a non-fast FT over 257 points plus a few tweaks, which might be ~20x slower than a 512-point FFT.

DAn.
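
As a rough way to check that cost on a given machine, here is a minimal timing sketch (assuming numpy is available; the actual ratio depends on the FFT backend, and recent numpy releases use pocketfft, which handles prime lengths with Bluestein's algorithm rather than a plain O(n^2) transform):

```python
# Rough timing comparison: power-of-2 FFT length vs. 514 (= 2 * 257).
# Illustrative only; the real ratio depends on the FFT implementation.
import timeit
import numpy as np

for n in (512, 514):
    x = np.random.randn(n)
    t = timeit.timeit(lambda: np.fft.rfft(x), number=20000)
    print(f"rfft length {n}: {t:.3f} s for 20000 transforms")
```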

Commit Summary (quoted from the pull request, https://github.com/dpwe/audfprint/pull/69):

  • use all bits of dt in the hash
  • use all bits of df in the hash
  • compute two additional fft bins, then discard the bottom and top bin

File Changes

  • M audfprint_analyze.py (19)


db48x commented 5 years ago

Distribution of df values in a database created from a small sample of audio, before and after these fixes:

[image: hash_uniformity_check_df]

Same for dt:

[image: hash_uniformity_check_dt]

And likewise for f:

[image: hash_uniformity_check_f]

You can see that there are still significantly more occurrences of some values than others; improving that even more would likely help.
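
For reference, a check like this can be reproduced by unpacking the stored hashes back into their components. The sketch below assumes a 20-bit hash packed as f (8 bits), df (6 bits), dt (6 bits) with f in the most significant bits; the widths and ordering should be adjusted to match the constants in audfprint_analyze.py:

```python
# Sketch of a hash-component uniformity check.
# The 8/6/6 bit split and ordering are assumptions; match them to
# the constants actually used in audfprint_analyze.py.
import numpy as np

F_BITS, DF_BITS, DT_BITS = 8, 6, 6  # 20 bits total

def unpack(hashes):
    """Split packed 20-bit hashes into their (f, df, dt) components."""
    hashes = np.asarray(hashes, dtype=np.int64)
    dt = hashes & ((1 << DT_BITS) - 1)
    df = (hashes >> DT_BITS) & ((1 << DF_BITS) - 1)
    f = (hashes >> (DT_BITS + DF_BITS)) & ((1 << F_BITS) - 1)
    return f, df, dt

def component_histograms(hashes):
    """Count how often each possible value of f, df and dt occurs."""
    f, df, dt = unpack(hashes)
    return {name: np.bincount(vals, minlength=1 << bits)
            for name, vals, bits in (("f", f, F_BITS),
                                     ("df", df, DF_BITS),
                                     ("dt", dt, DT_BITS))}
```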

db48x commented 5 years ago

Any data on how much this helps? It seems like this is going to be single-digit percent in hash utilization, and that doesn't directly translate into improved recognition or discrimination.

With these changes I've reduced the number of hashes that I need to keep by ~40%.

Using a non-power-of-2 FFT size might make the FFT much, much slower. I think the default FFT is 2^9 = 512. That plus 2 is 514, whose prime factors are 2 and 257. So it's essentially a non-fast FT over 257 points plus a few tweaks, which might be ~20x slower than a 512-point FFT.

I hadn't considered this, and I'm not really measuring CPU performance at all. Performance while creating fingerprints is fine, and querying is slow, but as far as I can tell no slower than before.

db48x commented 5 years ago

The total loss of hash values is 9%. Before the fixes this sample of audio resulted in 895846 unique hashes (out of 2^20 possible hash values); after the fix it results in 983840 unique values. The remaining missing values are due to the dependency between f and df; when f is low, then df is likely to be positive (if f is zero, then df cannot be negative at all). I've been thinking about a fix for that as well, but I've not tried it yet.

dpwe commented 5 years ago

More fully occupying the hash space is nice, but it seems second-order. It certainly doesn't translate into needing fewer hashes to maintain match accuracy. I don't mean to dismiss your improvements, but a 40% reduction in hashes is huge, and I'd really love to understand what it relies on. I can imagine that setting a larger minimum time-difference for hashes might exclude a lot of high-occurrence, low-value hashes; I wonder if that's it?

One problem is that it's very hard to measure "efficiency" here (even ignoring the fact that the existing implementation doesn't gain much from a less-full database). The real performance limit is the false-alarm vs. false-reject tradeoff, but the false-alarm curve is very steep: you get almost none, and then if you drop your --min-count threshold too low, you get lots.

It might be interesting to look at statistics of the number of hashes at each time/frequency offset. I'm not sure how best to illustrate that 3D dataset, but even just looking at a histogram of overall hash frequency vs. overall hash count might be informative.

We could implement a pruning strategy that fills the hash table for a while, then identifies the most popular hashes, then "blacklists" them. Eliminating a small fraction of the hash values could result in removing a large portion of the database, which could in theory save space (for some alternative hash storage mechanism) and will also accelerate matching (because fewer matching hashes have to be tallied). The question is how much it hurts retrieval.

DAn.
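
A minimal sketch of what that blacklisting step might look like; counts_per_hash (per-bucket entry counts pulled out of the hash table) and the 0.1% cutoff are hypothetical placeholders rather than anything in audfprint:

```python
# Sketch of the pruning idea: find the most frequently stored hashes and
# build a blacklist that the matcher could skip before tallying matches.
import numpy as np

def popular_hash_blacklist(counts_per_hash, fraction=0.001):
    """Return the hash values whose bucket counts fall in the top `fraction`."""
    counts = np.asarray(counts_per_hash)
    n_blacklist = max(1, int(len(counts) * fraction))
    # argpartition yields the indices (i.e. the hash values) of the largest counts.
    return set(np.argpartition(counts, -n_blacklist)[-n_blacklist:].tolist())

# At query time, blacklisted hashes would simply be ignored, e.g.:
# query_hashes = [h for h in query_hashes if h not in blacklist]
```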


db48x commented 5 years ago

More fully occupying the hash space is nice, but it seems second-order. It certainly doesn't translate into needing fewer hashes to maintain match accuracy. I don't mean to dismiss your improvements, but a 40% reduction in hashes is huge, and I'd really love to understand what it relies on. I can imagine that setting a larger minimum time-difference for hashes might exclude a lot of high-occurrence, low-value hashes; I wonder if that's it?

It may be that the gains are larger than expected because my audio is very clean. The queries aren't recorded on a cell phone in a nightclub; they come directly from the corpus. We're not looking to duplicate Shazam; we're looking to build a way to find all the television and radio programs that played a given commercial, jingle, soundbite, etc.

That said, I think the dt component has the same minimum value of 2 time hops as before; I simply encode that as a zero and extend the top end of the range to fill out the space in the hash. You can actually see that in the graph.
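
A minimal sketch of that offset encoding, with MIN_DT and DT_BITS as illustrative assumptions rather than the exact constants in the patch:

```python
# Offset-encode dt: the smallest legal dt (MIN_DT hops) maps to code 0,
# so every one of the 2**DT_BITS codes can actually occur.
DT_BITS = 6   # assumed width of the dt field
MIN_DT = 2    # assumed minimum time difference, in analysis hops

def encode_dt(dt):
    """Map dt in [MIN_DT, MIN_DT + 2**DT_BITS - 1] onto the full code range."""
    return (dt - MIN_DT) & ((1 << DT_BITS) - 1)

def decode_dt(code):
    """Recover the time difference from its stored code."""
    return code + MIN_DT
```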

One problem is that it's very hard to measure "efficiency" here (even ignoring the fact that the existing implementation doesn't gain much from a less-full database). The real performance limit is the false-alarm vs. false-reject tradeoff, but the false-alarm curve is very steep: you get almost none, and then if you drop your --min-count threshold too low, you get lots.

Agreed, but as I've said, I've never even seen any false positives. The closest I've seen is matches between very closely related pieces of audio: two versions of a commercial where the start and end are identical, but there's an extra line of dialogue in the middle of one version. It may be that the 40% reduction is overly optimistic, and

Eliminating a small fraction of the hash values could result in removing a large portion of the database, which could in theory save space (for some alternative hash storage mechanism) and will also accelerate matching (because fewer matching hashes have to be tallied).

This is probably true; some hashes occur in more than half of the files in a dataset of 2k files that I've tested with:

[image: hash_occurancy_check]