acoustid / pg_acoustid

PostgreSQL extension for working with AcoustID fingerprints

Robustness of acoustid_compare regarding audio playback speed #8

emkey08 opened 1 year ago

emkey08 commented 1 year ago

I recently discovered that both acoustid_compare2 and acoustid_compare3 fail to match fingerprints if the playback speed of the source audio is off by just a small factor. A minor difference in playback speed isn't unusual for recordings digitized from vinyl.

As an example, I tried to match a particular audio file containing this recording. The following five fingerprint IDs are currently linked to this recording: 17118991, 24689897, 28902363, 30000124, 42081329

I obtained the fingerprint of the audio file using `fpcalc -raw -signed` and compared it to the fingerprints listed above. Here are the results:

    id    | acoustid_compare2 | acoustid_compare3 
----------+-------------------+-------------------
 17118991 |                 0 |        0.26943845
 24689897 |                 0 |        0.19000527
 28902363 |                 0 |         0.1888845
 30000124 |                 0 |        0.20899262
 42081329 |                 0 |        0.21097046

By manually comparing the audio file against the YouTube version of that same recording, I found out that the playback speed of the file was about 102.5 % compared to the YouTube version. Hence, I used Audacity to apply a speed multiplier of 0.975 to the audio file, obtained a new fingerprint with fpcalc, and compared it to the database fingerprints again:

    id    | acoustid_compare2 | acoustid_compare3 
----------+-------------------+-------------------
 17118991 |          0.901921 |        0.90186286
 24689897 |        0.60169214 |         0.6342959
 28902363 |         0.5070134 |        0.58906907
 30000124 |         0.9222046 |         0.9222046
 42081329 |         0.8562764 |         0.8573312

Now the fingerprints match as expected! Given these results, I wonder whether acoustid_compare could be improved to better tolerate minor playback speed differences, which would make fingerprint matching more robust in general.

lalinsky commented 1 year ago

I've never experimented with this, but when designing the fingerprint structure, I explicitly ignored cases like this, since they require the full fingerprint to be considerably bigger. I was focusing on the simple use case of near-identical audio matching.

For example, in the current fingerprints, one item (32-bit hash) covers almost 1.5 s of audio. That's close to unusable if you want to match audio with different playback speeds. I guess the hashes will still have some bits in common, so it could theoretically be possible to get better results than with the current simplistic approach, but at a high performance cost, and the results would not be particularly good.
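
As a rough illustration of the "bits in common" idea, here is a minimal Python sketch (hypothetical; not code from pg_acoustid or Chromaprint) that scores how many bits two aligned raw fingerprints share:

```python
def bit_similarity(fp1: list[int], fp2: list[int]) -> float:
    """Mean fraction of agreeing bits across aligned pairs of 32-bit items.

    Works on raw fingerprints as printed by `fpcalc -raw -signed`; the
    0xFFFFFFFF mask folds the signed values back into unsigned 32 bits.
    """
    n = min(len(fp1), len(fp2))
    if n == 0:
        return 0.0
    agreeing = sum(32 - ((a ^ b) & 0xFFFFFFFF).bit_count()
                   for a, b in zip(fp1, fp2))
    return agreeing / (32 * n)
```

Two unrelated hashes agree on about half their bits by chance, so only scores well above 0.5 indicate shared audio content.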

It would only make sense if you needed to reuse the AcoustID database. Otherwise, it's probably easier to just design a new fingerprint structure. I've had plans for doing that for a very long time, but there is not enough motivation. :) It would basically utilize some of the techniques from SURF and related algorithms.

emkey08 commented 1 year ago

> I explicitly ignored cases like this, since they require the full fingerprint to be considerably bigger

Makes sense. I'd like to point out, though, that only the lookup queries would need to be bigger, not the database fingerprints themselves. As a naive implementation, the lookup query could simply contain an array of fingerprints, all computed from the same audio source but with minor speed adjustments applied prior to fingerprinting. Of course, this would come at a (high) performance cost.
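
As a sketch of that naive multi-fingerprint lookup (illustrative only; it assumes ffmpeg and fpcalc are on the PATH, and the helper name is made up), one could render the source at several speeds and fingerprint each rendering:

```python
import subprocess

def fingerprint_at_speed(path: str, factor: float, rate: int = 44100) -> list[int]:
    """Simulate a playback speed change with ffmpeg, then fingerprint the result.

    asetrate speeds the audio up or down (pitch included, like a turntable
    running fast or slow); aresample converts back to a fixed sample rate.
    """
    wav = f"speed_{factor:.3f}.wav"
    subprocess.run(
        ["ffmpeg", "-y", "-i", path,
         "-af", f"asetrate={rate}*{factor},aresample={rate}", wav],
        check=True, capture_output=True)
    out = subprocess.run(["fpcalc", "-raw", "-signed", wav],
                         check=True, capture_output=True, text=True).stdout
    fp_line = next(l for l in out.splitlines() if l.startswith("FINGERPRINT="))
    return [int(x) for x in fp_line.removeprefix("FINGERPRINT=").split(",")]

# Lookup candidates covering speed deviations of up to about 3 %:
candidates = [fingerprint_at_speed("track.mp3", f)
              for f in (0.97, 0.98, 0.99, 1.00, 1.01, 1.02, 1.03)]
```

Each candidate would then be matched against the database as usual, which multiplies the lookup cost by the number of speed steps.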

Is there an fpcalc argument to adjust the number of audio samples used to compute each of the 32-bit hashes (i.e. the chunk duration)? I guess this would effectively yield the same (or at least a very similar) result as applying a "true" audio playback speed adjustment?

> I guess the hashes will still have some bits in common

I might take a closer look at this and do some more experimenting.

emkey08 commented 1 year ago

After doing some experiments on this, I think it's possible to improve the robustness of acoustid_compare with regard to audio playback speed. I also think it could be done with the existing AcoustID database. Here are my findings:

  1. The hashes generally have many bits in common if the playback speed is just slightly off (e.g. less than 5 %).

  2. The comparison function can be greatly improved by not only considering an audio alignment offset, but also a scaling factor. Generally speaking, we should compare fingerprint hash fp1[i] against fp2[i * factor + offset] (see the sketch after this list). As an example, let's consider this fingerprint comparison, where we compare the fingerprint of an audio file against a speed-adjusted version of that same file:

    [image: compare1]

    After rescaling the first fingerprint vector to the same speed as the second fingerprint vector, we have a decent match:

    [image: compare2]
  3. The matching can be further improved if we ignore certain bits within the rescaled fingerprint whose values aren't clear due to scaling inaccuracies.

  4. Finding the best-matching alignment offset and speed factor can be done quickly and easily with an iterative algorithm. (I can provide my experimental code if that is of any interest.)

  5. For testing purposes, I used Audacity to create multiple speed-adjusted versions of the same audio file (with playback speed factors ranging from 97 % to 103 %) and tried to match those speed-adjusted versions against the original audio file, using different comparison functions:

    • acoustid_compare2 and acoustid_compare3 are the existing implementations.
    • acoustid_compare3_with_scaling is the same as acoustid_compare3, but with fingerprint scaling applied (as outlined in point 2 above).
    • acoustid_compare3_with_scaling_and_matchmask is the same as acoustid_compare3_with_scaling, but with an additional scaling correction applied (as outlined in point 3 above).

    [image: result1]
  6. Comparing two completely different audio files (i.e. testing the "true negative" scenario), both acoustid_compare3_with_scaling and acoustid_compare3_with_scaling_and_matchmask yield results similar to the existing acoustid_compare3 (but further testing is needed):

    [image: result2]
  7. Within my limited testing, both acoustid_compare3_with_scaling and acoustid_compare3_with_scaling_and_matchmask were actually a little faster than the existing acoustid_compare3. This is because they do slightly fewer hash comparisons than acoustid_compare3 when finding the best-matching alignment offset and speed factor.
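
A minimal Python sketch of points 2 to 4 (illustrative only, not the experimental code mentioned in point 4; the neighbour-agreement match mask and the brute-force factor grid, rather than a true iterative search, are simplifying assumptions):

```python
def rescale(fp: list[int], factor: float) -> tuple[list[int], list[int]]:
    """Resample fp by `factor` using nearest-neighbour lookup (point 2).

    Also builds a per-item match mask (point 3): a bit is trusted only
    where the two nearest source items agree, since bits that differ
    between neighbours are ambiguous after rescaling.
    """
    n = int(len(fp) / factor)
    scaled, mask = [], []
    for i in range(n):
        j = int(i * factor)
        a = fp[min(j, len(fp) - 1)]
        b = fp[min(j + 1, len(fp) - 1)]
        scaled.append(a)
        mask.append(~(a ^ b) & 0xFFFFFFFF)
    return scaled, mask

def masked_score(fp1: list[int], fp2: list[int], mask: list[int],
                 offset: int) -> float:
    """Fraction of trusted bits that agree between fp1[i] and fp2[i + offset]."""
    agree = total = 0
    for i, h in enumerate(fp1):
        k = i + offset
        if 0 <= k < len(fp2):
            total += mask[i].bit_count()
            agree += (~(h ^ fp2[k]) & mask[i]).bit_count()
    return agree / total if total else 0.0

def best_match(fp1: list[int], fp2: list[int],
               factors=(0.97, 0.98, 0.99, 1.00, 1.01, 1.02, 1.03),
               max_offset: int = 80) -> tuple[float, float, int]:
    """Search speed factor and alignment offset for the best masked score.

    A plain grid search keeps the sketch short; point 4 describes an
    iterative search that converges on the best factor/offset faster.
    """
    best = (0.0, 1.0, 0)
    for f in factors:
        scaled, mask = rescale(fp1, f)
        for off in range(-max_offset, max_offset + 1):
            s = masked_score(scaled, fp2, mask, off)
            if s > best[0]:
                best = (s, f, off)
    return best  # (score, factor, offset)
```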

To summarize, it seems it would be possible to implement this without changing the existing database or fingerprint structure, but it would certainly require (a lot) more work and testing.