Comparing short audio files

galarlo commented 1 year ago

Hi, I'm interested in finding near-duplicate audio files. My dataset is about 3000 thousand short audio files, each between 0.5 and 5 seconds long. Unlike Shazam, both the "target" audio (i.e. the songs in Shazam's case) and the user input are short, and both might contain noise.

Can this library help? If so, are there any recommendations for tuning parameters?

N.B.: it's fine if a file is matched to multiple other files, since I have a less efficient algorithm that can verify which match is correct. In other words, I can tolerate some false positives, but I don't want false negatives.

dest4 commented 1 year ago

Hi, you want to tune the algorithm to have a high density of fingerprints. It will probably work in your use case.

The alternative, in my opinion, is to use cross-correlation on the waveform itself, since the content is fairly short.
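For illustration, here is a minimal sketch of what cross-correlating two waveforms could look like. It is not part of this library: the function name `best_alignment` and the `min_overlap` parameter are hypothetical, and it assumes both clips are already decoded to lists of mono samples at the same sampling rate. It slides one clip over the other and reports the lag with the highest normalized correlation.

```python
import math


def best_alignment(a, b, min_overlap=None):
    """Return (score, lag) for the best alignment of b against a.

    A lag of k means b[0] lines up with a[k]. The score is the normalized
    correlation over the overlapping samples, in the range [-1, 1].
    Hypothetical helper for illustration only.
    """
    if min_overlap is None:
        # Require at least half of the shorter clip to overlap, so that a
        # tiny overlap of one or two samples cannot produce a perfect score.
        min_overlap = max(1, min(len(a), len(b)) // 2)

    best_score, best_lag = 0.0, 0
    for lag in range(-(len(b) - 1), len(a)):
        start = max(0, lag)                 # first overlapping index in a
        stop = min(len(a), lag + len(b))    # one past the last index in a
        if stop - start < min_overlap:
            continue
        seg_a = a[start:stop]
        seg_b = b[start - lag:stop - lag]
        energy = math.sqrt(sum(x * x for x in seg_a) * sum(y * y for y in seg_b))
        if energy == 0.0:
            continue                        # silent overlap, nothing to compare
        score = sum(x * y for x, y in zip(seg_a, seg_b)) / energy
        if score > best_score:
            best_score, best_lag = score, lag
    return best_score, best_lag
```

A pair of files would then count as a candidate match when the score exceeds some threshold, chosen low enough to avoid false negatives. This direct loop costs time proportional to the product of the two clip lengths, so for a large dataset an FFT-based correlation on downsampled audio would likely be needed.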

galarlo commented 1 year ago

@dest4 Thanks for the fast reply! I'm very glad to hear that the algorithm will probably work in my use case :)

How can I tune the density?

adblockradio / stream-audio-fingerprint

Comparing short audio files #27