dpwe / audfprint

Landmark-based audio fingerprinting
MIT License

Trying to use a database #29

Closed Laurian closed 5 years ago

Laurian commented 6 years ago

Hi,

I'm looking into delegating some of the scalability issues to a known database, for now MySQL.

I can read fingerprints (with your code) and store them in MySQL (using some dejavu db code); and I can read hash matches back: https://gist.github.com/Laurian/7869355a000c803f26bb434935a367cb#file-test-py

I'm struggling with how to feed those hash matches back into your further processing as I don't quite follow the magic around hashmask, timemask and some of the numpy operations you do.

How would you recommend approaching alternate hashtable implementations?

Laurian commented 6 years ago

More specifically, I can store in a database the data that store(self, name, timehashpairs) is called with; what I'm not sure about is how to plug my database data into the get_hits(self, hashes) function.

Should I skip the hashmask/timemask bits? I understand those were needed because of the current limitations of the hash table.

dpwe commented 5 years ago

In principle, the hash table builds an index that, for each of the 2^20 hash values, stores the tracks that include that hash (and the time frames within those tracks at which the hash occurs). hashtable.store(track_name, time_hash_pairs_in_track) records all the hashes for a particular track (specified by name, but internally represented by an index); hashtable.get_hits(time_hash_pairs_in_query) gathers all the tracks that include each of the hashes present in the query by merging the per-hash lists in the database.

To convert this to a different database, you set up the database to store (track_id, time_frame_in_track) pairs indexed by each hash. Then, on get_hits, you go through each hash, retrieve the list of associated (track_id, time_frame) pairs, then calculate the per-hash per-track data row, which contains (track_id, time_frame_in_track - time_frame_in_query, hash, time_frame_in_track).
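
As a rough illustration of that layout, here is a minimal sketch that uses Python's built-in sqlite3 in place of MySQL so that it runs without a server. The table name, column names, and the helper open_index are hypothetical, and the (time, hash) ordering of each pair is an assumption taken from the name time_hash_pairs; none of this is audfprint's actual code.

```python
import sqlite3


def open_index(path=":memory:"):
    """Create a hash -> (track_id, time_frame) index (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fingerprints ("
        "hash INTEGER NOT NULL, "
        "track_id INTEGER NOT NULL, "
        "time_frame INTEGER NOT NULL)"
    )
    # The index on hash is what makes the per-hash lookup cheap.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_hash ON fingerprints (hash)")
    return conn


def store(conn, track_id, time_hash_pairs):
    """Record every (time_frame, hash) pair of one track, unpacked."""
    conn.executemany(
        "INSERT INTO fingerprints (hash, track_id, time_frame) VALUES (?, ?, ?)",
        [(int(h), int(track_id), int(t)) for t, h in time_hash_pairs],
    )
    conn.commit()


def get_hits(conn, query_time_hash_pairs):
    """Return rows of (track_id, time_in_track - time_in_query, hash, time_in_track)."""
    hits = []
    for query_time, hash_ in query_time_hash_pairs:
        rows = conn.execute(
            "SELECT track_id, time_frame FROM fingerprints WHERE hash = ?",
            (int(hash_),),
        )
        for track_id, track_time in rows:
            hits.append((track_id, track_time - query_time, hash_, track_time))
    return hits
```

Issuing one SELECT per query hash keeps the sketch close to the description above; a real deployment would probably batch the lookups (for example with a WHERE hash IN (...) clause) to cut round trips.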

The second value (the time frame difference) is computed for convenience: it is the value that will be approximately constant across multiple hashes matching from a common piece of audio. The match logic then looks, for each matched track_id, for the time difference shared by the greatest number of hashes.
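
The voting step can be sketched roughly as follows. This is a simplified illustration, not the matcher's actual code: it counts only exactly equal time differences, whereas "approximately constant" suggests a real matcher would tolerate small deviations. The function name best_matches is hypothetical, and the input is assumed to be a list of (track_id, time_difference, hash, time_frame_in_track) rows in the shape described above.

```python
from collections import Counter


def best_matches(hits):
    """For each track, find the time difference shared by the most hits.

    Returns (track_id, time_difference, vote_count) tuples, strongest first.
    """
    # One vote per hit for its (track, time difference) combination.
    votes = Counter((track_id, dt) for track_id, dt, _hash, _time in hits)

    best = {}
    for (track_id, dt), count in votes.items():
        if track_id not in best or count > best[track_id][1]:
            best[track_id] = (dt, count)

    return sorted(
        ((track_id, dt, count) for track_id, (dt, count) in best.items()),
        key=lambda row: row[2],
        reverse=True,
    )
```

A track whose hits are scattered over many different time differences collects only one or two votes per difference, while a genuine match piles its votes onto a single difference, which is why the vote count works as a match score.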

The hashmask/timemask stuff is just to allow me to store the (track_id, time_frame_in_track) pairs in a single int32. If you're going full-database, you probably don't want to mess with that level of optimization, just store the pairs as-is.
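
For readers wondering what that masking amounts to, here is a generic sketch of packing a (track_id, time_frame) pair into one integer. The 14-bit time field is an assumed width chosen for illustration, and the names TIME_BITS, TIME_MASK, pack and unpack are hypothetical; the real hash table's widths and layout may differ.

```python
TIME_BITS = 14                    # assumed width of the time field
TIME_MASK = (1 << TIME_BITS) - 1  # low bits hold the time frame


def pack(track_id, time_frame):
    """Put track_id in the high bits and time_frame in the low bits."""
    return (track_id << TIME_BITS) | (time_frame & TIME_MASK)


def unpack(value):
    """Recover (track_id, time_frame) from a packed integer."""
    return value >> TIME_BITS, value & TIME_MASK
```

Note that any time frame wider than the field wraps around when packed, which is one more reason to store the two values in separate database columns rather than reproduce this trick.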