Closed Laurian closed 5 years ago
More specifically, I can store in a database the data which def store(self, name, timehashpairs) is called with; what I'm not sure about is how to plug my database data into the def get_hits(self, hashes) function.
Should I skip the hashmask/timemask bits? I understand those were needed because of the current limitations of the hash table.
In principle, the hash table builds an index that, for each of the 2^20 hash values, stores the tracks that include that hash (and the time frames within those tracks at which the hash occurs). hashtable.store(track_name, time_hash_pairs_in_track) records all the hashes for a particular track (specified by name, but internally represented by an index); hashtable.get_hits(time_hash_pairs_in_query) gathers all the tracks that include each of the hashes present in the query by merging the per-hash lists in the database.
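To make the index structure concrete, here is a minimal in-memory sketch using a plain dict instead of the packed table. The class name DictHashTable, the lookup method, and the (time, hash) ordering of each pair are assumptions for illustration, not the project's actual code.

```python
from collections import defaultdict


class DictHashTable:
    """Toy index: hash -> list of (track_id, time_frame_in_track)."""

    def __init__(self):
        self.names = []                 # track_id -> track name
        self.index = defaultdict(list)  # hash -> [(track_id, time_frame), ...]

    def store(self, name, timehashpairs):
        """Record every (time, hash) pair of one track; return its track_id."""
        track_id = len(self.names)
        self.names.append(name)
        for time_frame, hash_ in timehashpairs:
            self.index[hash_].append((track_id, time_frame))
        return track_id

    def lookup(self, hash_):
        """Return all (track_id, time_frame) pairs stored under one hash."""
        # .get avoids creating empty entries for unseen hashes.
        return list(self.index.get(hash_, []))
```

Tracks are referred to by name on the way in, but everything stored under a hash uses the integer track_id, which mirrors the "specified by name, but internally represented by an index" description above.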
To convert this to a different database, you set up the database to store (track_id, time_frame_in_track) pairs indexed by each hash. Then, on get_hits, you go through each hash, retrieve the list of associated (track_id, time_frame) pairs, then calculate the per-hash per-track data row, which contains (track_id, time_frame_in_track - time_frame_in_query, hash, time_frame_in_track).
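As a rough sketch of that recipe, the following uses Python's built-in sqlite3 as a stand-in for MySQL. The table name fingerprints, its column names, and the free-function signatures are assumptions made for this example; only the shape of the returned rows follows the description above.

```python
import sqlite3


def make_db():
    """Create an in-memory database with one row per (hash, track, time)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fingerprints "
        "(hash INTEGER, track_id INTEGER, time_frame INTEGER)"
    )
    conn.execute("CREATE INDEX idx_hash ON fingerprints (hash)")
    return conn


def store(conn, track_id, timehashpairs):
    """Insert all (time, hash) pairs of one track."""
    conn.executemany(
        "INSERT INTO fingerprints (hash, track_id, time_frame) "
        "VALUES (?, ?, ?)",
        [(int(h), track_id, int(t)) for t, h in timehashpairs],
    )


def get_hits(conn, query_timehashpairs):
    """Return rows of (track_id, time_in_track - time_in_query, hash,
    time_in_track), one per matching database entry."""
    hits = []
    for query_time, h in query_timehashpairs:
        rows = conn.execute(
            "SELECT track_id, time_frame FROM fingerprints "
            "WHERE hash = ? ORDER BY track_id, time_frame",
            (int(h),),
        )
        for track_id, track_time in rows:
            hits.append((track_id, track_time - query_time, h, track_time))
    return hits
```

If the downstream matching code expects a numpy array rather than a list of tuples, the result can be wrapped with np.array(hits) before it is passed on. Issuing one query per hash keeps the sketch simple; a real deployment would probably batch the lookups.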
The second value (time frame difference) is computed for convenience: this is the value that will be approximately constant for multiple hashes matching from a common piece of audio. The match logic then looks, for each matched track_id, for the time difference with the greatest number of hashes.
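The voting step can be sketched as below, assuming hits arrive as (track_id, time_difference, hash, time_in_track) tuples. The function name best_offsets is hypothetical, and this version counts only exactly equal time differences, whereas a real matcher would likely tolerate small deviations.

```python
from collections import Counter


def best_offsets(hits):
    """For each track_id, find the time difference shared by the most hits.

    Returns {track_id: (time_difference, number_of_matching_hashes)}.
    """
    # Count how many hits agree on each (track, time difference) combination.
    votes = Counter((track_id, dt) for track_id, dt, _hash, _time in hits)
    best = {}
    for (track_id, dt), count in votes.items():
        # Keep the most popular difference per track; ties keep the first seen.
        if track_id not in best or count > best[track_id][1]:
            best[track_id] = (dt, count)
    return best
```

Sorting the resulting tracks by their count then gives a ranked list of candidate matches.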
The hashmask/timemask stuff is just to allow me to store the (track_id, time_frame_in_track) pairs in a single int32. If you're going full-database, you probably don't want to mess with that level of optimization; just store the pairs as-is.
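For anyone curious what that packing looks like, here is an illustrative sketch. The 14-bit width for the time field is an assumed value chosen for the example and may not match the actual code; the names TIME_BITS, pack and unpack are likewise hypothetical.

```python
TIME_BITS = 14                    # assumed field width, for illustration only
TIME_MASK = (1 << TIME_BITS) - 1  # low bits hold the time frame


def pack(track_id, time_frame):
    """Combine a track id and a time frame into one integer."""
    return (track_id << TIME_BITS) | (time_frame & TIME_MASK)


def unpack(packed):
    """Recover (track_id, time_frame) from a packed integer."""
    return packed >> TIME_BITS, packed & TIME_MASK
```

Note that the mask makes time frames wrap around modulo 2^TIME_BITS, and the remaining bits cap the number of tracks. Separate database columns have neither limit, which is why the packing can be dropped there.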
Hi,
I'm looking into delegating some of the scalability issues to a known database, for now MySQL.
I can read fingerprints (with your code) and store them in MySQL (using some dejavu db code); and I can read hash matches back: https://gist.github.com/Laurian/7869355a000c803f26bb434935a367cb#file-test-py
I'm struggling with how to feed those hash matches back into your further processing, as I don't quite follow the magic around hashmask, timemask and some of the numpy operations you do.
How would you recommend approaching alternate hashtable implementations?