JorenSix / Panako

The Panako acoustic fingerprinting system.
GNU Affero General Public License v3.0

PanakoStrategy Query Logic - maxListSize @ 250 needs an override #36

Closed by lucaslawes 2 years ago

lucaslawes commented 2 years ago

Possible minor refactoring to improve the recognition rate.

Testing results: Running the query algorithm on a high-powered system showed that taking half of the query matches as firstHits and half as lastHits (see below) gives a slightly better recognition rate.

Suggestion: Add maxListSize to config.properties, perhaps with a switch that lets the query algorithm take half of the query matches for each part.

if(!overrideMaxListSize) {
  //view the first and last hits (max 250)
  int maxListSize = 250;
  firstHits = queryMatches.subList(0, Math.min(maxListSize,Math.max(minimumUnfilteredHits,queryMatches.size()/5)));
  lastHits  = queryMatches.subList(queryMatches.size()-Math.min(maxListSize, Math.max(minimumUnfilteredHits,queryMatches.size()/5)), queryMatches.size());
}
else { // Taking half and half seems to achieve a better recognition rate
  var numQueryMatches = queryMatches.size();
  var numQueryMatchesEvened = numQueryMatches % 2 == 0 ? numQueryMatches : numQueryMatches - 1;
  var batchSize = numQueryMatchesEvened / 2;
  firstHits = queryMatches.subList(0, batchSize - 1);
  lastHits = queryMatches.subList(numQueryMatchesEvened - batchSize , numQueryMatches - 1); 
}

JorenSix commented 2 years ago

Thanks for the bug report!

This is indeed a 'magic number' that should be set in the configuration settings. Having a switch in the configuration settings seems like a reasonable thing to do, especially if performance or query time is less of an issue.

The reason for taking only 250 is performance: calculating a median on a small list is more efficient than on a potentially very large one (half the hits could be a lot). Figure 1 in the Panako 2.0 article shows exactly this idea. The impact on retrieval rate is expected to be limited, but it has not been thoroughly tested and might differ from one application to another. In noisy settings, many spectral peaks may be present in the query but absent from the reference database, and 250 hits might not be enough to reach 'agreement', that is, a relevant median. That is another reason to add it to the configuration settings.
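
To make the cost argument concrete, here is a minimal sketch of a median taken over a capped slice of a list. The class name CappedMedian, the method medianOfFirst and the use of plain integers are assumptions for illustration only and are not taken from the Panako source.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Panako: median of a capped slice of a list.
class CappedMedian {

    // Returns the median of the first min(cap, values.size()) elements.
    // Sorting a copy of k elements costs O(k log k), so a small cap
    // bounds the work no matter how long the full list is.
    static int medianOfFirst(List<Integer> values, int cap) {
        if (values.isEmpty() || cap <= 0) {
            throw new IllegalArgumentException("need at least one value and a positive cap");
        }
        int k = Math.min(cap, values.size());
        // subList's end index is exclusive, so this copies exactly k elements.
        List<Integer> slice = new ArrayList<>(values.subList(0, k));
        Collections.sort(slice);
        return slice.get(k / 2); // upper median when k is even
    }

    public static void main(String[] args) {
        List<Integer> deltas = List.of(7, 3, 9, 1, 5, 100, 200);
        System.out.println(medianOfFirst(deltas, 5));   // median of 7, 3, 9, 1, 5
        System.out.println(medianOfFirst(deltas, 250)); // cap exceeds the size: whole list is used
    }
}
```

With a cap of 250 the sort never touches more than 250 elements, which is the trade-off described above: bounded work, at the risk of too few hits to agree on a median in noisy queries.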

JorenSix commented 2 years ago

The last commit should allow the requested functionality:

By setting PANAKO_HIT_PART_MAX_SIZE to a very high number (Integer.MAX_VALUE) and PANAKO_HIT_PART_DIVIDER to 2, the first and last parts should be equal to:

var numQueryMatches = queryMatches.size();
var numQueryMatchesEvened = numQueryMatches % 2 == 0 ? numQueryMatches : numQueryMatches - 1;
var batchSize = numQueryMatchesEvened / 2;
firstHits = queryMatches.subList(0, batchSize - 1);
lastHits = queryMatches.subList(numQueryMatchesEvened - batchSize , numQueryMatches - 1);
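
For readers who want to see how the two settings interact, the following is a rough sketch that parameterizes the size expression quoted earlier in this issue. The class HitParts, its method names, the extra clamp to the list size and the value 10 used for minimumUnfilteredHits are all assumptions for illustration; the actual commit may differ. Note that List.subList treats its end index as exclusive, so subList(0, batchSize - 1) in the snippet above returns batchSize - 1 elements, whereas the sketch returns parts of the full computed size.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the Panako implementation.
class HitParts {

    // Mirrors min(maxSize, max(minHits, n / divider)) from the issue text.
    // The final clamp to n is an added safeguard (an assumption).
    static int partSize(int n, int maxSize, int divider, int minHits) {
        int size = Math.min(maxSize, Math.max(minHits, n / divider));
        return Math.min(size, n);
    }

    // First `size` matches; the end index of subList is exclusive.
    static <T> List<T> firstPart(List<T> matches, int size) {
        return matches.subList(0, size);
    }

    // Last `size` matches.
    static <T> List<T> lastPart(List<T> matches, int size) {
        return matches.subList(matches.size() - size, matches.size());
    }

    public static void main(String[] args) {
        List<Integer> matches = new ArrayList<>();
        for (int i = 0; i < 1001; i++) {
            matches.add(i);
        }
        int minHits = 10; // assumed value for minimumUnfilteredHits

        // Defaults discussed in the issue: at most 250 hits, one fifth of the matches.
        int defaultSize = partSize(matches.size(), 250, 5, minHits);

        // Requested override: PANAKO_HIT_PART_MAX_SIZE = Integer.MAX_VALUE, PANAKO_HIT_PART_DIVIDER = 2.
        int halfSize = partSize(matches.size(), Integer.MAX_VALUE, 2, minHits);

        System.out.println(defaultSize + " " + halfSize);
        System.out.println(firstPart(matches, halfSize).size());
        System.out.println(lastPart(matches, halfSize).get(0));
    }
}
```

With an odd number of matches, integer division leaves the middle match out of both parts, which matches the intent of the "evened" count in the snippet above.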