Filter by top result per recording

LanesGood commented 1 year ago

When results are returned, the frontend should be able to filter results by unique recordings. This would allow a view in which only the result with the lowest similarity distance for each recording is displayed.

Same as developmentseed/bioacoustics-api#23

LanesGood commented 1 year ago

Rationale: when I execute a search I may want to see the distribution of similarity across all recordings.

The default search result will return all results across the entire index, many of which will occur in the same recording. Given a two hour recording, a single bird/animal call is likely to show up multiple times.

The main use case for the default search is "show me all possible similar instances of this query."

The main use case for this proposed "best result per recording" search is "show me all recordings where a result similar to this query is present." If the results include duplicate instances from the same recording, users will need to manually parse and determine unique recording instances.

As discussed with @willemarcel and @geohacker, doing this on the API side would require holding all results (up to 65,000) in memory and then grouping and filtering on these. We may instead be able to explore doing this grouping/filtering on the frontend. This would of course be restricted to the single page of results currently in view.

cc @oliverroick

geohacker commented 1 year ago

Thank you @LanesGood. The more I think about this, the more I'm inclined to suggest that we treat filters somewhat more flexibly? I don't know though. Would have to hear from @leothomas. Could this file id be another filter that we add to search?

So in case the user says 'oh find me only this file' we just run the search again with the file filter?

leothomas commented 1 year ago

Hey there! So if I understand correctly, what you're suggesting is:

Where a basic search returns a result set like:

1. distance: 1.0, id: 111, fileId: abc
2. distance: 1.2, id: 112, fileId: abc
3. distance: 1.4, id: 222, fileId: def
4. distance: 1.5, id: 113, fileId: abc

We would want this filtered search to only return the top result per unique file in the result set:

1. distance: 1.0, id: 111, fileId: abc
2. distance: 1.4, id: 222, fileId: def

Is that correct, @LanesGood ?

We have both a file_sequence_id field and a filename field that we can execute this filter/grouping on. The only concern is that there's no way to enforce a number of unique file results. Each file is 2 hrs * 60 min/hr * 60 sec/min / 5 sec = 1440 possible results per file. This means that if we request 100 results at a time and then filter by unique files, in the absolute worst-case scenario, we may need to make 14 requests before getting a second unique file.

Could we consider implementing a "simplify" feature in the frontend, where a user would first execute a query, be presented with the result set, and then be given the opportunity to group the results by unique files (keeping the top result per file, which is easy to do, since the results are already sorted by distance)?
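For illustration, here is a minimal sketch of what that frontend "simplify" step could look like. The `SearchResult` shape and the `topResultPerFile` name are assumptions based on the example result set above, not the actual API response or frontend code, and the sketch relies on the results already being sorted by ascending distance.

```typescript
// Hypothetical shape of one search hit. Field names follow the example
// result set in this thread, not the real API response.
interface SearchResult {
  distance: number;
  id: number;
  fileId: string;
}

// Keep only the first result seen for each file. Because the input is
// assumed to be sorted by ascending distance, the first hit per file is
// also the lowest-distance hit for that file.
function topResultPerFile(results: SearchResult[]): SearchResult[] {
  const seenFiles = new Set<string>();
  const top: SearchResult[] = [];
  for (const result of results) {
    if (!seenFiles.has(result.fileId)) {
      seenFiles.add(result.fileId);
      top.push(result);
    }
  }
  return top;
}

// The four-row example from this comment.
const page: SearchResult[] = [
  { distance: 1.0, id: 111, fileId: "abc" },
  { distance: 1.2, id: 112, fileId: "abc" },
  { distance: 1.4, id: 222, fileId: "def" },
  { distance: 1.5, id: 113, fileId: "abc" },
];

// Keeps id 111 (file abc) and id 222 (file def), in distance order.
const grouped = topResultPerFile(page);
```

This only groups whatever results the frontend already holds, so it cannot surface files whose best match falls outside the fetched set.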

LanesGood commented 1 year ago

@leothomas yes, that's a correct understanding of the filtered search results. @geohacker I'm not sure that the use case you've proposed is what's requested here. Rather than just searching for a single file, users in this proposed feature want to see all possible matches - with only 1 matching result per file, up to the limit of (65,000?).

Why would we need to request only 100 results? My understanding of this use case is that we would want all possible results.

I do think your proposed "group by top result per file" feature is a good middle-ground. This could probably be selected before running the query as well, right? There would just be a bit of a delay while the grouping is applied.

leothomas commented 1 year ago

> Why would we need to request only 100 results?

100 was an arbitrary number. When users search against Milvus, they have to specify the number of results they want.

> This could probably be selected before running the query as well, right? There would just be a bit of a delay while the grouping is applied.

Definitely possible, but the issue is that there's no way to specify a uniqueness condition for a metadata field. So we have to first request X results from Milvus and then group by unique files, which, at worst, reduces the result set by a factor of 1440 to 1. So if the user wants X results, each from a unique file, the API needs to query 1440 * X results.
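To make the worst-case numbers concrete, here is a rough sketch of the arithmetic. It assumes every recording is exactly 2 hours long and indexed in non-overlapping 5-second windows (the figures used earlier in this thread); the function names are made up for illustration. Strictly, 1440 * (X - 1) + 1 results are enough to guarantee X unique files, so 1440 * X is a slightly looser but simpler upper bound.

```typescript
// Figures from this thread: 2-hour recordings, 5-second windows.
const RECORDING_SECONDS = 2 * 60 * 60;
const WINDOW_SECONDS = 5;
const SEGMENTS_PER_FILE = RECORDING_SECONDS / WINDOW_SECONDS; // 1440

// Smallest number of results that guarantees `uniqueFiles` distinct files
// in the worst case: (uniqueFiles - 1) files can absorb at most
// SEGMENTS_PER_FILE results each, so one more result must come from a new file.
function worstCaseFetchSize(uniqueFiles: number): number {
  if (uniqueFiles < 1) return 0;
  return SEGMENTS_PER_FILE * (uniqueFiles - 1) + 1;
}

// Number of consecutive fixed-size requests that could, in the worst case,
// contain results from a single file only.
function requestsFromSingleFile(pageSize: number): number {
  return Math.floor(SEGMENTS_PER_FILE / pageSize);
}

const fetchForTwoFiles = worstCaseFetchSize(2); // 1441
const pagesOfOneFile = requestsFromSingleFile(100); // 14
```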

leothomas commented 1 year ago

^ pigeon hole principle

geohacker commented 1 year ago

@LanesGood @leothomas @oliverroick let's pause on this one until we figure out next steps on #64 — this is a considerable lift and something that we haven't budgeted initially.

developmentseed / bioacoustics-frontend

Filter by top result per recording #54