Technically all of this is possible; it can definitely be built. It will add some hours, though, so we need to consider that.
I'm going to play the hater for a moment so we know what we're getting into:
If pagination works with query parameters like `...?page=2` or `...?offset=5000`, then we parallelise; if for some reason we have to work with tokens, then we have to query iteratively. All of this probably also needs input from @willemarcel and @leothomas.
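To make the parallel option concrete, here's a minimal sketch assuming an offset-based endpoint; the URL and parameter names are placeholders, not the actual API:

```python
# Sketch: fetch all pages in parallel when the API supports offset-based
# pagination. Endpoint URL and parameter names are assumptions.
from concurrent.futures import ThreadPoolExecutor

import requests

BASE_URL = "https://api.example.com/search"  # placeholder endpoint
PAGE_SIZE = 500
TOTAL = 5000

def fetch_page(offset):
    resp = requests.get(BASE_URL, params={"offset": offset, "limit": PAGE_SIZE})
    resp.raise_for_status()
    return resp.json()["results"]

with ThreadPoolExecutor(max_workers=5) as pool:
    pages = pool.map(fetch_page, range(0, TOTAL, PAGE_SIZE))

results = [item for page in pages for item in page]
```

With continuation tokens this collapses into a sequential loop, because each request depends on the previous response's token.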
Side note: should we use discussions for things like this instead of issues? That way we can use issues for workable things and discussions for things we need to work out, and I'd have a nice overview of all the things that still need to be done.
In my view, the main question is how many results the users are interested in seeing. If they really need thousands of matches, we should find a way to give it to them.
I did some tests, filtering the results by unique file ids:
| results | number of files | last item distance |
| --- | --- | --- |
| 5000 | 1032 | 5.37 |
| 4000 | 878 | 5.30 |
| 3000 | 713 | 5.22 |
| 2000 | 557 | 5.10 |
| 1000 | 338 | 4.88 |
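The filtering here is essentially de-duplication by file id; a minimal sketch of what that looks like, assuming each hit carries a `file_id` field (the field name is an assumption):

```python
# Sketch: count unique files among the top-k search results.
# The "file_id" key on each hit is an assumption.
def unique_file_count(hits, k):
    return len({hit["file_id"] for hit in hits[:k]})

# e.g. unique_file_count(hits, 5000) gave 1032 in the test above
```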
The API returns 5000 results in around 12 seconds. We can reduce the size of the response by half if we remove the `filename` field (I believe we are not going to use it) and the `image_url` and `audio_url` fields (we can construct them in the frontend).
I agree we should remove the `filename` field if we are able to link to the file in its original hosting environment. With filenames like `20220414T104948+1000_REC.flac`, it doesn't seem useful or necessary to include these in the API.
cc @oliverroick for composing `image_url` and `audio_url`. I know the frontend is currently composing these, so just a heads up that we might remove them from the API.
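Composing those URLs client-side should be cheap; a sketch of the idea, where the base URL and path scheme are assumptions:

```python
# Sketch: derive media URLs from a record's id instead of shipping them
# in every API response. Base URL and path scheme are assumptions.
MEDIA_BASE = "https://media.example.com"  # placeholder

def media_urls(file_id):
    return {
        "image_url": f"{MEDIA_BASE}/spectrograms/{file_id}.png",
        "audio_url": f"{MEDIA_BASE}/audio/{file_id}.flac",
    }
```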
I've implemented a paginated search today. The API is very slow at the moment, and I managed to crash the server a couple of times. For testing purposes I've limited the search to 10 pages with 10 items each.
The current implementation looks like this: each page is fetched with a separate search request. This works OK, but it doesn't make for a good user experience. As you can see, the order of results can change whenever there's a new response. Now if we assume a response returns not 10 but 5,000 records, we should also assume that users will start paginating, filtering, and sorting the results list. Every new response can change and reset how the results are displayed.
How we can address this:
We're somewhat limited on the front-end in terms of improving the UX. I can only think of getting one page with 5,000 results, but then the backend would have to return the 5,000 closest matches on page 1.
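In that setup the frontend fetches once and pages locally; a rough sketch, where the endpoint and response shape are assumptions:

```python
# Sketch: fetch the N closest matches in one request, then paginate locally.
# Endpoint and response shape are assumptions.
import requests

resp = requests.get("https://api.example.com/search", params={"limit": 5000})
results = resp.json()["results"]  # the 5000 closest matches, already ordered

def get_page(page_number, page_size=10):
    start = page_number * page_size
    # No new request per page, so the ordering never reshuffles.
    return results[start:start + page_size]
```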
In terms of making the API faster, it would be interesting to be able to analyse the query response timing (as Milvus passes the query from the scheduler node to the data nodes, etc.). When running Milvus Lite locally (no scheduling nodes, etc.), the query returns 100 results in 0.18 seconds. We can improve this slightly by switching to a different indexing strategy or by reducing the `nprobe` parameter; however, accuracy will decrease in both cases. The amount of time we could shave off the query depends on how Milvus handles it internally, so I'm not sure exactly how much we would save.
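For reference, `nprobe` is passed as a search-time parameter; a minimal sketch with pymilvus, where the collection name, field name, metric, and vector dimension are assumptions:

```python
# Sketch: a pymilvus search where nprobe trades accuracy for speed.
# Collection name, field name, metric, and dimension are assumptions.
from pymilvus import Collection

collection = Collection("embeddings")
query_vector = [0.0] * 512  # placeholder embedding

hits = collection.search(
    data=[query_vector],
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 8}},  # lower nprobe: faster, less accurate
    limit=100,
)
```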
For reference, NASA Similarity Search pulls all of the returned ids from the FAISS index and writes them to a temp table in Postgres, which is then left-joined against the metadata table to get all of the metadata and perform temporal/spatial aggregations. Temporal aggregations are performed using Postgres's native datetime comparisons, and spatial aggregations are performed by grouping substrings of quadkeys using Postgres's `substring` function.

Aggregating ~100k results at zoom level 6 and with monthly temporal buckets takes >20 seconds.
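For anyone unfamiliar with that pattern, it is roughly this; table names, column names, and the DSN are assumptions, not the actual NASA schema:

```python
# Sketch of the temp-table + quadkey-substring aggregation described above.
# Table names, column names, and the DSN are assumptions.
import psycopg2

conn = psycopg2.connect("dbname=similarity")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute("CREATE TEMP TABLE hits (file_id text PRIMARY KEY)")
    # ... insert the ids returned by the FAISS index into `hits` ...
    cur.execute("""
        SELECT substring(m.quadkey FROM 1 FOR 6) AS cell,    -- zoom level 6
               date_trunc('month', m.recorded_at) AS month,  -- monthly buckets
               count(*)
        FROM hits h
        LEFT JOIN metadata m USING (file_id)
        GROUP BY cell, month
    """)
    rows = cur.fetchall()
```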
Looking at features like #54, I'm starting to think about what a least-effort implementation for aggregates and filters would look like.
What if we can cap the total number of results we request from Milvus and then handle filtering, aggregation, counts, and downloads on that fixed set in the frontend?

This would overall create a nice user experience: aggregates and filters would be instant, downloads would be fast, and we could do some nice filters and counts. The load is kept in check by the limit on the total number of results from Milvus, and there are no other implications for our existing infrastructure.
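Concretely, once the capped result set is in memory on the client, filters and counts are just list operations; a sketch, where the field names are assumptions:

```python
# Sketch: instant filters and counts over a fixed, in-memory result set.
# Field names ("site", "distance") are assumptions.
from collections import Counter

# `results` stands in for the capped list returned by the single search request.
results = [
    {"file_id": "a", "site": "north", "distance": 4.2},
    {"file_id": "b", "site": "south", "distance": 5.9},
]

def filter_by_distance(items, max_distance):
    return [r for r in items if r["distance"] <= max_distance]

def counts_by_site(items):
    return Counter(r["site"] for r in items)

filtered = filter_by_distance(results, max_distance=5.0)
print(len(filtered), counts_by_site(filtered))
```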
@oliverroick what do you think? This is a bit more work on the frontend and could put our budget in trouble but let's see what's possible.
cc @developmentseed/google-bioacoustics