developmentseed / bioacoustics-frontend

Frontend code for the Google Bioacoustics project that's used by A2O Search
https://search.acousticobservatory.org/
MIT License

Fetch all search results on the frontend #64

Closed · geohacker closed this issue 1 year ago

geohacker commented 1 year ago

Looking at features like #54, I'm starting to think about what the least-effort implementation for aggregates and filters would be.

What if we can:

Overall this will create a nice user experience: aggregates and filters will be instant, downloads will be fast, and we can offer some nice filters and counts. The load is kept in check by the limit on the total number of results from Milvus, and there are no other implications for our existing infrastructure.
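To illustrate the idea, here is a rough sketch of what client-side filtering and counts could look like once all results are held in memory. The field names are placeholders, not the actual API schema:

```ts
// Hypothetical result shape; the real API schema may differ.
type SearchResult = {
  file_id: number;
  site_name: string;
  distance: number;    // similarity distance returned by Milvus
  recorded_at: string; // ISO timestamp
};

// Instant client-side filtering, no extra API round trips.
function filterByDistance(results: SearchResult[], maxDistance: number): SearchResult[] {
  return results.filter((r) => r.distance <= maxDistance);
}

// Instant counts per site for an aggregates panel.
function countBySite(results: SearchResult[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const r of results) {
    counts.set(r.site_name, (counts.get(r.site_name) ?? 0) + 1);
  }
  return counts;
}
```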

@oliverroick what do you think? This is a bit more work on the frontend and could put our budget under pressure, but let's see what's possible.

cc @developmentseed/google-bioacoustics

oliverroick commented 1 year ago

Technically all of this is possible; it can definitely be built. It will add some hours, though, so we need to factor that in.

I'm going to play the hater now so that we consider what we're getting into:

All this probably also needs input from @willemarcel and @leothomas


Side note: should we use discussions for things like this instead of issues? That way we can use issues for workable items and discussions for things we still need to work out, and I'd have a nice overview of everything that still needs to be done.

willemarcel commented 1 year ago

In my view, the main question is how many results the users are interested in seeing. If they really need thousands of matches, we should find a way to give it to them.

I did some tests, filtering the results by unique file ids:

| results | number of files | last item distance |
| --- | --- | --- |
| 5000 | 1032 | 5.37 |
| 4000 | 878 | 5.30 |
| 3000 | 713 | 5.22 |
| 2000 | 557 | 5.10 |
| 1000 | 338 | 4.88 |

The API returns 5,000 results in around 12 seconds. We can halve the size of the response by removing the filename field (I believe we are not going to use it) and the image_url and audio_url fields (we can construct them in the frontend).
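For reference, the per-file deduplication behind the numbers above could look roughly like this on the frontend; it assumes results arrive sorted by ascending distance and expose a file_id field (both assumptions):

```ts
// Keep only the closest match per file, assuming `results` is sorted by
// ascending distance (nearest matches first).
type Match = { file_id: number; distance: number };

function uniqueByFile(results: Match[]): Match[] {
  const seen = new Set<number>();
  const unique: Match[] = [];
  for (const r of results) {
    if (!seen.has(r.file_id)) {
      seen.add(r.file_id);
      unique.push(r);
    }
  }
  return unique;
}
```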

LanesGood commented 1 year ago

I agree we should remove the filename field if we are able to link to the file in its original hosting environment. With filenames like 20220414T104948+1000_REC.flac, it doesn't seem useful or necessary to include them in the API.

cc @oliverroick for composing image_url and audio_url - I know the frontend is currently composing these, so just a heads up that we might remove them from the API
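For illustration only, composing the URLs on the frontend could look something like the sketch below; the base URL, path layout, and field names are placeholders, since the real scheme isn't shown in this thread:

```ts
// Hypothetical: MEDIA_BASE and the path layout are placeholders, not the
// actual A2O media hosting scheme.
const MEDIA_BASE = 'https://example-media-host.org';

type ResultRef = {
  file_id: number;
  offset_seconds: number; // assumed field for the matched clip position
};

function imageUrl(r: ResultRef): string {
  return `${MEDIA_BASE}/spectrograms/${r.file_id}_${r.offset_seconds}.png`;
}

function audioUrl(r: ResultRef): string {
  return `${MEDIA_BASE}/audio/${r.file_id}_${r.offset_seconds}.flac`;
}
```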

oliverroick commented 1 year ago

I've implemented a paginated search today. The API is very slow at the moment and I managed to crash the server a couple of times. For testing purposes I've limited the search to 10 pages with 10 items each.

The current implementation looks like this:

This works OK, but it doesn't make for a good user experience.

https://github.com/developmentseed/bioacoustics-frontend/assets/159510/87e0474a-d90b-498d-a791-10aab8fc9c2a

As you can see, the order of results can change whenever a new response arrives. Now if we assume a response returns not 10 but 5,000 records, we should also assume that users will start paginating, filtering, and sorting the results list. Every new response can then change and reset how the results are displayed.

How we can address this:

We're somewhat limited on the front-end in terms of improving the UX. I can only think of getting one page with 5,000 results, but then the backend would have to return the 5,000 closest matches on page 1.
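One rough sketch of the "only render once everything has arrived" option; the page shape and the fetchPage helper are assumptions, not the actual client code in this repo:

```ts
// Sketch: accumulate all pages first, sort once by distance, then render.
type Match = { file_id: number; distance: number };

async function fetchAllPages(
  fetchPage: (page: number) => Promise<Match[]>,
  totalPages: number
): Promise<Match[]> {
  const all: Match[] = [];
  for (let page = 1; page <= totalPages; page++) {
    const batch = await fetchPage(page);
    all.push(...batch);
  }
  // Sorting (and rendering) only after every page has arrived avoids the
  // reshuffling shown in the recording above.
  return all.sort((a, b) => a.distance - b.distance);
}
```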

leothomas commented 1 year ago

In terms of making the API faster, it would be interesting to analyse the query response timing (as Milvus passes the query from the scheduler node to the data nodes, etc.). When running Milvus Lite locally (no scheduling nodes, etc.), the query returns 100 results in 0.18 seconds. We can improve this slightly by switching to a different indexing strategy or by reducing the nprobe parameter; however, accuracy will decrease in both cases. How much time we could shave off the query depends on how Milvus handles it internally, so I'm not sure exactly how much we would save.

For reference, NASA Similarity Search pulls all the returned IDs from the FAISS index and writes them to a temp table in Postgres, which is then left joined against the metadata table to get all of the metadata needed to perform temporal/spatial aggregations. Temporal aggregations use Postgres's native datetime comparisons, and spatial aggregations group sub-strings of quadkeys using Postgres's substring function.

Aggregating ~100k results at zoom level 6 with monthly temporal buckets takes >20 seconds.
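For comparison, the same tile/month bucketing expressed client-side could look roughly like this; it assumes each result carries a quadkey and an ISO timestamp, which may not match the real metadata schema:

```ts
// Client-side version of the same bucketing described above.
type GeoResult = { quadkey: string; recorded_at: string };

// The first N characters of a quadkey identify the ancestor tile at zoom N,
// so slicing to 6 characters groups results into zoom-6 tiles.
function aggregateByTileAndMonth(results: GeoResult[], zoom = 6): Map<string, number> {
  const buckets = new Map<string, number>();
  for (const r of results) {
    const tile = r.quadkey.slice(0, zoom);
    const month = r.recorded_at.slice(0, 7); // "YYYY-MM"
    const key = `${tile}|${month}`;
    buckets.set(key, (buckets.get(key) ?? 0) + 1);
  }
  return buckets;
}
```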