Technically all of this is possible; it can definitely be built. It will add some hours, though, so we need to consider that.
I'm going to play the hater for a moment so we know what we're getting into:
If pagination works with query parameters like `...?page=2` or `...?offset=5000`, then we parallelise; if for some reason we have to work with tokens, then we have to query iteratively. All of this probably also needs input from @willemarcel and @leothomas.
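To make the parallel option concrete, here's a minimal sketch assuming an offset-based endpoint; the URL and parameter names are placeholders, not the actual API:

```python
# Sketch: fetch all pages in parallel when the API supports offset-based
# pagination. Endpoint URL and parameter names are assumptions.
from concurrent.futures import ThreadPoolExecutor

import requests

BASE_URL = "https://api.example.com/search"  # placeholder endpoint
PAGE_SIZE = 500
TOTAL = 5000

def fetch_page(offset):
    resp = requests.get(BASE_URL, params={"offset": offset, "limit": PAGE_SIZE})
    resp.raise_for_status()
    return resp.json()["results"]

with ThreadPoolExecutor(max_workers=5) as pool:
    pages = pool.map(fetch_page, range(0, TOTAL, PAGE_SIZE))

results = [item for page in pages for item in page]
```

With continuation tokens this collapses into a sequential loop, because each request depends on the previous response's token.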
Side note: should we use discussions for things like this instead of issues? That way we can use issues for workable things and discussions for things we need to work out, and I'd have a nice overview of all the things that still need to be done.
In my view, the main question is how many results the users are interested in seeing. If they really need thousands of matches, we should find a way to give it to them.
I did some tests, filtering the results by unique file ids:
| results | number of files | last item distance |
| --- | --- | --- |
| 5000 | 1032 | 5.37 |
| 4000 | 878 | 5.30 |
| 3000 | 713 | 5.22 |
| 2000 | 557 | 5.10 |
| 1000 | 338 | 4.88 |
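The filtering here is essentially de-duplication by file id; a minimal sketch of what that looks like, assuming each hit carries a `file_id` field (the field name is an assumption):

```python
# Sketch: count unique files among the top-k search results.
# The "file_id" key on each hit is an assumption.
def unique_file_count(hits, k):
    return len({hit["file_id"] for hit in hits[:k]})

# e.g. unique_file_count(hits, 5000) gave 1032 in the test above
```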
The API returns 5000 results in around 12 seconds. We can reduce the size of the response by half if we remove the `filename` field (I believe we are not going to use it) and the `image_url` and `audio_url` fields (we can construct them in the frontend).
I agree we should remove the `filename` field if we are able to link to the file in its original hosting environment. With filenames like `20220414T104948+1000_REC.flac`, it doesn't seem useful or necessary to include these in the API.
cc @oliverroick for composing `image_url` and `audio_url`. I know the frontend is currently composing these, so just a heads up that we might remove them from the API.
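Composing those URLs client-side should be cheap; a sketch of the idea, where the base URL and path scheme are assumptions:

```python
# Sketch: derive media URLs from a record's id instead of shipping them
# in every API response. Base URL and path scheme are assumptions.
MEDIA_BASE = "https://media.example.com"  # placeholder

def media_urls(file_id):
    return {
        "image_url": f"{MEDIA_BASE}/spectrograms/{file_id}.png",
        "audio_url": f"{MEDIA_BASE}/audio/{file_id}.flac",
    }
```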
I've implemented a paginated search today. The API is very slow at the moment, and I managed to crash the server a couple of times. For testing purposes I've limited the search to 10 pages with 10 items each.
The current implementation looks like this: each page is fetched with a separate search request. This works OK, but it doesn't make for a good user experience. As you can see, the order of results can change whenever there's a new response. Now if we assume a response returns not 10 but 5,000 records, we should also assume that users will start paginating, filtering, and sorting the results list. Every new response can change and reset how the results are displayed.
How we can address this:
We're somewhat limited on the front-end in terms of improving the UX. I can only think of getting one page with 5,000 results, but then the backend would have to return the 5,000 closest matches on page 1.
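In that setup the frontend fetches once and pages locally; a rough sketch, where the endpoint and response shape are assumptions:

```python
# Sketch: fetch the N closest matches in one request, then paginate locally.
# Endpoint and response shape are assumptions.
import requests

resp = requests.get("https://api.example.com/search", params={"limit": 5000})
results = resp.json()["results"]  # the 5000 closest matches, already ordered

def get_page(page_number, page_size=10):
    start = page_number * page_size
    # No new request per page, so the ordering never reshuffles.
    return results[start:start + page_size]
```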
In terms of making the API faster, it would be interesting to be able to analyse the query response timing (as Milvus passes the query from the scheduler node to the data nodes, etc.). When running Milvus Lite locally (no scheduling nodes, etc.), the query returns 100 results in 0.18 seconds. We can improve this slightly by switching to a different indexing strategy or by reducing the `nprobe` parameter; however, accuracy will decrease in both cases. The amount of time we could shave off the query depends on how Milvus handles it internally, so I'm not sure exactly how much we would save.
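For reference, `nprobe` is passed as a search-time parameter; a minimal sketch with pymilvus, where the collection name, field name, metric, and vector dimension are assumptions:

```python
# Sketch: a pymilvus search where nprobe trades accuracy for speed.
# Collection name, field name, metric, and dimension are assumptions.
from pymilvus import Collection

collection = Collection("embeddings")
query_vector = [0.0] * 512  # placeholder embedding

hits = collection.search(
    data=[query_vector],
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 8}},  # lower nprobe: faster, less accurate
    limit=100,
)
```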
For reference, NASA Similarity Search pulls all of the returned ids from the FAISS index and writes them to a temp table in Postgres, which is then left-joined against the metadata table to get all of the metadata and perform temporal/spatial aggregations. Temporal aggregations are performed using Postgres's native datetime comparisons, and spatial aggregations are performed by grouping substrings of quadkeys using Postgres's `substring` function.

Aggregating ~100k results at zoom level 6 and with monthly temporal buckets takes >20 seconds.
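For anyone unfamiliar with that pattern, it is roughly this; table names, column names, and the DSN are assumptions, not the actual NASA schema:

```python
# Sketch of the temp-table + quadkey-substring aggregation described above.
# Table names, column names, and the DSN are assumptions.
import psycopg2

conn = psycopg2.connect("dbname=similarity")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute("CREATE TEMP TABLE hits (file_id text PRIMARY KEY)")
    # ... insert the ids returned by the FAISS index into `hits` ...
    cur.execute("""
        SELECT substring(m.quadkey FROM 1 FOR 6) AS cell,    -- zoom level 6
               date_trunc('month', m.recorded_at) AS month,  -- monthly buckets
               count(*)
        FROM hits h
        LEFT JOIN metadata m USING (file_id)
        GROUP BY cell, month
    """)
    rows = cur.fetchall()
```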
Looking at features like #54, I'm starting to think about what a least-effort implementation for aggregates and filters would look like.
What if we can cap the total number of results we request from Milvus and then handle filtering, aggregation, counts, and downloads on that fixed set in the frontend?

This would overall create a nice user experience: aggregates and filters would be instant, downloads would be fast, and we could do some nice filters and counts. The load is kept in check by the limit on the total number of results from Milvus, and there are no other implications for our existing infrastructure.
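Concretely, once the capped result set is in memory on the client, filters and counts are just list operations; a sketch, where the field names are assumptions:

```python
# Sketch: instant filters and counts over a fixed, in-memory result set.
# Field names ("site", "distance") are assumptions.
from collections import Counter

# `results` stands in for the capped list returned by the single search request.
results = [
    {"file_id": "a", "site": "north", "distance": 4.2},
    {"file_id": "b", "site": "south", "distance": 5.9},
]

def filter_by_distance(items, max_distance):
    return [r for r in items if r["distance"] <= max_distance]

def counts_by_site(items):
    return Counter(r["site"] for r in items)

filtered = filter_by_distance(results, max_distance=5.0)
print(len(filtered), counts_by_site(filtered))
```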
@oliverroick what do you think? This is a bit more work on the frontend and could put our budget in trouble but let's see what's possible.
cc @developmentseed/google-bioacoustics