Discussion: Unpaginated Search Results and Supporting a Dynamic Search Results Map Layer

whatisgalen commented 11 months ago

The default behavior in the Arches SearchView is to paginate the search request to the appropriate django view, search.py. While this UX is natural to a paginated UI component like the Search Results component, it presents a fundamental constraint to an inherently unpaginated UI component like the Map Filter. The available UX in the Map Filter component enables a user to either see every single resource instance of a given resource model via the resource layer, a page-limit’s worth of search results from their query, or a combination of both.

Not having an unpaginated Map UX is a huge missed opportunity for Arches given that the map filter highlights the first-class status of mappable data, and given the robust power of the combinability of the other search filter components. Why hasn’t it already been done? Well there are technical challenges, of course.

We implemented the desired functionality in HistoricPlacesLA and there were a few core changes to make it run in a timely manner:

Enable Backend Caching of Search results For the largest queries by result volume, many queries were repeated by a lot of the users. For “Anonymous” users in particular, given that they all had identical data permissions, it made sense to cache one set for all of them. The cache key would have to capture the whole query and also include some snapshot of the database in case search results had been mutated. We used the count from the EditLog model. Determining when these caches timeout (outside of the snapshot being embedded in the cache key) is a matter of preference and should be its own setting at the project level. Without embedding backend caching, the demand on Elasticsearch could increase significantly given the desired load (see next point).
Enable Unpaginated Search results To paint all the features on the map, clustered or in whatever layer you style, a search query would need to request all search results instead of a single page’s worth. This means the default query in SearchView should be able to interrogate the instance settings for whether to paginate every search query. This means the search view should be able to handle the case of no paging-filter present on the query object.
Enable Frontend & Backend Caching of Search Result Geometries Because the FeatureCollections of points and polygons tend to eat up compute during JSON conversion, there’s a lot of utility in caching a “master” lookup of geometry to lighten the payload of any subsequent query’s search results. In HistoricPlacesLA, a compressed geometry lookup for 50,000 mixed point/polygons can be cached to the browser at about 3mb. There is a hard limit of 5mb since browsers don’t support larger browser caches and a real tradeoff to consider regarding whether it’s better to frontload first to save time later vs waiting a few seconds longer for those larger geom-heavy search results. This should be its own optional flow to be configured by the implementer, but it is a useful pattern that merits being available rather than reinvented.
Create a Map Layer to cluster Search results The behavior of a dynamic search map differs from both the search-result layer and the resource-layer stylings. A functional map layer for unpaginated search results should by default have clustering behavior similar to the resource-layer. There isn’t a compelling reason not to enable UI-supported styling for this overlay map-layer, but in the near term it could be styled directly in the json similar to any other overlay.

Example Implementation: HPLA HistoricPlacesLA has over 50,000 resources with geometries. For HPLA, our overall strategy was to cache both: A) a “master” set of search result geometries on the backend and B) literally every search result set. When a user lands on the home page (index.html), the page would request the cached geometry lookup from the backend via search_results and then compress and cache this set in the client’s browser localStorage cache. Typical wait time for this to complete was 3-8 seconds. Half of the runtime goes to compression of the geom lookup.

When the user proceeds to Search, their search query returns an unpaginated result set sans any point or polygons (best case from the backend search result cache). On the frontend, each search result gets compared against the frontend geom cache and if matched, gets “hydrated” with a cached geom. The ligher payload (even when no cached results exist) makes a noticeable difference in load time.

The end product is a dynamic, unpaginated map query that does what you expect it to. While the default functionality in SearchView runs in less than 1 second, for any project that has even half of the geoms HPLA has, the payoff of a dynamic map will likely merit the extra 2-3 seconds spent waiting.

I think there’s a strong case to be made that this functionality deserves to be “opt-in” in core Arches.

robgaston commented 11 months ago

Related to this, though not well documented, Arches provides data sources on the search map that represent tunable geohash aggregations of the entire search result set. these are similar to, though not exactly, the suggested clustered search results layer without the additional overhead of the described solution. this is not to say that the suggested solution is not appropriate, but simply that we should acknowledge this existing functionality intended to support showing the entire aggregated result set on the map in this discussion.

for reference, Geohash Grid aggregations effectively return cells (whose geographic size can be tuned) with result counts in each cell. this can be useful in conveying density (using heat maps, or bins) but not exact result locations.

robgaston commented 11 months ago

I am wondering if anyone has explored using Geotile grid aggregations and if this solution might provide an even better fit here? seems worth some investigation, but it only seems to return tile {z}/{x}/{y} locations and counts... i think some other service to retrieve queried tiles may be necessary to actually visualize the data and I'm not sure if/how elasticsearch provides that

EDIT I think the Vector tile search API may provide a way to resolve the actual tile data...

chiatt commented 11 months ago

I think it would be great to show all search results in a single request, but 50k results, although not small, is still relatively modest. Projects with over 150k geoms are not uncommon and would probably exceed the 5mb local storage limit. Using geohash aggregations is a nice alternative because it can scale to much larger projects. That's probably true for Vector Tile Search as well.

Either way, backend caching of ES results by user to reduce the need for repeated permission filtering (as @whatisgalen suggests) seems like it could improve performance without too much effort.

whatisgalen commented 11 months ago

I think it would be great to show all search results in a single request, but 50k results, although not small, is still relatively modest. Projects with over 150k geoms are not uncommon and would probably exceed the 5mb local storage limit. Using geohash aggregations is a nice alternative because it can scale to much larger projects. That's probably true for Vector Tile Search as well.

I agree that the projects with extra large geoms will need a different approach, and I also think that the majority of Arches Implementations out there are well below even LA's 50k limit. There's a particular un-intuitiveness to the default UX for search results on the map that could be resolved with little to no performance hit for smaller Arches projects.

robgaston commented 11 months ago

I would hope that a vector tile based approach would work for everyone. the problem with loading the geometries up front that I see is that it blocks the initial load of the map. Vector tiles would defer the loading of map data without blocking the initial load. each tile request should go through relatively quickly, and since they are simplified users would never be required to load every coordinate pair in their data set to see geometries on the map.

The hurdle i see is this: we can't really control the requests that mapbox gl js makes to a vector tile service at a low level, so the search parameters (ie the generated DSL) will likely need to be stored somehow on the backend (per session? user probably wont work since anonymous users share a user) and recalled when a user requests the tiles in a Django view that will return the vector tiles. I am working through a little proof of concept storing these per user to see if this all is doable, but the actual solution will need to account for all sessions including anonymous users. I think that effectively caching the DSL will also optimize the performance here so that it doesn't need to be regenerated per each tile request.

mradamcox commented 11 months ago

Just in case it's helpful, I'll describe something similar that we have in the FPAN HMS Arches implementation, which is tied to a rule-based, resource-level permission system we have implemented through a custom Search Component. Different user accounts need to be able to see only certain subsets of a certain resource model's resource instances, so I wanted the Resource Overlay layer geometries to always reflect the current user's permissions.

Essentially, it's accomplished by acquiring a list of valid resource ids via the custom Search Component I made, and then injecting a where clause with those ids into Postgres query in the MVT() view.

What this means is that there is actually a second ES query run from within the MVT view that returns only resources ids, which I thought may be a problem but it hasn't been at all. As it's set up, there is a hard 10k limit on this query, but I suppose it could be paginated easily enough for higher numbers.

This has worked well for us. As far as numbers of geometries go, full number of resources is 40k (these show up on the map without clustering without any trouble at all) and the highest subset that is requested is probably in the 5k range (i.e. 5k individual resource ids being passed through that where clause).

robgaston commented 11 months ago

@mradamcox that's a very interesting idea and may work well as a solution here... a thing I like about it is that the overhead is all built into the tile requests, which means no user interaction is blocked.

It seems that the Elasticsearch Vector Tile API would work very well, but for one sticking point: you cannot do aggregations against geoshape data without a paid subscription. So in my little proof of concept, i was able to make it work, but it would only work with our indexed points, and if i pointed it to the complete geometry i would get the following error: AuthorizationException(403, 'search_phase_execution_exception', 'current license is non-compliant for [geotile_grid aggregation on geo_shape fields]') so it seems like this approach would be great for users who can provide such a license, but pretty severely limited for those who cannot.

mradamcox commented 11 months ago

Yeah, I figured the more I could throw at postgres the better, just didn't have the wherewithal to actually dig to deeply into the SQL... so a really long where clause of resource ids is what I ended up with. I also wanted to be generating the custom ES DSL rule in only one place, so this allows the same logic to 1) be injected into the main search results as a hidden filter, and 2) be reused for the geometries here.

chiatt commented 11 months ago

Another nice advantage of leveraging the MVT view is that the results are cached by user

mradamcox commented 11 months ago

I did try that out but actually have a permanent "cache bust" because I would need to invalidate caches if a user's permissions were updated and was not able to look into setting that up. But yeah, there is a lot of potential for caching tiles based on different criteria of the search itself.

whatisgalen commented 10 months ago

It seems that the Elasticsearch Vector Tile API would work very well, but for one sticking point: you cannot do aggregations against geoshape data without a paid subscription. So in my little proof of concept, i was able to make it work, but it would only work with our indexed points, and if i pointed it to the complete geometry i would get the following error: AuthorizationException(403, 'search_phase_execution_exception', 'current license is non-compliant for [geotile_grid aggregation on geo_shape fields]') so it seems like this approach would be great for users who can provide such a license, but pretty severely limited for those who cannot.

Could a workaround using this solution be to

make the minZoom of the search layer polygons be something really high like 18 and
have a query or filter of some kind against the vector tiles of the resource layer?

robgaston commented 10 months ago

@whatisgalen if you could store the DSL query on the backend, then you could have a search MVT service that could recall that DSL and query elastic for ids for all results and use that in a SQL filter similar to what @mradamcox described. the zoom level bit shouldn't really matter then because you will get the same number of requests per zoom level based on the size of the map, and the payload size should be constrained by the simplifying logic in generating the vector tiles. I'm sure there are exceptions where the payload size might get quite large, but seems like you wouldn't need to limit the zoom levels by default.

one possible idea could be hashing the query DSL and including the hash as a a token in the vector tile service URL somehow. this way the DSL only needs to be generated once (and the token returned with the paged results) but could be reused by each vector tile request. something like search_mvt/{token}/{zoom}/{x}/{y}

whatisgalen commented 1 week ago

check out this branch that uses a geotileGrid aggregation to do clusters below zoom level 14 and then individual features above that, all from the index. It's testable!

archesproject / arches

Discussion: Unpaginated Search Results and Supporting a Dynamic Search Results Map Layer #10502