BiologicalRecordsCentre / iRecord

Repository to store and track enhancements, issues and tasks regarding the iRecord website.
http://irecord.org.uk

Filtering by grid ref brings in records from adjoining grid refs #1570

Open kitenetter opened 1 year ago

kitenetter commented 1 year ago

On the explore and verification pages there is an option to filter by grid ref, which finds the records in the grid square requested but often brings in additional records from adjoining squares. Last time we looked into this it appeared to be the result of anomalies in the spatial querying, due to overlaps and slight misalignments between projections etc.

It may be possible to address this by developing enhancements to the ES search code, or using a query based on grid ref text strings.

(Issue originally raised under #1529)

johnvanbreda commented 1 year ago

I think the main issue here is simply that Elasticsearch's geo_shape query supports an intersects relation, but that also matches adjacent grid squares, since they share an edge with the requested one. There is also a within relation, but that doesn't reliably pick up squares that are inside but touch the edge (presumably for similar reasons, as the lines won't quite align due to projection differences). I've experimented with a within query based on a slightly enlarged grid square (by a 10,000th of a degree, less than a metre) and that gave reasonable results, but it may need more testing.
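For illustration only, here is a minimal Python sketch of what a within query on a slightly enlarged square might look like as an Elasticsearch request body. The field name `location.geom`, the function name, and the way each corner is pushed outwards are assumptions made for the example, not the code iRecord actually uses; the corners are assumed to be already converted to WGS84 longitude/latitude.

```python
def enlarged_within_query(corners, field="location.geom", buffer_deg=0.0001):
    """Build a geo_shape 'within' query for a grid square grown slightly.

    corners: four (lon, lat) tuples for the square, in ring order.
    field: hypothetical name of the geo_shape field in the index.
    buffer_deg: how far to push each corner outwards, in degrees.
    """
    # Centre of the square, used to decide which way is "outwards".
    centre_lon = sum(lon for lon, _ in corners) / len(corners)
    centre_lat = sum(lat for _, lat in corners) / len(corners)

    def push(value, centre):
        # Move a coordinate away from the centre by the buffer.
        return value + buffer_deg if value > centre else value - buffer_deg

    ring = [[push(lon, centre_lon), push(lat, centre_lat)] for lon, lat in corners]
    ring.append(ring[0])  # polygon rings must be closed

    return {
        "query": {
            "geo_shape": {
                field: {
                    "shape": {"type": "polygon", "coordinates": [ring]},
                    "relation": "within",
                }
            }
        }
    }


# Example: an arbitrary 0.1 x 0.1 degree box standing in for a grid square.
query = enlarged_within_query([(0.0, 52.0), (0.1, 52.0), (0.1, 52.1), (0.0, 52.1)])
```

The buffer needs to be large enough to absorb the projection mismatch but small enough not to swallow records that genuinely sit in a neighbouring square, which is the part that would need testing.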

DavidHepper commented 1 year ago

This is unsettling. I didn't know it used a geo_shape query; I had assumed it would use OS grid ref manipulation based on the output map ref, such that the input:

TL04

should match any of (for example):

TL04 TL04Y TL0741 TL078412 etc.

DavidHepper commented 1 year ago

I also have a suggestion for an efficient way to achieve this kind of query.

johnvanbreda commented 1 year ago

Any text matching algorithm would be tied to the area in which a particular grid notation is used - this won't matter much, but does mean that queries for grid squares around the gap between Northern Ireland and Scotland might seem odd in that they would be restricted to one side of an arbitrary dividing line. I suspect that's a minor issue as long as it's understood.

Feel free to send me any thoughts on the query process, but bear in mind that we are primarily concerned with Elasticsearch querying (although the code does support both ES and PostgreSQL). We can use a regular expression pattern match or a wildcard search in Elasticsearch, but it is then an un-indexed search so will be relatively slow and costly. There is a new wildcard field type that optimises this (ES 7.9+), or we could use a separate indexed field of the two-character 100km grid codes to narrow the search and improve performance. None of these options are trivial though.
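As a rough sketch of the pattern-match option combined with the 100km narrowing field, the following Python builds a request body for a hectad such as TL04. The field names `location.output_sref` and `location.grid_100km` are hypothetical, and the sketch assumes British two-letter grid refs stored upper case without spaces in a keyword-style field. Note that a plain prefix wildcard would not be enough, because TL0741 belongs to TL04 even though it does not start with "TL04".

```python
import re


def hectad_regexp_query(hectad, ref_field="location.output_sref",
                        square_field="location.grid_100km", max_digits=5):
    """Build a bool query matching grid refs that fall inside a hectad.

    hectad: a 10km square such as "TL04".
    ref_field / square_field: hypothetical index field names.
    max_digits: longest easting/northing length to allow for.
    """
    letters, east, north = hectad[:2], hectad[2], hectad[3]
    # The hectad itself, optionally followed by a DINTY tetrad letter (no "O").
    alternatives = [f"{east}{north}[A-NP-Z]?"]
    # Finer refs: easting and northing each gain the same number of digits.
    for extra in range(1, max_digits):
        alternatives.append(f"{east}[0-9]{{{extra}}}{north}[0-9]{{{extra}}}")
    pattern = f"{letters}({'|'.join(alternatives)})"
    return {
        "query": {
            "bool": {
                "filter": [
                    # Cheap indexed term filter to narrow the candidates first.
                    {"term": {square_field: letters}},
                    # Regexp queries match against the whole term.
                    {"regexp": {ref_field: {"value": pattern}}},
                ]
            }
        }
    }


# Check the generated pattern locally against the refs from the example above.
body = hectad_regexp_query("TL04")
pattern = body["query"]["bool"]["filter"][1]["regexp"]["location.output_sref"]["value"]
matches = [ref for ref in ["TL04", "TL04Y", "TL0741", "TL078412", "TL14", "TL4007"]
           if re.fullmatch(pattern, ref)]
```

The pattern only uses character classes, repetition counts and alternation, so the same string can be checked with Python's `re.fullmatch` before being sent to Elasticsearch.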

DavidHepper commented 1 year ago

My suggestion is to recompute a field for each larger square whenever the Output Map Ref (OMR) of a record is recomputed: Myriad, Hectad, Tetrad, Monad, etc. are easily derived from a finer grid ref. Set the fields for squares smaller than the record's own OMR to null. Index these fields, and whenever a Where condition on Grid Reference is requested, the test is a direct match against the field of the same size as the requested square, with no wildcards required. (This does not and should not match records with a vaguer OMR, though in theory the occurrence could be within the requested square.)
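To make the derivation concrete, here is a small Python sketch of how the coarser squares could be computed from one grid ref. It assumes British two-letter refs and the standard DINTY tetrad lettering; the function name and the dictionary keys are made up for the example and do not correspond to existing iRecord fields.

```python
# DINTY tetrad letters: A-Z without O, running south to north within each
# column, columns running west to east across the hectad.
DINTY = "ABCDEFGHIJKLMNPQRSTUVWXYZ"


def containing_squares(grid_ref):
    """Return the myriad, hectad, tetrad and monad containing a grid ref.

    Squares finer than the input's own precision are left as None.
    """
    ref = grid_ref.replace(" ", "").upper()
    letters, rest = ref[:2], ref[2:]
    fields = {"myriad": letters, "hectad": None, "tetrad": None, "monad": None}

    # Tetrad notation, e.g. TL04Y: two digits plus a DINTY letter.
    if len(rest) == 3 and rest[:2].isdigit() and rest[2] in DINTY:
        fields["hectad"] = letters + rest[:2]
        fields["tetrad"] = ref
        return fields

    if rest and (not rest.isdigit() or len(rest) % 2):
        raise ValueError(f"Unrecognised grid ref: {grid_ref!r}")

    half = len(rest) // 2
    east, north = rest[:half], rest[half:]
    if half >= 1:
        fields["hectad"] = letters + east[0] + north[0]
    if half >= 2:
        fields["monad"] = letters + east[:2] + north[:2]
        index = (int(east[1]) // 2) * 5 + int(north[1]) // 2
        fields["tetrad"] = fields["hectad"] + DINTY[index]
    return fields


squares = containing_squares("TL078412")
# squares == {"myriad": "TL", "hectad": "TL04", "tetrad": "TL04Q", "monad": "TL0741"}
```

With fields like these indexed, a request for TL04 becomes an exact match on the hectad field and a request for SK5447 an exact match on the monad field.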

NBN Atlas gets these tests right but I've not asked about the algorithm. Here's an example link to find all twelve BDS Recording Scheme records within SK5447: https://records.nbnatlas.org/occurrences/search?q=grid_ref_1000%3ASK5447+AND+data_provider_uid%3Adp97&nbn_loading=true&fq=-occurrence_status%3A%22absent%22#tab_mapView

johnvanbreda commented 1 year ago

Thanks @DavidHepper, a sensible approach that would work in Elasticsearch. I will leave it to @kitenetter to prioritise, and we can compare the results of this approach vs the within query when doing the task.

DavidHepper commented 6 months ago

Bump.