gbif / pipelines

Pipelines for data processing (GBIF and LivingAtlases)
Apache License 2.0

Coordinate rounding should use seven digits #466

Open tucotuco opened 3 years ago

tucotuco commented 3 years ago

It is not clear where the choice to use six digits for rounding came from. Georeferencing best practices have said to use seven digits since 2001. The reason is to ensure that transforms back and forth between coordinate systems or formats are preserved without coordinate drift. The effect of using six digits is that georeferences done following best practices will all be flagged, which is an unfortunate result for those actually making the effort to do things right. I recommend changing the flag to use seven digits (https://github.com/gbif/pipelines/blob/4b2e511dc64cf38645f3bc79db9077b351c3502f/sdks/core/src/main/java/org/gbif/pipelines/core/parsers/location/parser/CoordinateParseUtils.java#L56).

See https://docs.gbif-uat.org/georeferencing-quick-reference-guide/1.0/en/#s-coordinate-format and https://github.com/VertNet/georefcalculator/blob/eb1d7ad4b92a523c7e5c649764d1fbe9dfe896c4/source/python/point.py#L27
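
To illustrate the effect described above, here is a minimal sketch of a round-and-flag check. It is not the code in `CoordinateParseUtils`; the class name, `round`, and `wouldBeFlagged` are hypothetical, and the sketch only assumes that a value is flagged when rounding changes it.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical sketch: not the actual GBIF pipelines implementation.
class CoordinateRoundingSketch {

  // Round a decimal-degree value to the given number of decimal places.
  static double round(double value, int decimals) {
    return BigDecimal.valueOf(value)
        .setScale(decimals, RoundingMode.HALF_UP)
        .doubleValue();
  }

  // Assumption: a coordinate is flagged when rounding alters its value.
  static boolean wouldBeFlagged(double value, int decimals) {
    return Double.compare(round(value, decimals), value) != 0;
  }

  public static void main(String[] args) {
    double lat = 52.1234567; // seven decimals, as best practice recommends
    System.out.println(wouldBeFlagged(lat, 6)); // true
    System.out.println(wouldBeFlagged(lat, 7)); // false
  }
}
```

Under that assumption, a coordinate supplied with seven decimals is always flagged at six digits and never at seven.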

timrobertson100 commented 3 years ago

This is probably carried over from the days when we had to reduce cardinalities for search technology. Originally we had 5DP (~1m precision) in very old generations of indexing on MySQL, but when things progressed that was increased to 6DP (~10cm). My guess is that we assumed ~10cm precision was plenty for a global search system.
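
The precision figures above can be checked with a quick calculation. This sketch assumes roughly 111,320 m per degree, which holds for latitude and for longitude at the equator only; the class and method names are illustrative.

```java
// Rough ground distance represented by one step in the last decimal place.
class DecimalPlacePrecision {

  // Assumption: about 111.32 km per degree (latitude, or longitude at the equator).
  static final double METERS_PER_DEGREE = 111_320.0;

  static double metersPerStep(int decimals) {
    return METERS_PER_DEGREE / Math.pow(10, decimals);
  }

  public static void main(String[] args) {
    for (int dp = 5; dp <= 7; dp++) {
      System.out.printf("%dDP ~ %.4f m%n", dp, metersPerStep(dp));
    }
  }
}
```

With this approximation, 5DP is about 1.1 m, 6DP about 11 cm, and 7DP about 1.1 cm.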

~Increasing to 7DP shouldn't be an issue since we geohash for most geo search now, but we should keep an eye on the batch map tile pyramid build performance. It needs to run every 2hrs in an acceptable time. If necessary we can apply the grouping in the map build but I suspect it won't be an issue.~

@MattBlissett correctly highlights it might impact cache hit ratios.

MattBlissett commented 3 years ago

I think we could keep 7 digits, but query the geocoding tables after rounding to 5-6 digits -- the source maps aren't accurate to centimetre precision anyway, and the queries return multiple possible locations within several kilometres of any borders (ordered by distance from the point).

Otherwise, we would massively increase the processing needed for reinterpretation with new geo layers.
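
A minimal sketch of that idea: the record keeps its full-precision coordinate, while the geocode lookup is keyed on a coarser rounding so nearby points share a cache entry. `RoundedGeocodeCache` and the `backend` function are hypothetical stand-ins, not the actual geocoding service, and the sketch returns a single string rather than the distance-ordered list of candidate locations.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical sketch: cache geocode lookups under a coarsely rounded key.
class RoundedGeocodeCache {
  private final int keyDecimals;
  private final BiFunction<Double, Double, String> backend; // stand-in for the real lookup
  private final Map<String, String> cache = new HashMap<>();
  private int backendCalls = 0;

  RoundedGeocodeCache(int keyDecimals, BiFunction<Double, Double, String> backend) {
    this.keyDecimals = keyDecimals;
    this.backend = backend;
  }

  private static BigDecimal round(double value, int decimals) {
    return BigDecimal.valueOf(value).setScale(decimals, RoundingMode.HALF_UP);
  }

  // The caller's coordinate is untouched; only the lookup key is rounded.
  String lookup(double lat, double lon) {
    BigDecimal keyLat = round(lat, keyDecimals);
    BigDecimal keyLon = round(lon, keyDecimals);
    String key = keyLat.toPlainString() + "," + keyLon.toPlainString();
    return cache.computeIfAbsent(key, k -> {
      backendCalls++;
      return backend.apply(keyLat.doubleValue(), keyLon.doubleValue());
    });
  }

  int backendCalls() {
    return backendCalls;
  }
}
```

Two 7DP points that agree at 5DP then cost one backend query instead of two, which is what would keep the cache hit ratio from dropping.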

timrobertson100 commented 3 years ago

> Otherwise, we would massively increase the processing needed for reinterpretation with new geo layers.

Should we verify this? We might find that rounding to 7DP doesn't actually increase the number of distinct points significantly (how many records really have different coordinates within a radius of ~10cm?).
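
One way to check this on a sample would be to count distinct points at each rounding level. The sketch below is only an illustration on an in-memory list; the class and method names are hypothetical, and a real check would run against the occurrence index.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: compare the number of distinct points at 6DP and 7DP.
class DistinctPointCount {

  private static String round(double value, int decimals) {
    return BigDecimal.valueOf(value)
        .setScale(decimals, RoundingMode.HALF_UP)
        .toPlainString();
  }

  // Each point is a {latitude, longitude} pair in decimal degrees.
  static long distinctAt(List<double[]> points, int decimals) {
    Set<String> keys = new HashSet<>();
    for (double[] p : points) {
      keys.add(round(p[0], decimals) + "," + round(p[1], decimals));
    }
    return keys.size();
  }
}
```

If `distinctAt(points, 7)` is close to `distinctAt(points, 6)` on real data, the extra digit adds little geocoding work.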