gbif / occurrence

Occurrence store, download, search
Apache License 2.0

Very wide, complex multipolygons are very slow to process #268

Closed MattBlissett closed 2 years ago

MattBlissett commented 3 years ago

This download's Hive query took almost 12 hours to complete (183 days of CPU time), because the complex polygon optimization didn't exclude any meaningful amount of data.

The optimization assumes (as in the test cases) that a multipolygon is something like an island group, so the GTE/LTE predicates added for its bounding box would exclude most of the data.

This multipolygon covers most of the world, so its bounding box excluded very little and there was little benefit. A disjunction of bounding-box predicates, one for each polygon within the multipolygon, would be better. Alternatively, multipolygons in Within queries could be converted to a disjunction of polygons before the existing optimization is applied.
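
As a rough illustration of the per-polygon idea (not the project's actual code), the sketch below builds one bounding-box clause per polygon and ORs them together. The function names, the plain-tuple geometry representation, and the column names `decimallongitude` / `decimallatitude` are all assumptions made for the example. It also ignores polygons that cross the antimeridian, which would need separate handling.

```python
from typing import List, Tuple

Point = Tuple[float, float]                 # (longitude, latitude)
Ring = List[Point]                          # exterior ring of one polygon
BBox = Tuple[float, float, float, float]    # (min_lon, min_lat, max_lon, max_lat)


def bounding_box(points: Ring) -> BBox:
    """Axis-aligned bounding box of a list of (lon, lat) points."""
    lons = [lon for lon, _ in points]
    lats = [lat for _, lat in points]
    return (min(lons), min(lats), max(lons), max(lats))


def bbox_area(box: BBox) -> float:
    """Area of a bounding box in square degrees (a crude selectivity proxy)."""
    min_lon, min_lat, max_lon, max_lat = box
    return (max_lon - min_lon) * (max_lat - min_lat)


def bbox_clause(box: BBox) -> str:
    """GTE/LTE predicate for one bounding box (column names are assumed)."""
    min_lon, min_lat, max_lon, max_lat = box
    return (
        f"(decimallongitude >= {min_lon} AND decimallongitude <= {max_lon}"
        f" AND decimallatitude >= {min_lat} AND decimallatitude <= {max_lat})"
    )


def single_bbox_predicate(multipolygon: List[Ring]) -> str:
    """One bounding box around every polygon: weak for very wide inputs."""
    all_points = [p for ring in multipolygon for p in ring]
    return bbox_clause(bounding_box(all_points))


def per_polygon_predicate(multipolygon: List[Ring]) -> str:
    """Disjunction of one bounding-box clause per polygon."""
    clauses = [bbox_clause(bounding_box(ring)) for ring in multipolygon]
    return "(" + " OR ".join(clauses) + ")"


# Two small squares on opposite sides of the world.
west = [(-170, -10), (-160, -10), (-160, 0), (-170, 0), (-170, -10)]
east = [(160, 40), (170, 40), (170, 50), (160, 50), (160, 40)]
multipolygon = [west, east]

single_area = bbox_area(bounding_box(west + east))
split_area = sum(bbox_area(bounding_box(ring)) for ring in multipolygon)
print(single_area, split_area)
print(per_polygon_predicate(multipolygon))
```

With these two squares, the single bounding box spans nearly the whole longitude range, while the per-polygon boxes together cover a small fraction of that area. The cost is a longer predicate, so for multipolygons with very many parts some merging of nearby boxes would probably be needed.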

timrobertson100 commented 3 years ago

We might also explore whether the ESRI UDF already performs optimizations like this natively.