Allow caching dataset geometry

mx-moth commented 1 year ago

Dataset geometry as provided in Format.polygons could be cached. This would allow quicker repeated operations on known datasets.

Possible interface

# First run: build the geometry once and write it to a cache file.
import emsarray
import emsarray.cache

dataset = emsarray.tutorial.open_dataset('austen')
emsarray.cache.dump_geometry(dataset, "austen.wkb")

# Later runs: read the cached geometry instead of rebuilding it.
import emsarray
import emsarray.cache

dataset = emsarray.tutorial.open_dataset('austen')
emsarray.cache.load_geometry(dataset, "austen.wkb")

Discussion

Geometry can be cached using a WKB GeometryCollection. This format is standalone and unencumbered (unlike Shapefiles), and is understood by many readers (less important, since this is an 'internal' representation)...
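A minimal sketch of what the WKB round trip could look like, assuming the polygons are available as a plain sequence of Shapely polygons. The names `dump_polygons` and `load_polygons` are hypothetical and are not part of emsarray; cells with no polygon (None or empty) are deliberately not handled here.

```python
from pathlib import Path

import shapely.wkb
from shapely.geometry import GeometryCollection


def dump_polygons(polygons, path):
    """Write a sequence of polygons to `path` as one WKB GeometryCollection.

    Hypothetical helper: assumes every item is a non-empty Shapely geometry.
    """
    collection = GeometryCollection(list(polygons))
    Path(path).write_bytes(shapely.wkb.dumps(collection))


def load_polygons(path):
    """Read the polygons back, in the order they were written."""
    collection = shapely.wkb.loads(Path(path).read_bytes())
    return list(collection.geoms)
```

Storing everything in a single collection keeps the cache to one file and relies on the collection preserving member order, so index `i` in the loaded list would correspond to index `i` in the original sequence.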

Making polygons is one of the most expensive operations when opening a dataset, and most emsarray operations depend on geometry, so caching it makes sense. Should we leave open the option of caching more things? Perhaps the cache could be a .tar archive, with each cached item stored as a file inside it. emsarray.cache.dump / emsarray.cache.load could then call each dump_foo / load_foo in turn.
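The .tar idea could be sketched with the standard library alone. This assumes each dump_foo produces a bytes payload; `dump_cache` and `load_cache` are hypothetical names, and the member names used in the archive are illustrative only.

```python
import io
import tarfile


def dump_cache(path, entries):
    """Write each named bytes payload as one file inside a tar archive.

    `entries` maps a member name (for example "geometry.wkb") to the bytes
    produced by the matching, hypothetical, dump_foo function.
    """
    with tarfile.open(path, "w") as archive:
        for name, payload in entries.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))


def load_cache(path):
    """Read every regular file in the archive back into a name -> bytes dict."""
    with tarfile.open(path, "r") as archive:
        return {
            member.name: archive.extractfile(member).read()
            for member in archive.getmembers()
            if member.isfile()
        }
```

A caller would then hand each payload to the matching load_foo, ignoring member names it does not recognise, which leaves room for caching more things later.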

Do we bother with cache invalidation, i.e. detecting when the model geometry has changed? I am unsure how to do this without recomputing the entire geometry. We could possibly compute a geometry hash from (emsarray version, format class, geometry variables), as long as computing the hash is not detrimentally slow.
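One way the hash might look, as a sketch only. It assumes the geometry variables can be supplied as raw bytes (for example via `ndarray.tobytes()`), and `geometry_hash` is a hypothetical name. The hash covers the inputs to polygon construction rather than the polygons themselves, so the geometry does not need to be rebuilt to check the cache.

```python
import hashlib


def _update(digest, chunk):
    # Length-prefix each chunk so ("ab", "c") and ("a", "bc") hash differently.
    digest.update(len(chunk).to_bytes(8, "big"))
    digest.update(chunk)


def geometry_hash(version, format_class, variables):
    """Hash the inputs that determine the geometry.

    `version` and `format_class` are strings; `variables` maps a variable
    name to its raw bytes. All three are assumptions about what a real
    implementation would be given.
    """
    digest = hashlib.sha256()
    _update(digest, version.encode("utf-8"))
    _update(digest, format_class.encode("utf-8"))
    # Sort by name so the result does not depend on mapping order.
    for name in sorted(variables):
        _update(digest, name.encode("utf-8"))
        _update(digest, bytes(variables[name]))
    return digest.hexdigest()
```

The stored hash could sit alongside the cached geometry and be compared on load; a mismatch would mean the cache is stale. The cost is one pass over the coordinate data, which is likely far cheaper than building polygons but is not free.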

mx-moth commented 1 year ago

shapely.wkb.dumps() / shapely.wkb.loads() exist and work, with one caveat:

>>> import shapely.wkb
>>> from shapely.geometry import Polygon, GeometryCollection
>>> empty = Polygon()
>>> isinstance(empty, Polygon)
True
>>> empty.is_empty
True
>>> round_trip = shapely.wkb.loads(shapely.wkb.dumps(empty))
>>> isinstance(round_trip, Polygon)
False
>>> isinstance(round_trip, GeometryCollection)
True
>>> round_trip.is_empty
True

Empty polygons come back as empty GeometryCollections for Reasons. This is easy enough to detect so shouldn't concern us.
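Detecting the empty case could be as small as the following sketch. `loads_polygon` is a hypothetical wrapper, and it assumes the caller knows the bytes were originally written from a Polygon.

```python
import shapely.wkb
from shapely.geometry import GeometryCollection, Polygon


def loads_polygon(data):
    """Load WKB that was written from a Polygon, restoring the empty case."""
    geometry = shapely.wkb.loads(data)
    # An empty Polygon may come back as an empty GeometryCollection,
    # so map that case back to an empty Polygon.
    if isinstance(geometry, GeometryCollection) and geometry.is_empty:
        return Polygon()
    return geometry
```

Non-empty polygons pass through untouched, so the wrapper can be applied to every loaded geometry.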

mx-moth commented 1 year ago

Caching geometry is no longer relevant, as polygon construction has been sped up dramatically by using new interfaces introduced in Shapely 2.0.0.

csiro-coasts / emsarray

Allow caching dataset geometry #40
