A proposed feature from a discussion with @frizwi.

Some applications transform the geometry generated by emsarray, for example by triangulating the polygons, or otherwise derive application-specific data from the dataset geometry. These transformations can be computationally expensive. It would be beneficial if the derived data could be cached between runs of the application, and shared between instances of the same dataset geometry, for example when a dataset is partitioned across time into multiple files.

This proposal is for a new module emsarray.operations.cache which can generate a hash of the geometry of a dataset to assist applications in caching transformed geometry data. Specifically:
A new module emsarray.operations.cache with the function:
def make_cache_key(dataset: xarray.Dataset, hash_type: type[hashlib._Hash] = hashlib.sha1) -> str:
    """
    Generate a key suitable for caching data derived from the geometry of a dataset.

    Parameters
    ----------
    dataset : xarray.Dataset
        The dataset to generate a cache key from.
    hash_type : hash class
        The kind of hash to use.
        Defaults to `hashlib.sha1`, which is secure enough and fast enough for most purposes.
        The hash algorithm does not need to be cryptographically secure,
        so faster algorithms such as `xxhash` can be swapped in if desired.

    Returns
    -------
    cache_key : str
        A string suitable for use as a cache key.
        The string will be safe for use as part of a filename if data are to be cached to disk.

    Notes
    -----
    The cache key will depend on the Convention class,
    the emsarray version, and a hash of the geometry of the dataset.
    The specific structure of the cache key may change between emsarray versions
    and should not be relied upon.
    """
A new method Convention.hash_geometry(hash) to hash the dataset geometry:
class Convention:
    def hash_geometry(self, hash: hashlib._Hash) -> None:
        """
        Update the provided hash with all of the relevant geometry data for this dataset.

        This method must be deterministic based only on the dataset,
        resulting in the same hash if called multiple times,
        even across different instances of the Python interpreter.
        Ideally the hash should be independent of any variable data
        such that datasets partitioned across time into multiple files on disk
        will result in identical hashes.

        Parameters
        ----------
        hash : hashlib-style hash instance
            The hash instance to update with geometry data.
            This must follow the interface defined in :mod:`hashlib`.
        """
This method would generate a hash of the dataset geometry using whichever properties are relevant for that convention. A default implementation that hashes the name, data, and attributes of all variables in Convention.get_geometry_variables() might be appropriate. Specific Convention subclasses can either extend this to hash additional information such as any relevant global attributes, or provide an entirely separate implementation.
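As a rough illustration of the default implementation described above, the following sketch hashes the name, data, and attributes of a set of geometry variables and folds the result into a cache key together with a convention name and a version string. It is not the emsarray implementation: `GeometryVariable`, `update_with_bytes`, `hash_geometry_sketch`, and `make_cache_key_sketch` are hypothetical names, and a plain dataclass holding raw bytes stands in for the variables that Convention.get_geometry_variables() would return. With real xarray variables the bytes would presumably come from the underlying array (for example `variable.values.tobytes()`, along with the dtype and shape), which is an assumption rather than part of the proposal.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class GeometryVariable:
    """Hypothetical stand-in for a geometry variable: name, raw bytes, attributes."""
    name: str
    data: bytes
    attrs: dict = field(default_factory=dict)


def update_with_bytes(hash, value: bytes) -> None:
    # Length-prefix every field so adjacent fields cannot run together:
    # ("ab", "c") and ("a", "bc") must produce different hashes.
    hash.update(len(value).to_bytes(8, "little"))
    hash.update(value)


def hash_geometry_sketch(variables, hash) -> None:
    """Update `hash` with the name, data, and attributes of each variable."""
    # Sort by name so the result does not depend on iteration order.
    for variable in sorted(variables, key=lambda v: v.name):
        update_with_bytes(hash, variable.name.encode("utf-8"))
        update_with_bytes(hash, variable.data)
        # sort_keys gives a stable serialisation of the attribute dictionary.
        attrs = json.dumps(variable.attrs, sort_keys=True, default=str)
        update_with_bytes(hash, attrs.encode("utf-8"))


def make_cache_key_sketch(convention_name, version, variables, hash_type=hashlib.sha1) -> str:
    """Combine the convention name, a version string, and the geometry hash."""
    hash = hash_type()
    update_with_bytes(hash, convention_name.encode("utf-8"))
    update_with_bytes(hash, version.encode("utf-8"))
    hash_geometry_sketch(variables, hash)
    # A hex digest contains only the characters 0-9 and a-f,
    # so it is safe to use as part of a filename.
    return hash.hexdigest()
```

Because only geometry variables are passed in, two files that share a grid but hold different time slices would produce the same key under this scheme. Any object exposing the `hashlib` interface (`update()` and `hexdigest()`) can be passed as `hash_type`.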
Applications can use this hash as a key when caching transformed data between different application instances or to reuse transformed data between different partitions of the same dataset:
# `cache` is the module proposed above.
from emsarray.operations import cache, triangulate


class DatasetTriangulation:
    def __init__(self):
        # Maps cache keys to previously computed triangulations.
        self.triangulations = {}

    def triangulate(self, dataset):
        cache_key = cache.make_cache_key(dataset)
        if cache_key in self.triangulations:
            return self.triangulations[cache_key]
        triangulation = triangulate.triangulate_dataset(dataset)
        self.triangulations[cache_key] = triangulation
        return triangulation
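The in-memory example above only helps within a single process. To reuse derived data between runs, an application could store results on disk using the cache key as part of the filename. The sketch below is one possible application-side approach and is not part of this proposal; `cached_on_disk`, the `.pickle` suffix, and the use of `pickle` are all assumptions made for illustration.

```python
import pickle
from pathlib import Path
from typing import Any, Callable


def cached_on_disk(cache_dir: Path, cache_key: str, compute: Callable[[], Any]) -> Any:
    """Return the cached value for `cache_key`, computing and storing it if missing."""
    path = cache_dir / f"{cache_key}.pickle"
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)

    value = compute()
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file and then rename it into place,
    # so another process never reads a partially written cache file.
    tmp_path = path.with_name(path.name + ".tmp")
    with tmp_path.open("wb") as f:
        pickle.dump(value, f)
    tmp_path.replace(path)
    return value
```

An application might call this as `cached_on_disk(cache_dir, cache.make_cache_key(dataset), lambda: triangulate.triangulate_dataset(dataset))`. Pickle files should only be loaded from a cache directory the application trusts.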
This feature proposal does not include any caching inside emsarray itself, either in memory or on disk. Future extensions to emsarray may use these functions to cache and reuse the generated geometry for datasets.
This feature proposal does not include any methods that cache arbitrary derived geometry data. Actual cache implementations are left to applications to implement.