csiro-coasts / emsarray

xarray extension that supports EMS model formats

Add method that generates a hash of geometry variables for caching #153

Open mx-moth opened 3 weeks ago


A proposed feature from a discussion with @frizwi. Some applications transform the geometry generated by emsarray, for example by triangulating the polygons, or otherwise derive application-specific data from the dataset geometry. These transformations can be computationally expensive. It would be beneficial if the derived data could be cached between runs of the application, and shared between datasets that have the same geometry, for example a dataset partitioned across time into multiple files. This proposal is for a new module emsarray.operations.cache which can generate a hash of the geometry of a dataset to assist applications in caching transformed geometry data. Specifically:

A new module emsarray.operations.cache with the function:

def make_cache_key(dataset: xarray.Dataset, hash_type: type[hashlib._Hash] = hashlib.sha1) -> str:
    """
    Generate a key suitable for caching data derived from the geometry of a dataset.

    Parameters
    ----------
    dataset : xarray.Dataset
        The dataset to generate a cache key from
    hash_type : hashlib-style hash constructor
        The kind of hash to use, called with no arguments to create a new hash instance.
        Defaults to `hashlib.sha1`, which is secure enough and fast enough for most purposes.
        The hash algorithm does not need to be cryptographically secure,
        so faster algorithms such as `xxhash` can be swapped in if desired.

    Returns
    -------
    cache_key : str
        A string suitable for use as a cache key.
        The string will be safe for use as part of a filename if data are to be cached to disk.

    Notes
    -----
    The cache key will depend on the Convention class,
    the emsarray version, and a hash of the geometry of the dataset.
    The specific structure of the cache key may change between emsarray versions
    and should not be relied upon.
    """
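
To make the Notes section concrete, here is a minimal sketch of how the three ingredients could be combined into one key. It is not the proposed implementation: `build_cache_key` is a hypothetical helper, the version string `"0.0.0"` is a placeholder, and the geometry is passed in as raw bytes rather than taken from a dataset.

```python
import hashlib
from typing import Callable


def build_cache_key(
    convention_name: str,
    emsarray_version: str,
    geometry_bytes: bytes,
    hash_type: Callable = hashlib.sha1,
) -> str:
    """Hypothetical helper: fold the three ingredients into one hex string."""
    h = hash_type()
    fields = (
        convention_name.encode("utf-8"),
        emsarray_version.encode("utf-8"),
        geometry_bytes,
    )
    for field in fields:
        # Length-prefix each field so ("ab", "c") and ("a", "bc") hash differently.
        h.update(len(field).to_bytes(8, "little"))
        h.update(field)
    # A hex digest contains only 0-9 and a-f, so it is safe inside a filename.
    return h.hexdigest()


# Placeholder inputs, for illustration only.
key = build_cache_key("ShocStandard", "0.0.0", b"example geometry bytes")

# Any hashlib-style constructor can be swapped in.
fast_key = build_cache_key(
    "ShocStandard", "0.0.0", b"example geometry bytes", hash_type=hashlib.blake2b
)
```

Including the emsarray version in the key means a new release invalidates old cache entries, which is the safe default if the hashing scheme changes between versions.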

A new method Convention.hash_geometry(hash) to hash the dataset geometry:

class Convention:
    def hash_geometry(self, hash: hashlib._Hash) -> None:
        """
        Update the provided hash with all of the relevant geometry data for this dataset.
        This method must be deterministic based only on the dataset,
        resulting in the same hash if called multiple times,
        even across different instances of the Python interpreter.
        Ideally the hash should be independent of any variable data
        such that datasets partitioned across time into multiple files on disk
        will result in identical hashes.

        Parameters
        ----------
        hash : hashlib-style hash instance
            The hash instance to update with geometry data.
            This must follow the interface defined in :mod:`hashlib`.
        """

This method would generate a hash of the dataset geometry using whichever properties are relevant for that convention. A default implementation that hashes the name, data, and attributes of all variables in Convention.get_geometry_variables() might be appropriate. Specific Convention subclasses can either extend this to hash additional information such as any relevant global attributes, or provide an entirely separate implementation.
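
A rough sketch of what such a default implementation might look like follows. To keep it runnable without emsarray or xarray, it uses a hypothetical `Variable` dataclass as a stand-in for a geometry variable and `array.array` as a stand-in for the variable data. A real implementation would work on the variables of the dataset itself, and the details here (sorting by name, length prefixes, `repr` for attribute values) are assumptions rather than part of the proposal.

```python
import hashlib
from array import array
from dataclasses import dataclass, field


@dataclass
class Variable:
    """Stand-in for a geometry variable: a name, numeric data, and attributes."""
    name: str
    data: array
    attrs: dict = field(default_factory=dict)


def _update(h, payload: bytes) -> None:
    # Length-prefix every field so adjacent fields cannot run together.
    h.update(len(payload).to_bytes(8, "little"))
    h.update(payload)


def hash_geometry(geometry_variables, h) -> None:
    """Feed the name, data, and attributes of each geometry variable into h."""
    # Sort by name so the result does not depend on iteration order.
    for var in sorted(geometry_variables, key=lambda v: v.name):
        _update(h, var.name.encode("utf-8"))
        _update(h, var.data.typecode.encode("ascii"))
        _update(h, var.data.tobytes())
        for attr_name in sorted(var.attrs):
            _update(h, attr_name.encode("utf-8"))
            # repr() is stable for plain strings and numbers, which is assumed here.
            _update(h, repr(var.attrs[attr_name]).encode("utf-8"))


lon = Variable("longitude", array("d", [147.0, 147.5, 148.0]), {"units": "degrees_east"})
lat = Variable("latitude", array("d", [-43.0, -42.5]), {"units": "degrees_north"})

# Two files of a time-partitioned dataset carry the same geometry variables,
# so they produce the same digest whatever other data each file holds.
first, second = hashlib.sha1(), hashlib.sha1()
hash_geometry([lon, lat], first)
hash_geometry([lat, lon], second)
```

Because only the geometry variables are fed into the hash, time-varying data never affects the result, which is what allows partitions of one dataset to share cached data.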

Applications can use this hash as a key when caching transformed data between different application instances or to reuse transformed data between different partitions of the same dataset:

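# `cache` is the module proposed in this issue and does not exist yet.
from emsarray.operations import cache, triangulate


class DatasetTriangulation:
    def __init__(self):
        # Maps cache keys to previously computed triangulations.
        self.triangulations = {}

    def triangulate(self, dataset):
        cache_key = cache.make_cache_key(dataset)
        # Reuse the triangulation if this geometry has been seen before.
        if cache_key in self.triangulations:
            return self.triangulations[cache_key]
        triangulation = triangulate.triangulate_dataset(dataset)
        self.triangulations[cache_key] = triangulation
        return triangulation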

This feature proposal does not include any caching inside emsarray itself, either in memory or on disk. Future extensions to emsarray may use these functions to cache and reuse the generated geometry for datasets.

This feature proposal does not include any methods that cache arbitrary derived geometry data. Actual cache implementations are left to applications to implement.
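
For illustration, an application-side disk cache built on such a key could look something like the sketch below. Everything in it is hypothetical and outside the scope of this proposal: the `cached_on_disk` helper, the `triangulation-<key>.pickle` file naming, and the use of `pickle` as the storage format are all assumptions.

```python
import pickle
from pathlib import Path
from typing import Any, Callable


def cached_on_disk(cache_dir: Path, cache_key: str, compute: Callable[[], Any]) -> Any:
    """Return the cached result for cache_key, computing and storing it if missing."""
    path = cache_dir / f"triangulation-{cache_key}.pickle"
    if path.exists():
        # Only load pickles from a cache directory the application itself controls.
        with path.open("rb") as f:
            return pickle.load(f)

    result = compute()
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so a crash cannot leave a partial cache file.
    tmp = path.with_suffix(".tmp")
    with tmp.open("wb") as f:
        pickle.dump(result, f)
    tmp.replace(path)
    return result
```

An application would pass the string returned by `make_cache_key(dataset)` as `cache_key` and wrap its expensive transformation in `compute`.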