So far, we have changed caching to only work when the node is in fact a `CachingNode`. Because we have a `CachingNode`, I want to try to re-do our caching system to rebuild the `source` of the node which the `CachingNode` is caching.

Previously, it wasn't clear in which coordinate system we were caching. Now, with `CachingNode`, we can use the coordinate system of the `source` node. So, the idea is to create an on-disk or in-RAM Zarr file in the `source`'s coordinate system. We can have two fields: the cached data itself and a boolean `has_data` field marking which coordinates have been filled.
The naïve data-fetching algorithm would be (sketched below):

1. Receive a request at some requested coordinates.
2. Check the cache at the requested coordinates, using the `has_data` field to create `query_coords` for whatever is missing.
3. Evaluate the source at `query_coords` and fill the cache.
4. Return the data at the requested coordinates.
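To make that concrete, here is a minimal sketch of the algorithm using an in-memory Zarr group with `data` and `has_data` fields. All of the names and the slice-based request format are illustrative assumptions, not the PODPAC implementation.

```python
import numpy as np
import zarr

# In-RAM Zarr group; a DirectoryStore would give an on-disk cache instead.
root = zarr.group()
data = root.zeros("data", shape=(180, 360), chunks=(45, 90), dtype="f8")
has_data = root.zeros("has_data", shape=(180, 360), chunks=(45, 90), dtype="bool")

def eval_with_cache(request, evaluate_source):
    """Serve `request` (a tuple of slices in the source's coordinate system)."""
    # 1. Use the has_data field to see which requested points are missing.
    missing = ~has_data[request]
    if missing.any():
        # 2. Evaluate the source over the request (a real implementation would
        #    subset this to only the missing query_coords) ...
        fresh = evaluate_source(request)
        # 3. ... fill the cache, and mark those points as available.
        data[request] = np.where(missing, fresh, data[request])
        has_data[request] = True
    # 4. Return the data at the requested coordinates, now entirely from cache.
    return data[request]
```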
I'm not sure if this takes into account:
I think you guys have discussed and decided this at length, but I'm still hung up on the `CachingNode` in general, so I wanted to comment to get this out there.

My inclination is to distinguish between things that affect the result and things that affect the execution. Things that affect the result get to be nodes, and things that only affect the execution should not. The same pipeline should be able to be evaluated with caching turned on or off at any particular node(s), multithreaded or single-threaded, local or cloud, etc. A `CachingNode` is the opposite in that it has no effect on the result, and if I share a pipeline with you but you don't want the same caching, you have to edit the pipeline itself. And now you have two nodes in the pipeline that produce the same (probably intermediate) result, and other nodes that refer to that node as a source all need to make sure that they use the cache node and not the source node.
Hmm, and some data sources can get updated, so the cache should expire. That's easy enough to implement, but if there is a moment when the cache is stale and some nodes in the pipeline are pointing to the cached source while others are pointing to the uncached source, it could get confusing. Maybe the solution here is just to disallow `CacheNode` in the pipeline, and have cache controls specified as attributes in the pipeline that are handled specially to add the `CacheNode` (via that `.cache()` method)?
@mpu-creare I'd be curious to hear about the most important users and use-cases that make this tradeoff worthwhile. Are the JSON pipelines and AWS execution not particularly important at this point?
@CFoye-Creare How is the runtime for accessing the `has_data` flag vs. just accessing the data itself? I'm curious if it's better to just pull data from the `ZarrCache` and get NaN wherever it doesn't have cached data, and then evaluate the node wherever the data is NaN. One way to implement that would be for a `CachingNode` to actually just be an `OrderedCompositor` where the first source node is a `ZarrRaw` and the second is the original data source, with the only difference being that you write back to the `ZarrRaw` at the end. Does that make sense? That seems very easy to implement. Is that similar to what you sent me in Teams, except you proposed that the `CachingNode` *has* an `OrderedCompositor` instead of *is* an `OrderedCompositor`?
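If I'm reading that suggestion right, it would look roughly like the sketch below. The import path, the `sources` ordering, the `eval` signature, and the `_write_back_to_zarr` helper are all assumptions for illustration, not working PODPAC code.

```python
from podpac.compositor import OrderedCompositor  # import path assumed

class CachingNode(OrderedCompositor):
    """Sketch: the first source is the ZarrRaw cache, the second is the original source."""

    def eval(self, coordinates, **kwargs):
        # OrderedCompositor fills NaNs from the first source (the cache) with
        # values from the second source (the original data), so cached points
        # never hit the source node.
        output = super().eval(coordinates, **kwargs)
        # The only difference from a plain OrderedCompositor: write the
        # composited result back into the ZarrRaw cache at the end.
        self._write_back_to_zarr(self.sources[0], output, coordinates)  # hypothetical helper
        return output
```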
@jmilloy I do like your inclination. Separating aspects that change the value versus the evaluation method makes sense philosophically. Practically, I can't think of an implementation that I like...

The use-case driving the current development is the following: starting from a clean server, add JSON-defined pipelines so that you can re-create our down-scaling algorithms -- that whole system. So, zero Python code.

The most straightforward way of doing that is including the `CacheNode` as part of the pipeline definition -- the execution mode. The reason is that caching is crucial for this application in order to get good runtime speeds.
> if I share a pipeline with you but you don't want the same caching, you have to edit the pipeline itself. ... Are the JSON pipelines and AWS execution not particularly important at this point?
On the first point, my thinking is that the Pipeline creator knows where to add caching to optimize performance (for their system, or whatever system they're running on AWS). The original creator probably has the best idea of how that should happen, so sharing that information is helpful! But if the other user has a completely different setup, you are absolutely right: they'd have to edit the pipeline, even to get it to run! My thinking there is: we can add tools to automatically strip out nodes that affect execution, if the need ever arises.
> if there is a moment when the cache is stale and some nodes in the pipeline are pointing to the cached source while others are pointing to the uncached source, it could get confusing
Absolutely agreed. I've run into this type of nightmarish situation while debugging with the old caching system. I'd argue the new system is superior, because at least the cache points are explicit. This problem also motivates why you can globally disable the cache via `podpac.settings`.
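For reference, globally disabling the cache looks something like the snippet below; the exact settings keys are assumptions based on podpac's settings dictionary and may differ from what this PR ends up using.

```python
import podpac

# Turn off caching globally for this session. The key names below
# ("DEFAULT_CACHE", "CACHE_NODE_OUTPUT_DEFAULT") are assumptions and may not
# match the keys used after this PR.
podpac.settings["DEFAULT_CACHE"] = []                 # no cache stores
podpac.settings["CACHE_NODE_OUTPUT_DEFAULT"] = False  # don't cache node outputs
```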
This Pull Request introduces a new `ZarrCache` Node that serves as a caching layer for a data source Node using Zarr archives. The Zarr archives are laid out in the coordinate system given by the source Node's `coordinates` attribute.

This Node introduces several important attributes and methods:
- `source`: The source data Node.
- `zarr_path_data`, `zarr_path_bool`: The paths to the Zarr archives for storing data and boolean availability indicators.
- `group_data`, `group_bool`: The Zarr groups for storing data and boolean availability indicators.
- `selector`: Selector for selecting coordinates from the source Node.
- `_z_node`, `_z_bool`: Zarr nodes for data and boolean availability indicators.
- `_from_cache`: A flag indicating whether the last request for data was served from the cache.
- `_validate_request_coords`: Validates that requested coordinates are within the source's coordinate bounds.
- `_create_slices`: Creates slices for the requested coordinates.
- `clear_cache`: Clears the Zarr cache (currently a placeholder for future implementation).
- `get_source_data`: Fetches data from the source Node.
- `fill_zarr`: Fills the Zarr cache with data at specified coordinates.
- `subselect_has`: Determines which coordinates in a request are not present in the cache.
- `eval`: Evaluates the data at the requested coordinates, fetching from the source and caching if necessary (a rough sketch of this flow is below).

I still need to settle how cache clearing should work, e.g. `rem_cache` or `clear_cache`, etc. I've thought about adding a method to the `Zarr` datasource node that sets all data to default/fill values (this makes a chunk empty in Zarr). Let me know what you think about this.

I've also considered a common interface that `HashCache` and `ZarrCache` both implement/inherit. This might make it cleaner to implement new Caches, and might make the logic in `Node.cache()` cleaner. The problem is I don't know if it is worth it. They only seem to share three things: `.eval()`, `.rem_cache()`, and `._from_cache`.
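For illustration, here is a rough sketch of how the `eval` flow described above fits together. Only the method and attribute names taken from the list (`subselect_has`, `get_source_data`, `fill_zarr`, `_from_cache`) come from this PR; the structure and the `_read_zarr` helper are assumptions, not the actual implementation.

```python
class ZarrCacheSketch:
    """Simplified sketch of the ZarrCache eval flow (not the real implementation)."""

    def eval(self, coordinates, **kwargs):
        # 1. Work out which of the requested coordinates are missing from the cache.
        missing_coords = self.subselect_has(coordinates)  # None if fully cached

        if missing_coords is not None:
            # 2. Fetch only the missing data from the source Node and write it
            #    into the Zarr archives (both the data and the has_data flags).
            source_data = self.get_source_data(missing_coords)
            self.fill_zarr(source_data, missing_coords)
            self._from_cache = False
        else:
            self._from_cache = True

        # 3. Serve the full request from the (now complete) Zarr cache.
        return self._read_zarr(coordinates)  # hypothetical read helper
```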
I'll take another look tomorrow and then merge. I just want to confirm that "property_cache_control" thing from earlier.
This PR changes the PODPAC caching system to use the `CachingNode` for all caching. The `CachingNode` is a new node in PODPAC that caches the output of another node. This change should simplify the caching interface.

The `CachingNode` can be used to cache the output of any PODPAC node. To use it, simply wrap the node you want to cache with a `CachingNode`. The `CachingNode` will automatically cache the output of the wrapped node and return the cached output if available.

Use the new `.cache()` method for all Nodes, or wrap the node directly, as sketched below:
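A minimal usage sketch follows. The `.cache()` method and wrapping with a `CachingNode` are what this PR describes, but the example source node, the import path, and the exact signatures are assumptions for illustration.

```python
import podpac

# A hypothetical existing PODPAC data source node.
node = podpac.data.Rasterio(source="s3://example-bucket/elevation.tif")

# New in this PR: request caching via the .cache() method ...
cached = node.cache()

# ... or more directly, by wrapping the node in a CachingNode
# (import path assumed for illustration).
from podpac.core.cache import CachingNode
cached = CachingNode(source=node)

# Evaluating the cached node returns cached output where available and falls
# back to evaluating the wrapped source node otherwise.
coords = podpac.Coordinates(
    [podpac.clinspace(45, 40, 100), podpac.clinspace(-75, -70, 100)],
    dims=["lat", "lon"],
)
output = cached.eval(coords)
```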