CFoye-Creare commented 1 year ago

This PR changes the PODPAC caching system to use the CachingNode for all caching. The CachingNode is a new node in PODPAC that caches the output of another node. This change should simplify the caching interface.

The CachingNode can be used by users to cache the output of any PODPAC node. To use it, simply wrap the node you want to cache with a CachingNode. The CachingNode will automatically cache the output of the wrapped node and return the cached output if available.

Use of the new .cache() method for all Nodes:

Or more directly:

# Create an instance of the data source behind a CachingNode
node = MyDataSource(lat=0, lon=0).cache()

[.]: node
[.]: <MyDataSource()>

[.]: node.cache()
[.]: <CachingNode()>

CFoye-Creare commented 1 year ago

General Idea

So far, we have changed caching to only work when the node is in fact a "CachingNode." Because we have a CachingNode, I want to try and re-do our caching system to rebuild the source of the node which CachingNode is caching. Previously, it wasn't clear in which Coordinate System we were caching. Now, with CachingNode, we can use the coordinate system of the source node. So, the idea is to create an on-disk or in-ram Zarr file in the source's coordinate system. We can have two fields:

"Has Data": a boolean field which keeps track of which data has been cached
"Data": the actual cached data

The naïve data-fetching algorithm would be:

Check cached node using the requested coordinates
Sub-select from requested coordinates using the "has_data" field to create query_coords
Get data from server using query_coords
Fill in data into cached data
Set "has_data" field
Query cached data using requested coordinates
Return data

I'm not sure if this takes into account:

Non-temporal data?
How much data are we storing? The entire globe for every Zarr File?
Doesn't take into account expiration

jmilloy commented 1 year ago

I think you guys have discussed and decided this at length, but I'm still hung up on the CachingNode in general, so I wanted to comment to get this out there.

My inclination is to distinguish between things that effect the result and things that effect the execution. Things that effect the result get to be nodes, and things that only effect the execution should not. The same pipeline should be able to be evaluated with caching turned on or off at any particular node(s), multithreaded or single-threaded, local or cloud, etc. A CachingNode is the opposite in that it has no effect on the result, and if I share a pipeline with you but you don't want the same caching, you have to edit the pipeline itself. And now you have two nodes in the pipeline that produce the same (probably intermediate) result, and other nodes that refer to that node as a source need to all make sure that they use the cache node and not the source node.

Hmm, and some data sources can get updated, so the cache should expire. That's easy enough to implement, but if there is a moment when the cache is stale, but if some nodes in the pipeline are pointing to the cached source and some are pointing to the uncached source, it could get confusing. Maybe the solution here is just to disallow CacheNode in the pipeline, and have cache controls specified as attributes in the pipeline that are handled to specially add the CacheNode in (via that .cache() method)?

@mpu-creare I'd be curious to hear about the direction that the most important users and use-cases that make this tradeoff worthwhile. Are the JSON pipelines and AWS execution not particularly important at this point?

jmilloy commented 1 year ago

@CFoye-Creare How is the runtime for accessing the has_data flag vs just accessing the data itself? I'm curious if it's better to just pull data from the ZarrCache and get NaN wherever it doesn't have cached data, and then evaluate the node wherever the data is NaN. One way to implement that would be that a CachingNode is actually just a OrderedCompositor where the first source node is a ZarrRaw and the second is the original data source, with the only difference that you write back to the ZarrRaw at the end. Does that make sense? That seems very easy to implement. Is that similar to what you sent me in teams, except you proposed that the CachingNode has an OrderedCompositor instead of is an OrderedCompositor?

mpu-creare commented 1 year ago

@jmilloy I do like your inclination. Separating aspects that change the value versus the evaluation method makes sense philosophically. Practically, I can't think of an implementation that I like...

The use-case that's driving the current development is the following: Starting from a clean server, add JSON-defined pipelines so that you can re-create our down-scaling algorithms -- that whole system. So, zero Python code.

The most straight-forward way of doing that is including the CacheNode as part of the pipeline definition -- the execution mode -- the reason for this is that the caching is crucial for the application, in order to get good runtime speeds.

if I share a pipeline with you but you don't want the same caching, you have to edit the pipeline itself. ... Are the JSON pipelines and AWS execution not particularly important at this point?

On the first point, my thinking is that the Pipeline creator knows where to add caching to optimize the performance (for their system, or whatever system they're running on AWS). The original creator probably has the best idea of how that should happen, so sharing that information is helpful! But if the other user has a completely different setup, you are absolutely right. They'd have to edit the pipeline, even to get it to run! My thinking there is: we can add tools to automatically strip out nodes that affect execution, if the need every arises.

if there is a moment when the cache is stale, but if some nodes in the pipeline are pointing to the cached source and some are pointing to the uncached source, it could get confusing

Absolutely agreed. I've run into this type of nightmarish situation while debugging with the old caching system. I'd argue the new system is superior, because at least the cache points are explicit. This problem also motivates the reason why you can globally disable the cache via the podpac.settings.

CFoye-Creare commented 1 year ago

ZarrCache Node for PODPAC

This Pull Request introduces a new ZarrCache Node that serves as a caching layer for a data source Node using Zarr archives.

Key Features

Caching with Zarr: This Node allows users to cache their data in a Zarr archive, a high-performance data storage format. It maintains two separate Zarr archives: one for the actual data and one for boolean indicators specifying if the data at a certain coordinate is available in the cache.
Selective Caching: This Node only requests data from the source Node if the data at those coordinates is not already present in the Zarr cache, reducing unnecessary data requests and saving computational resources.
Automatic Zarr Group Management: The Node creates and manages Zarr groups corresponding to the data and the boolean availability indicator. It ensures these groups exist before creating Zarr nodes for data handling.
Flexible Source Node Support: This Node can be used with any PODPAC Node that serves as a data source and has a coordinates attribute.

Detailed Explanation

This Node introduces several important attributes and methods:

Attributes

source: The source data Node.
zarr_path_data, zarr_path_bool: The paths to the Zarr archives for storing data and boolean availability indicators.
group_data, group_bool: The Zarr groups for storing data and boolean availability indicators.
selector: Selector for selecting coordinates from the source Node.
_z_node, _z_bool: Zarr nodes for data and boolean availability indicators.
_from_cache: A flag indicating if the last request for data was served from the cache.

Methods

_validate_request_coords: Validates that requested coordinates are within the source's coordinate bounds.
_create_slices: Creates slices for the requested coordinates.
clear_cache: Clears the Zarr cache. (currently a placeholder for future implementation)
get_source_data: Fetches data from the source Node.
fill_zarr: Fills the Zarr cache with data at specified coordinates.
subselect_has: Determines which coordinates in a request are not present in the cache.
eval: Evaluates the data at requested coordinates, fetching from source and caching if necessary.

What needs review

We need to come up with a good implementation of rem_cache or clear_cache, etc. I've thought about adding a method to the Zarr datasource node that sets all data to default/fill values (This makes a chunk empty in zarr). Let me know what you think about this.
I'm wondering if we shouldn't define a new interface/CacheNode which the HashCache and ZarrCache both implement/inherit. This might make it cleaner to implement new Caches, and might make the logic in Node.cache() cleaner. The problem is I don't know if it is worth it. They only seem to share three things in common: .eval(), .rem_cache() and ._from_cache
I'm not sure if the unit tests are robust enough.
I'm not sure if the node takes advantage of enough the features of zarr.
I don't know if I need two groups, I probably just need 2 arrays. But I remember this being easier due to the setup of the Zarr datsource node? I made that decision a couple weeks ago and didn't document why. But I remember 1-group-2-arrays solution not being as clean/working as well.

mpu-creare commented 1 year ago

I'll take another look tomorrow and then merge. I just want to confirm that "property_cache_control" think from earlier.

creare-com / podpac

Caching using a Node #507

General Idea

ZarrCache Node for PODPAC

Key Features

Detailed Explanation

Attributes

Methods

What needs review