Explore S3 caching options

We've talked about a use case where archives decide to keep datasets internally, but put up an S3 API facade for remote access with AXS. E.g., imagine the data is physically at IPAC and MAST, but is being analyzed at TACC. The question then is whether accesses to the datasets can be transparently cached where AXS is running, for faster repeated access.

Option 1: Spark seems to have recently added support for caching remote datasets through Delta cache. It's not clear to me whether this is broadly available or a Databricks-only feature. This should be the first thing to investigate.

Option 2: Another way to do this may be to have AXS access the files through a caching layer. I looked at S3 caching options and found there are many. Examples:

https://github.com/gaul/s3proxy
https://github.com/s3fs-fuse/s3fs-fuse

(and see the list of more projects at the bottom of s3fs-fuse README).
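
To make the Option 2 idea concrete, here is a minimal sketch of what a read-through caching layer does: on the first access an object is fetched from the remote store and written to local disk, and repeated accesses are served from the local copy. This is not how s3proxy or s3fs-fuse are implemented; the `ReadThroughCache` class and the `fake_remote_fetch` function are hypothetical names made up for illustration, and the sketch assumes datasets are immutable, so it never invalidates anything.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


class ReadThroughCache:
    """Hypothetical read-through cache: remote objects are stored on local disk."""

    def __init__(self, cache_dir: str, fetch: Callable[[str, str], bytes]):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        # fetch(bucket, key) -> bytes; in a real setup this could wrap an
        # S3 client call (an assumption, not something AXS does today).
        self.fetch = fetch
        self.hits = 0
        self.misses = 0

    def _path(self, bucket: str, key: str) -> Path:
        # Hash bucket/key so arbitrary object names map to a flat, safe filename.
        digest = hashlib.sha256(f"{bucket}/{key}".encode()).hexdigest()
        return self.cache_dir / digest

    def get(self, bucket: str, key: str) -> bytes:
        path = self._path(bucket, key)
        if path.exists():
            self.hits += 1
            return path.read_bytes()
        self.misses += 1
        data = self.fetch(bucket, key)
        # Write to a temporary file, then rename, so a crash mid-write
        # does not leave a truncated entry that looks like a valid one.
        tmp = path.with_suffix(".tmp")
        tmp.write_bytes(data)
        tmp.replace(path)
        return data


def fake_remote_fetch(bucket: str, key: str) -> bytes:
    # Stand-in for a slow remote read (e.g. from an archive's S3 facade).
    return f"contents of s3://{bucket}/{key}".encode()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as cache_dir:
        cache = ReadThroughCache(cache_dir, fake_remote_fetch)
        cache.get("archive-bucket", "catalog/part-0000.parquet")  # goes remote
        cache.get("archive-bucket", "catalog/part-0000.parquet")  # served locally
        print(cache.hits, cache.misses)
```

The open question for AXS is whether one of the existing projects can play this role transparently, so that Spark only sees an S3 endpoint or a local path and no AXS code has to change.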

Opening this issue so we don't forget about this use case.