astronomy-commons / axs

Astronomy eXtensions for Spark: Fast, Scalable, Analytics of Billion+ row catalogs
https://axs.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Explore S3 caching options #15

Open mjuric opened 4 years ago

We've talked about a use case where archives decide to keep datasets internally but put up an S3 API facade for remote access with AXS. E.g., imagine the data is physically at IPAC and MAST but is being analyzed at TACC. The question then is whether accesses to the datasets can be transparently cached where AXS is running, for faster repeated access.

Option 1: Spark seems to have recently added support for caching remote datasets through Delta cache. It's not clear to me whether this is broadly available or a Databricks-only thing. This should be the first thing to investigate.

Option 2: Another way to do this may be to have AXS access the files through a caching layer. I looked at S3 caching options, and found there are many. Example:

(and see the list of more projects at the bottom of the s3fs-fuse README).
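
To make the idea behind Option 2 concrete, here is a minimal sketch of a read-through cache in plain Python. It is not AXS code and not the API of any of the projects mentioned above: the `ReadThroughCache` class, its `fetch` callback, and the example URL are all hypothetical, and the sketch ignores eviction and invalidation (it assumes the remote files never change).

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


class ReadThroughCache:
    """Hypothetical read-through cache: fetch a remote object once,
    then serve every later request from local disk."""

    def __init__(self, cache_dir: Path, fetch: Callable[[str, Path], None]):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.fetch = fetch  # assumed callback: downloads `url` into the given path
        self.hits = 0
        self.misses = 0

    def local_path(self, url: str) -> Path:
        # Hash the URL so any remote key maps to a safe local file name.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.cache_dir / digest

    def get(self, url: str) -> Path:
        target = self.local_path(url)
        if target.exists():
            self.hits += 1
            return target
        self.misses += 1
        # Download to a temporary name first, then rename, so a partially
        # downloaded file is never mistaken for a cached one.
        tmp = target.with_suffix(".part")
        self.fetch(url, tmp)
        tmp.replace(target)
        return target


if __name__ == "__main__":
    def fake_fetch(url: str, dest: Path) -> None:
        # Stand-in for a real S3 download; writes placeholder bytes.
        dest.write_bytes(b"placeholder for " + url.encode("utf-8"))

    with tempfile.TemporaryDirectory() as d:
        cache = ReadThroughCache(Path(d), fake_fetch)
        cache.get("s3://example-bucket/catalog/part-0.parquet")  # remote fetch
        cache.get("s3://example-bucket/catalog/part-0.parquet")  # served locally
        print(cache.hits, cache.misses)
```

In a real deployment the `fetch` callback would be replaced by an actual S3 client call, and the cache would sit between AXS and the remote endpoint, which is roughly the role the caching projects listed above would play.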

Opening this issue so we don't forget about this use case.

(@dennyglee, @zecevicp, any thoughts/ideas/comments?)