In issue #12 we discussed the prospect of caching grids in an efficient binary format such as .npy files. However, that issue implied that the user would do the caching on a per-subset basis. Here, instead, we propose one-time storage of an entire grid in an efficient format.
Open Questions: Which storage format do we adopt (.npy, .h5, .parquet, .feather, etc.)? Where do we host it? Strong preference for Zenodo.
Requirements:
The format must support storing metadata alongside the data (teff, logg, Z, fsed, etc.)
Should be binary (compact storage, fast access) and support efficient columnar data access
The format should be stable and have longevity
Pros:
Users will only have to download once
Significantly faster I/O
Storage impact is significantly lower
Easier for gollum developers to support cross-platform data downloads
Cons:
Must be managed by gollum maintainers instead of by grid creators/users
Risk of divergence between the native primary-source grids and our secondary copies, which requires QA
Use case workflow:
User installs gollum, says "I want to work with Sonora Diamondback"
User runs SonoraDiamondbackSpectrum.download_grid(location=)
A highly compressed archive of the grid is hosted online (e.g. on Zenodo)
The function downloads the archive into the directory given by location
Now when the user asks for a Sonora spectrum, all we do is pick out the columns (grid points) of our dataframe that we want and load them in; this is very fast because the data layout is optimized for exactly this operation