BrownDwarf / gollum

A microservice for programmatic access to precomputed synthetic spectral model grids in astronomy
https://gollum-astro.readthedocs.io/
MIT License

add an option to save and load the processed grid #12

Open zjzhang42 opened 3 years ago

zjzhang42 commented 3 years ago

We should add a save option to store a processed grid and a load option to read it back in, so grids don't have to be reprocessed every time.

gully commented 2 years ago

@SujayShankarUT and I just looked into this prospect for faster loading. Here are some quick numbers to help gauge feasibility:

For 854 PHOENIX grid points:

- Reading the grid into `PHOENIXGrid`: 55.6 seconds.
- Creating and storing the 2D flux array to a `.npy` binary file: 7 seconds.
- Re-reading that "cached" 2D flux array: a mere 1 second.
- Size of the resulting binary file: about 2.6 GB.
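For concreteness, here is a minimal sketch of the caching pattern we timed above; the cache filename and the `PHOENIXGrid` constructor keywords are illustrative placeholders, not a finalized API:

```python
import numpy as np
from gollum.phoenix import PHOENIXGrid

# Slow path (~55 s for the 854 grid points above): process the raw PHOENIX files.
# The constructor keywords are illustrative; check the gollum docs for the real signature.
grid = PHOENIXGrid(teff_range=(2300, 3200), logg_range=(4.0, 5.5), metallicity_range=(0, 0))

# PHOENIXGrid extends specutils' SpectrumCollection, so grid.flux is already a
# 2D (n_models, n_wavelengths) Quantity; cache its values once (~7 s to write).
np.save("phoenix_flux_cache.npy", grid.flux.value)

# Fast path (~1 s): reload the cached 2D flux array instead of reprocessing everything.
flux_2d = np.load("phoenix_flux_cache.npy")
```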

So caching the raw flux values in this way could hypothetically achieve a $50\times$ speedup. However, some of those 50-odd seconds are unavoidable overhead of creating the Python class, so it's not yet clear to what extent this process is bottlenecked by I/O versus inefficiencies in specutils.

Sujay and I were working on quantifying the overhead of a "Passthrough" to SpectrumCollection. We got foiled by some unexpected specutils behavior that we are now investigating. To be continued...
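Roughly, the passthrough idea is to rebuild a `SpectrumCollection` directly from cached arrays instead of re-reading every FITS file. A minimal sketch, assuming hypothetical cache filenames and flux units:

```python
import astropy.units as u
import numpy as np
from specutils import SpectrumCollection

# Hypothetical cache files written alongside the flux cache sketched above.
flux_2d = np.load("phoenix_flux_cache.npy")           # shape (n_models, n_wavelengths)
wavelength = np.load("phoenix_wavelength_cache.npy")  # shape (n_wavelengths,)

# SpectrumCollection expects flux and spectral_axis with matching shapes,
# so repeat the shared wavelength axis across all models.
spectral_axis = np.tile(wavelength, (flux_2d.shape[0], 1)) * u.AA

collection = SpectrumCollection(
    flux=flux_2d * u.erg / u.s / u.cm**2 / u.AA,  # assumed flux-density unit
    spectral_axis=spectral_axis,
)
# The per-model grid metadata (Teff, logg, metallicity) would still need to be
# carried along separately to rebuild a full PHOENIXGrid.
```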

gully commented 2 years ago

Update: We successfully got the passthrough approach working. "Round-tripping" from the 2D numpy flux array back into a PHOENIXGrid object took 7 seconds. The implementation is incomplete (it still lacks metadata), but assuming that small addition takes negligible time, the hypothetical best-case load time is:

$1\,\mathrm{s} + 7\,\mathrm{s} = 8\,\mathrm{s}$

for a speedup of roughly $55 / 8 \approx 7\times$.

So that's a significant speedup. However, it bloats disk usage: you end up storing both the original files and an extra binary that could be ~5 GB, depending on your grid volume and wavelength range of interest. So there's a tradeoff among load time, disk space, and the mental overhead of remembering which "mini grids" or "grid sub-sections" you have saved and where. Some of that mental overhead could be alleviated with clever caching (e.g. memoization) on the gollum side, but that code would have its own limitations and ways to break... More to be considered.
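One way the gollum-side caching could look is a memoization layer keyed on the grid selection parameters. Everything below (cache directory, function names) is a hypothetical sketch, not existing gollum code:

```python
import hashlib
import json
from pathlib import Path

import numpy as np

CACHE_DIR = Path("~/.gollum_cache").expanduser()  # hypothetical cache location


def cache_key(**grid_params):
    """Hash the grid selection (e.g. teff/logg/metallicity ranges, wavelength limits)
    into a short stable string, so each "mini grid" maps to exactly one cache file."""
    blob = json.dumps(grid_params, sort_keys=True).encode()
    return hashlib.sha1(blob).hexdigest()[:12]


def load_or_build_flux(build_fn, **grid_params):
    """Return the cached 2D flux array if it exists, otherwise build and store it."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    path = CACHE_DIR / f"flux_{cache_key(**grid_params)}.npy"
    if path.exists():
        return np.load(path)           # fast path: ~1 s in the timings above
    flux_2d = build_fn(**grid_params)  # slow path: full grid processing
    np.save(path, flux_2d)
    return flux_2d
```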

gully commented 2 years ago

The collection of FITS files for zero metallicity is 5.29 GB, while the numpy binary of just the 2D flux array stores as 1.78 GB (though it is still missing some minor metadata...).

So we could hypothetically compress the FITS files by something like $3\times$: 120 GB of PHOENIX would "only" be about 40 GB as a numpy save file.

gully commented 2 years ago

@SujayShankarUT I think the speedup and data compression gains will be much greater for the Sonora grid. Currently the Sonora grid is stored as text files (not FITS), and each file also repeats the same wavelength coordinate. The whole metallicity grid (excluding C/O ratio) is only 11 GB of text files, so I think we could get it down to a single binary of a few GB.
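A rough sketch of the kind of one-off conversion this would take, assuming a simple two-column text layout for the Sonora files (the real headers, paths, and filenames will differ):

```python
from pathlib import Path

import numpy as np

# Assumed layout: one whitespace-delimited text file per Sonora model, each
# repeating the same wavelength column next to its flux column (adjust
# skiprows/column selection for the real file headers).
sonora_dir = Path("sonora_models")
model_files = sorted(sonora_dir.glob("*.txt"))

wavelength, fluxes = None, []
for path in model_files:
    wl, flux = np.loadtxt(path, unpack=True)
    if wavelength is None:
        wavelength = wl          # keep the shared wavelength axis only once
    fluxes.append(flux)

# One compressed archive instead of ~11 GB of redundant text files.
np.savez_compressed(
    "sonora_grid.npz",
    wavelength=wavelength,
    flux=np.vstack(fluxes),
    model_names=np.array([p.stem for p in model_files]),
)
```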