Starfish-develop / Starfish

Tools for Flexible Spectroscopic Inference
https://starfish.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Spectra storage format specification #24

Open iancze opened 8 years ago

iancze commented 8 years ago

I would like to use this as a discussion point for fleshing out a data format spec for use in Starfish. This also pertains to #21. Spurred on by a recent post heard via @gully, I'd like to devote some time to ironing out this spec sooner rather than later. First, the possible choices (feel free to suggest more):

FITS: While astropy is already a dependency of Starfish, I would like to avoid using the FITS format to store spectra for reading into the code. This is primarily because each observatory seems to follow a slightly different standard when storing its data. For many of my spectra, I have been unable to use astropy to read the data reliably and in the end resorted to using PyRAF to convert the FITS files to text.

ASCII: ASCII text files might be a good, simple solution that is easily understood by all and provides a low barrier for users ingesting spectra from new instruments. However, I do worry that future, larger datasets may make this format unwieldy. In particular, it isn't immediately clear to me how we should store echelle data in this format without requiring a separate file for each order.

NPY: Numpy save files. While this certainly would be easy from a Python standpoint, I would like to preserve some degree of cross-language compatibility in the data format, since there are many other routines in other languages that interact with spectra and it would be nice to allow them access. Also, preserving any metadata associated with the spectrum (e.g., barycentric correction) may be difficult.

HDF5: This format is my current preferred format. For the TRES spectra that I have been primarily dealing with, I have the good fortune that all orders contain the same number of pixels, and so I have been storing spectra as 2D arrays with shape (norders, npix). As @gully mentioned in #21, this is unlikely to be the case for all spectrographs. I think I would prefer to move to a ragged array format, whereby there is a separate /00, /01, /02, etc. group for each order, within which are contained 1D arrays of wl, fl, and sigma. The order names need not necessarily be 00, 01, 02; if there are only a few orders, they could also be something more descriptive like blue and red. HDF5 has the added benefit that we can continue to add in metadata as we see fit.
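To make the ragged-array idea concrete, here is a minimal sketch using h5py. The group names, dataset names, pixel counts, and the BCV attribute are all illustrative choices, not a fixed spec:

```python
import h5py
import numpy as np

# Hypothetical pixel counts per order -- unequal on purpose, to show that
# the per-order group layout needs no padding.
orders = {"00": 2048, "01": 2040}

with h5py.File("spectrum.hdf5", "w") as f:
    f.attrs["BCV"] = 15.2  # example metadata, e.g. barycentric correction (km/s)
    for name, npix in orders.items():
        grp = f.create_group(name)  # one group per order: /00, /01, ...
        grp.create_dataset("wl", data=np.linspace(5000.0, 5100.0, npix))
        grp.create_dataset("fl", data=np.ones(npix))
        grp.create_dataset("sigma", data=np.full(npix, 0.01))

# Reading back: each order keeps its own length.
with h5py.File("spectrum.hdf5", "r") as f:
    lengths = {name: len(f[name]["wl"]) for name in f}
```

The per-order groups also leave room to attach order-level metadata later, without disturbing the file-level attributes.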

Masking: on a related note, when should we apply masks to data? If a raw spectrum from the telescope uses NaN, 1e-99, or other sentinel values to denote bad pixels, should we be propagating these values all the way through the fitting routines? Is it possible to clean these values from the spectrum as we are putting the raw spectrum into one of the above-mentioned formats? My hope is that the user could clean these values in regions where it may be obvious (say, 200 NaNs in a row at the edge of an order), but also have a (possibly separate) mechanism by which the user can mask out real regions of the spectrum that cannot be reproduced by the model (e.g., telluric lines, cores of deep absorption lines). This second masking mechanism would likely be iterative, since it is more of a fitting task than a data task.
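The first (data-cleaning) pass could look something like the sketch below: convert sentinel values into a boolean mask stored alongside the spectrum instead of propagating them into the fit. The sentinel threshold and flux values here are made up for illustration:

```python
import numpy as np

# Toy flux array containing two common bad-pixel sentinels.
fl = np.array([1.0, np.nan, 0.98, 1e-99, 1.02])

# True = good pixel: finite and above a (hypothetical) sentinel threshold.
mask = np.isfinite(fl) & (fl > 1e-90)

# The cleaned spectrum drops the flagged pixels entirely...
fl_clean = fl[mask]
# ...or the mask can be saved as its own array so the raw data survive,
# which would also serve the second, iterative fitting-driven mask.
```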

My current thinking is to continue using HDF5 in the hope that we can make the file format straightforward and easy to maintain, but I am interested in hearing what all of the contributors think!

gully commented 8 years ago

In short, I think HDF5 is fine, and if we wanted to switch I would be inclined to go with some form of ASCII, either csv or JSON. The new ASDF format could be worth looking into.

Longer version:

So despite the critical post about HDF5, I have been satisfied with using HDF5 files in this project. The main friction for me was the assumption of equal-length spectral orders (Issue #21). The practical solution to the unequal spectral orders would be to move to the "ragged array format" you describe above. The other concern about HDF5 is something that @kgullikson88 first drew to my attention, and that was reiterated in the recent post above: HDF5 files can and do fail catastrophically. Kevin's workaround is to save incremental backups of all of his HDF5 files. A slight pain.

There's also the new ASDF file format, described here, but I am not sure how mature this format is, nor what the long-term adoption will be.

Despite its inefficiency, I'm often a fan of ASCII. There are so many good tools, like pandas and its best-in-class read_csv(). But you're right about multiple orders getting bogged down by separate files. A very long and thin CSV file with 5 columns (wavelength, flux, error, mask, order) and N_pix × N_ord rows could be a manageable, albeit inefficient, compromise. Computer time is getting cheaper, so the performance hit is not a big deal, and power users will probably rewrite the code to do what they want anyway.
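A quick sketch of that long-and-thin layout with pandas (column names and values are illustrative only); the order column is what lets unequal-length orders coexist in one file:

```python
import pandas as pd

# One row per pixel; the "order" column tags which echelle order it belongs to.
df = pd.DataFrame({
    "order":      [0, 0, 0, 1, 1],  # orders may differ in length
    "wavelength": [5000.0, 5000.1, 5000.2, 5100.0, 5100.1],
    "flux":       [1.00, 0.99, 1.01, 0.97, 1.02],
    "error":      [0.01] * 5,
    "mask":       [True, True, False, True, True],
})
df.to_csv("spectrum.csv", index=False)

# Recover per-order tables with a groupby on the round-tripped file.
orders = {k: g for k, g in pd.read_csv("spectrum.csv").groupby("order")}
```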

Another solution would be JSON. Starfish already uses this specification for the output of the star.py --generate spectrum. It could house many orders of unequal length. It's also a web standard, so we know it will be around for a long time. It can store metadata easily.
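A JSON layout along those lines might look like the sketch below; the key names (orders, wl, fl, sigma, BCV) are hypothetical, not an agreed spec:

```python
import json

# Nested dicts handle unequal-length orders and metadata naturally.
spectrum = {
    "BCV": 15.2,  # example metadata, e.g. barycentric correction (km/s)
    "orders": {
        "00": {"wl": [5000.0, 5000.1], "fl": [1.00, 0.99], "sigma": [0.01, 0.01]},
        "01": {"wl": [5100.0], "fl": [0.97], "sigma": [0.01]},
    },
}

# Round-trip through text: human-readable and readable from any language.
text = json.dumps(spectrum, indent=2)
loaded = json.loads(text)
```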

iancze commented 8 years ago

Thanks for the thoughtful feedback. You bring up a good point about already using JSON to store the sub-parameters for each order. One thing that I would love to address in future development is streamlining the package and tidying up dead-ends and quirks. A major component of this is a consistent and accessible storage choice so that users are not bogged down with a new format each time they explore a different corner of the codebase.

I chose JSON to store the order parameter values because of the features you mention. My single (but major!) pet peeve was that the default Python 3 JSON serializer failed (silently?) on numpy.int64 and numpy.float64 values (single ints and floats not in arrays), creating a confusing pitfall for the user. I recently discovered that astropy has a nice encoder class that appears to solve all of these problems. Since astropy is already a dependency of Starfish, I am much more eager to use JSON as a standard file format for spectra, sub-parameters, and whatever else may arise. With the exception of storing a large number of MCMC samples (we should probably stick with NPY or HDF5 there), I do not see any application for which JSON would be insufficient (though please alert me if you think of one!). I think I will start a section in the docs about file formats in the hope of consolidating some of these thoughts.
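For reference, the stdlib json module raises TypeError on a bare numpy.int64 scalar. The hand-rolled default hook below illustrates the same workaround idea as astropy's encoder class, without committing to astropy's exact API (the function and parameter names here are mine):

```python
import json
import numpy as np

def np_default(obj):
    """Fallback serializer for numpy scalars and arrays (illustrative)."""
    if isinstance(obj, np.integer):
        return int(obj)
    if isinstance(obj, np.floating):
        return float(obj)
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    raise TypeError(f"Object of type {type(obj)} is not JSON serializable")

# Hypothetical order sub-parameters of the kind the thread discusses.
params = {"order": np.int64(23), "logg": np.float64(4.29)}

# Without default=..., json.dumps would choke on the np.int64 value.
text = json.dumps(params, default=np_default)
```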

iancze commented 8 years ago

I drafted a format that uses JSON over at https://github.com/iancze/EchelleJSON

For a use case, check out test.py.