Project rationale, scope, and architecture brainstorm

gully commented 3 years ago

The problem to solve

You have just received data from the IGRINS spectrograph and you want to start science. You may get handed plp-reduced data from the observatory facility or from a collaborator, or self-reduced with plp. When you examine the data you may notice some remaining instrumental artifacts: incomplete telluric cancellation artifacts remain. These data artifacts hinder your science progress. Or maybe you want to apply a barycentric correction based on telescope pointing information in the FITS header, but you're not sure which FITS header columns to use when multiple are available. You may want to normalize the spectrum, or populate the spectrum into a pandas dataframe, or estimate a coarse radial velocity based on a template. All of these operations are relatively routine, but their application, order, and algorithm choice may depend on the science case, and therefore they cannot be built into a default pipeline.

Instead, IGRINS users develop bespoke methods for their science application of interest. This code development is relatively costly in time, as each user may "re-invent the wheel" for common operations that fellow IGRINS users have already solved. We wish to avoid the "re-inventing the wheel" phenomenon through encouraging code re-use, and a shared framework for IGRINS data analysis.

The solution and project scope

Here we propose a Python Package for plp-reduced IGRINS data. This package will introduce relevant Python containers that possess common attributes and routine methods. The inspirations for this project are principally lightkurve and specutils. Specifically, this project will encourage an API design, rather than a commandline/script-centric design. The scope of the project is limited to 1D echelle spectra, not the 2D raw frames, which are the domain of the plp. A key idea for the proposed package is that the plp essentially introduced a standard--- all IGRINS data have (roughly) the same attributes and data columns. This package will take advantage of that standardization, and can optionally accommodate data from previous plp version that disobey the current standard. The project will initially begin with local data only, with prospects in the future for querying and populating data from remote databases.

The key priorities of this package are:

Great documentation
Copious tutorials
Easy installation
Robust, uncontroversial, well-tested algorithms for routine operations
Community input and feedback (through GitHub Issues) and contributions (through GitHub Pull Requests)
(eventually) Programmatic interface to querying remote databases such as the Gemini API.

Project Architecture

I propose an architecture closely matching that of lightkurve, which has become an exemplar for ease-of-use within modern astronomical data analysis. The key leap of the imagination is to change from time-series analysis to spectral analysis. To first approximation that's merely by replacing time with wavelength. In practice, several echelle spectroscopy-specific concepts may make the mental mapping a bit subtle to architect. We also want to encourage the reuse of code from specutils.

Some architecture decisions to consider:

What are the routine operations users currently perform
What structure to adopt from lightkurve
What structure to adopt from specutils
What are the tradeoffs in choice of fundamental building blocks? (e.g. Are they based on dataframes, numpy arrays, astropy Tables, Spec1D instances? Something else?)
How granular/flexible to make the notions of objects---MultiOrderSpectrum? SingleOrderSpectrum? FlattenedSpectrum? TelluricCorrectedSpectrum? Or just a single "Spectrum" type object. Tradeoff in API complexity versus expressiveness?
How/where to limit the scope: what is a genuine research problem (out-of-scope), and what is well-agreed-upon operation (in-scope).

gully commented 3 years ago

Ok, I've made a minimal working IGRINS container based off of astropy's specutils, and added a trivial custom method .normalize() to show what I have in mind. Here is a simple example:

from muler.igrins import IGRINSSpectrum

# Read in the spectrum
file = 'data/SDCH_20201225_1234.spec_a0v.fits'
spec = IGRINSSpectrum(file=file, order=10)

# Normalize method as a demo
spec = spec.normalize()

We can add lots of methods to abstract-away the complexity of routine operations. To fuel the imagination, consider the prospect of vaporware methods like:

.mask_telluric_regions( )
.refine_wavelength_calibration(method='telluric')
.stitch_echelle_orders( )
.remove_outliers( )

The IGRINSSpectrum object is an instance of specutils' Spectrum1D, and the flux and wavelength attributes are astropy quantity objects, which are subclasses of familiar numpy arrays:

assert isinstance(spec, Spectrum1D)  # True
assert isinstance(spec.flux, np.ndarray)  # True

In the guts of IGRINSSpectrum, it manually deals with fits files for simplicity, but if specutils were chosen as the right building block, one would take the time to write a custom data loader. An open question is whether Spectrum1D is the strategic building block, instead of say pandas dataframes. For example, lightkurve originally just defined the time and flux numpy arrays themselves, but has since moved to build off of astropy TimeSeries, with a significant cost to the migration process. So we'd like to make the right choice from the beginning.

My hunch is that Spectrum1D objects are the right building block, since they're specifically designed for astronomical applications. I mostly see upsides to specutils, but it's worth mentioning the conceivable demerits:

Units-- Sometimes astropy units can get in the way, and weird gotchas pop-up when one is used to working with bare numpy arrays and dealing with the units themselves. I like units, but they can occasionally cause more harm than good.
Bugs and warnings-- So far specutils has been really verbose about warnings and various deprecations. Lots of warnings like this detract from the simplicity we're going for. Performance/robustness tradeoff-- When I divided flux by a scalar, it somehow involved scipy interpolate. There's no need to bring interpolation into the equation when broadcasting a numpy scalar to a vector. It was hard to trace down where this interpolation warning is coming from because there are multiple inheritance mixins here and there. I think the aim of this module is to make it easy to access data. If performance is needed we can make individual methods more performant, or offer keyword argument toggles, like method='precise'.

kfkaplan commented 3 years ago

It's good to see some work being done on code for IGRINS analysis. If any of it is useful, feel free to grab bits and pieces of my own python code for IGRINS that I've written over the years.

This is a small program to try to correct for differences in flexure and airmass between science targets (must be stars with some continuum) and A0V standard stars in the files the pipeline outputs: https://github.com/kfkaplan/IGRINS_A0V_corrector This might be useful to integrate into muler.

Probably less useful but much larger is my old "plotspec" code for processing 2D IGRINS data: https://github.com/kfkaplan/plotspec

gully commented 3 years ago

Thanks @kfkaplan, these community-based contributions are exactly what we're looking for, with enhancements to documentation and testing. I am going to try to get the documentation working today and for the short term, maybe it'd be useful to link to publicly available analysis packages in a section titled something like "community software".

gully commented 3 years ago

Woohoo, we have a working documentation page to illustrate the auto-documenting design of docstrings and methods.

https://muler.readthedocs.io/

gully commented 3 years ago

I am delighted to report that the early vision and architecture brainstorm laid out in this first Issue have now been realized! We now have a working, useful, and semi-stable package! :). It's still in development, but the brainstorming phase is largely complete. 🥳

OttoStruve / muler