Revamp Spectrum1D tabular-fits writer

astropy / specutils

An Astropy coordinated package for astronomical spectroscopy. Maintainers: @rosteen @keflavich @eteq

http://specutils.readthedocs.io/en/latest/

169 stars 127 forks source link

Revamp Spectrum1D tabular-fits writer #905

Open kelle opened 2 years ago

kelle commented 2 years ago

I'm having a lot of trouble figuring out how to write a Spectrum1D object to a file. A few of my questions:

What file types are supported? FITS, ASDF?
How do I also make sure the header data is also saved?

I'm happy to work on this once I understand it. :)

kelle commented 1 year ago

As a result of digging into this, I discovered that header data does not currently roundtrip using Spectrum1D.write. I propose that this be implemented using something like this:

# Make a dedicated HDU for the header and spectrum
hdu0 = fits.PrimaryHDU(header=header)
hdu1 = fits.BinTableHDU(data=NIR_table)
hdu1.header['EXTNAME'] = 'SPECTRUM'
hdu1.header.set('OBJECT', NIRSpec_dict['object_name'], 'Object Name')

# Combine the two HDUs into a multi-extension FITS (MEF)
spectrum_mef = fits.HDUList([hdu0, hdu1])  # hdu0 is header and hdu1 is data

# Write out the MEF to a file
spectrum_mef.writeto(fits_filename, output_verify="exception")

# Demonstrate how to read the file back in using specutils
spec1d_NIR = Spectrum1D.read(fits_filename, format='tabular-fits')
header_NIR = fits.getheader(fits_filename)

If folks agree, I would be happy to implement this given some guidance.

pllim commented 1 year ago

@dhomeier might also be interested in this issue since he is working on:

1009

kelle commented 1 year ago

Summarizing relevant discussions from Slack:

If someone modifies the spectrum, the keywords might become inaccurate. My proposal is to add a warning that says something like, "Header keywords are the responsibility of the user and are not guaranteed to be accurate." We can open issues and what not on specutils routines which modify header keywords and gradually make this less true. But I think that header round tripping is important enough to tolerate this shortcoming.
I am not sure how to implement this for Spectrum2D, 3D or SpectrumList. I am happy to just do it for Spectrum1D -- this would be immensely useful -- OR to figure out how to make it extendable to these other classes.

dhomeier commented 1 year ago

First I am not sure this issue really is a duplicate or, or completes #908 – the latter is primarily on better documentation of the existing options for writing spectra, while this one at least with the recent discussion is heading more into proposing new functionality.

I think the current "generic" read/write loaders (in fact the only ones implementing write functionality) have already been constructed with some round-tripping support in mind, though incomplete yet. E.g. #1009 tries to improve write and read-back of the mask and uncertainty attributes.

One important point here is that Spectrum1D has no header property as such; that is rather tied to the specific FITS file layout. The location to hold general information of that type is meta, which could be constructed directly from header information of a FITS HDU, or be a superset of the same, or could also contain objects that are very hard or impossible to serialise to a FITS header. But round-trip of all information that can be represented as a FITS header would certainly be a reasonable target. I think the main problems here are

Existing FITS loaders, if they save header information, variously do so by extracting the entire header directly into spectrum.meta or creating a spectrum.meta['header'] dictionary that contains only the FITS header information. The latter is more robust as other content could be added to meta later, but within that scheme the FITS writers would probably be restricted to writing only meta['header'] and no other metadata information (otherwise I foresee running into an infinite recursion...).
With the above convention there is still the question whether to load the header data from the HDU containing the spectrum, the primary HDU (which at least for tabular-fits will be different, but still may contain relevant information), or all of them; and then where to write them back.
Settling onto one of those schemes has already been discussed at various time, but I think with no clear conclusion yet. Also there has been some concern about always saving all header information, potentially creating inflated meta attributes with content most people might never look at.
As noted from the Slack discussions, many FITS keywords really are connected directly to the data format and representation and thus are already present in Spectrum1D e.g. as flux.dtype, flux.unit. Still storing them in meta in a way is redundant and may get out of sync if the data parts are modified. I think to some extent the header parsers already try to take care of that and are stripping cards like BUNIT (which would be overwritten when writing back to a HDU anyway).

A good starting point could perhaps be to review what info is already preserved in round-trip tests like https://github.com/astropy/specutils/blob/ca3153f05b8d3613bba0b865e273187e8ff176c4/specutils/tests/test_loaders.py#L608-L611

and what is missing that you would like to be saved. As sketched in your proposed implementation, getting header data back from tabular-fits would require bypassing the Table.read interface in the loader, as it provides to my knowledge no option for saving header data. wcs1d-fits would already do this, but then again is inconsistent with even its own writer, since it is using the meta['header'] scheme, while both writers are trying to write back all of meta with update_header=True set. Again I think there has been discussion why that would be problematic to change, and break existing behaviour.

kelle commented 1 year ago

We've tested the spectrum.meta.update method and it's pretty close! Problems include

extra END keywords. This makes many header keywords inaccessible by other tools. For example,
```
header=fits.getheader(tmpfile)
```
results in a header which just has the minimal keywords and not any of the things we added. (I assume this is because of the no-loneger-necessary END keyword)
COMMENT and HISTORY are stripped. I assume this is on purpose. Is there a way to keep them?

dhomeier commented 1 year ago

That's what you get with fits.getheader()? That would be an io.fits issue, but I cannot reproduce any stripping of COMMENT cards. Looking back at https://github.com/astropy/specutils/blob/ca3153f05b8d3613bba0b865e273187e8ff176c4/specutils/io/default_loaders/tabular_fits.py#L117-L127

for anything you put into a meta['header'] directory you should actually not even need the update_header option; but from either source it should keep everything that has types it can handle, except for NAXIS*. That's also something perhaps to be matched in wcs1d-fits, which by default only initialises from the WCS header.

kelle commented 1 year ago

In my case, the COMMENT and HISTORY cards are stripped because they are <class 'astropy.io.fits.header._HeaderCommentaryCards'>
Even though I use spec.meta.update(header), the meta doesn't roundtrip unless I use update_header = True. Not sure why. (This is not a big deal to me at the moment.)
The meta dictionary doesn't keep the comment that goes along with each keyword/value pair. A FITS header a list of three elements, not just two. The keyword, the value, AND the comment. So as soon as I do spec1d.meta.update(header), I've lost information.
The extra ENDs must come from io.fits or from Table. I've posted on Slack to track this one down.

I think the bottom line is that meta doesn't maintain the functionality that one expects of FITS headers.

We could continue to go down this road or maybe I should just write a custom reader/writer that is based on tabular-fits but expects HDU0 to have the header data.

kelle commented 1 year ago

Just to show y'all what I'm talking about, here's the degradation of the header information using the current tooling.

The header data I have compiled and added using spec1d.meta.update(header):

OBJECT  = 'VHS1256b'           / Name of observed object                        
RA      = '194.007637'         / [deg] Pointing position                        
DEC     = '-12.957692'         / [deg] Pointing position                        
TELESCOP= 'JWST    '           / name of telescope                              
INSTRUME= 'NIRSpec '           / name of instrument                             
SPECBAND= 'NIR     '           / SED.bandpass                                   
SPEC_VAL=                 2.25 / [um] Characteristic spec coord                 
SPEC_BW =                  4.5 / [um] Width of spectrum                         
TDMIN1  =                  0.0 / [um] Starting wavelength                       
TDMAX1  =                  4.5 / [um] Ending wavelength                         
DATE    = '2023-02-24'         / Date of file creation                          
COMMENT To read this file in with specutils use Spectrum1D.read() with format = 
COMMENT 'tabular-fits'                                                          
HISTORY Data Reduced by JWST pipeline version 1.7.2, CRDS version 11.16.12, CRDS
HISTORY  context jwst_0977.pmap

The header obtained via meta in the read back in file:

OrderedDict([('OBJECT', 'VHS1256b'),
             ('RA', '194.007637'),
             ('DEC', '-12.957692'),
             ('TELESCOP', 'JWST'),
             ('INSTRUME', 'NIRSpec'),
             ('SPECBAND', 'NIR'),
             ('SPEC_VAL', 2.25),
             ('SPEC_BW', 4.5),
             ('TDMIN1', 0.0),
             ('TDMAX1', 4.5),
             ('DATE', '2023-02-24')])

The header obtained using fits.getheader:

SIMPLE  =                    T / conforms to FITS standard                      
BITPIX  =                    8 / array data type                                
NAXIS   =                    0 / number of array dimensions                     
EXTEND  =                    T

And the header directly from the FITS file

SIMPLE  =                    T / conforms to FITS standard                      
BITPIX  =                    8 / array data type                                
NAXIS   =                    0 / number of array dimensions                     
EXTEND  =                    T                                                  
END                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             XTENSION= 'BINTABLE'           / binary table extension                         
BITPIX  =                    8 / array data type                                
NAXIS   =                    2 / number of array dimensions                     
NAXIS1  =                   24 / length of dimension 1                          
NAXIS2  =                   10 / length of dimension 2                          
PCOUNT  =                    0 / number of group parameters                     
GCOUNT  =                    1 / number of groups                               
TFIELDS =                    3 / number of table fields                         
TTYPE1  = 'wavelength'                                                          
TFORM1  = 'D       '                                                            
TUNIT1  = 'um      '                                                            
TTYPE2  = 'flux    '                                                            
TFORM2  = 'D       '                                                            
TUNIT2  = 'Jy      '                                                            
TTYPE3  = 'uncertainty'                                                         
TFORM3  = 'D       '                                                            
TUNIT3  = 'Jy      '                                                            
OBJECT  = 'VHS1256b'                                                            
RA      = '194.007637'                                                          
DEC     = '-12.957692'                                                          
TELESCOP= 'JWST    '                                                            
INSTRUME= 'NIRSpec '                                                            
SPECBAND= 'NIR     '                                                            
SPEC_VAL=                 2.25                                                  
SPEC_BW =                  4.5                                                  
TDMIN1  =                  0.0                                                  
TDMAX1  =                  4.5                                                  
DATE    = '2023-02-24'                                                          
END

dhomeier commented 1 year ago

In my case, the COMMENT and HISTORY cards are stripped because they are <class 'astropy.io.fits.header._HeaderCommentaryCards'>

Ah, I did not expect that would not qualify as subtype of str. But that's easily enough added to the filter.

Even though I use spec.meta.update(header), the meta doesn't roundtrip unless I use update_header = True. Not sure why. (This is not a big deal to me at the moment.)

That's exactly as expected, as meta.update(header) will just pull the cards directly into meta. To have it automatically saved, you need to assign spec.meta['header'] = header instead. This will then save only header of course. Then instead of having meta['OBJECT'] = 'VHS1256b' you'd have meta['header']['OBJECT'] = 'VHS1256b' and so on.

The meta dictionary doesn't keep the comment that goes along with each keyword/value pair. A FITS header a list of three elements, not just two. The keyword, the value, AND the comment. So as soon as I do spec1d.meta.update(header), I've lost information.

OK that's a more serious problem; I think proper treatment of comments has also been discussed a couple of times, but not sure if a good solution has come out of it.

I think the bottom line is that meta doesn't maintain the functionality that one expects of FITS headers.

Or, put it the other way, it has much wider functionality, so cannot be expected to conform to all FITS header conventions.

Thinking that over again, I am more and more leaning towards sticking to meta['header'] for a round-trip solution – i.e. use the default (and document it appropriately), that only that part of the meta information will be preserved through read-write operations, while other content should be treated as available for runtime only. That would leave it to the user to ensure only FITS header conforming cards go in there, although one might think of defining an actual meta.header or Spectrum1D.header attribute that does check this. A bit more work though...

The header obtained using fits.getheader:

That is just returning the PrimaryHDU header by default, so fits.getheader(1) or fits.getheader('SPECTRUM') should get you more complete content already.

kelle commented 1 year ago

Progress!

spec1d_NIR.meta['header'] = header works without update_header=True AND maintains the COMMENT and HISTORY cards. Woot!
fits.getheader(fits_filename_NIR, 1) also works! Unfortunately, SPECTRUM does not work because the extension is not named that.

This is definitely better, but I still think the meta['header'] should be maintained in hdu0 rather than the 1st extension. Is this an issue for table?

dhomeier commented 1 year ago

fits.getheader(fits_filename_NIR, 1) also works! Unfortunately, SPECTRUM does not work because the extension is not named that.

Sorry, got that confused with your piece of code. No, 'tabular-fits' does not name the extension yet (only just adding that to 'wcs1d', but there it's somewhat more important to identify the different HDUs for flux, mask etc.). Still should perhaps think of a good default name.

dhomeier commented 1 year ago

This is definitely better, but I still think the meta['header'] should be maintained in hdu0 rather than the 1st extension. Is this an issue for table?

See Slack comment; I think table cannot write there. More generally, if at some point we get to write several spectrum HDUs to a single table, header info specific to the individual files may better stay in the respective HDU header.

eteq commented 1 year ago

Don't want to interrupt the above rather promising thread, but I just wanted to make sure I understand the desires here: @kelle, your goal is to make sure the "really actually 1d" spectral use case round-trips, right? I think if we can get that to work with the updates @dhomeier is doing to wcs1d-fits that's great, and is definitely better than another writer.

Alternatively/additionally, is the underlying issue here really that it's not clear from the docs how we want someone to do this?

kelle commented 1 year ago

Yeah, I only care about the wavelength, flux, and uncertainty of one spectrum.

The docs are most certainly not clear regarding the vision for meta['header']

I'm also pretty set that I want the header in hdu0. Could also be in meta but I want a more traditional/familiar looking FITS file and header. The only route I see to do this is to make a new writer/reader.

dhomeier commented 1 year ago

The docs are most certainly not clear regarding the vision for meta['header']

I think the docs have never gone into that level of detail as to what is the recommended structure for metadata, although all the examples of loaders are following the meta['header'] convention (but then I think the way the generic_writer uses it is already inconsistent with that). The docs are certainly a bit lacking in terms of information how to use the existing loaders, and what functionality they have, compared to writing a new custom loader. And for that, having >90 % of https://specutils.readthedocs.io/en/stable/custom_loading.html#creating-a-custom-loader filled with a full source listing of jwst_loader at the moment does not appear very helpful to me.

https://specutils.readthedocs.io/en/stable/custom_loading.html#creating-a-custom-loader

I'm also pretty set that I want the header in hdu0. Could also be in meta but I want a more traditional/familiar looking FITS file and header. The only route I see to do this is to make a new writer/reader.

If it needs to output to a BINTABLE that's out of scope of wcs1d-fits, and I think needing to write something to the Primary HDU woul require a relatively significant extension to tabular-fits; not sure if that still fits in. Note that meta in my comments so far always refers to the Spectrum1D attribute, without any implication where and how that would be written to a file (but in FITS in practice there is little alternative to putting it into a header of some HDU).

kelle commented 1 year ago

@cshanahan1 and I are working on this at a hack day!

kelle commented 1 year ago

I'm working on writing some tests to demonstrated current and desired behavior.

sbailey commented 1 year ago

Documenting some thoughts as followup from the spectro workshop hackfest last week — IIUC @kelle identified that the problem with tabular-fits dropping the metadata when writing + reading was that the writer was putting the metadata in HDU 1 along with the bintable, while the reader was loading the metadata from HDU 0. There was discussion about whether this should be fixed on the writer or the reader side.

For the purpose of the existing "tabular-fits" format, I would like to advocate for keeping the metadata in HDU 1 along with the table, and fixing this on the reader side. 3 reasons in descending order of priority:

Preserves backwards compatibility with any files that people already have written and are using.
By bundling the metadata in the same HDU as the data to which it applies, this provides a natural extension of the format to support SpectrumList objects as well - for each Spectrum1D object in the SpectrumList, write that object to a new BinTableHDU with metadata+data together. These could be read back into a SpectrumList object, with each Spectrum1D constructed from different HDUs. In contrast, if tabular-fits switched to putting the metadata in HDU 0 and the data in HDU 1, this wouldn't work for writing multiple Spectrum1D objects to the same file since HDU 0 metadata would apply to only one of them.
Keeping metadata in the same HDU as the data makes these files slightly easier to work with outside of the context of specutils itself, e.g.
```
spec = Table.read('file-written-with-tabular-fits-format.fits')
spec.meta  # works!
plt.plot(spec['wavelength'], spec['flux'])
```
if the metadata was in HDU 0 you could still access it with additional steps, but it is handy to bundle it in the way that astropy Table.read() and fitsio.read(filename, extension, header=True) work.

To be clear:

I don't have strong opinions about any new formats that want to put the metadata in HDU 0, this is only about the existing tabular-fits format.
I agree that the metadata should survive the round trip of writing + reading.
In general I think that metadata that applies to everything in the file should go in HDU 0, while metadata that applies to individual extensions should go in the header of that extension. The case of a file with a single BinTableHDU is ambiguous, since "everything" is basically HDU 1 since there is no data in HDU 0.

kelle commented 1 year ago

I'm actively working on this right now. :)

A compromise solution is to put the metadata in both headers. That's actually easy enough.

dhomeier commented 1 year ago

That is not something under the control of the reader though, i.e. if the loader encounters an HDUList with non-minimal, but different headers in several HDUs, should it give preference to one, try to merge everything into one meta dictionary, or store them in individual meta['headerN'] dictionaries (if we decide to go with that structure). A good feature would certainly be to allow selecting the HDU via kwarg, as the proposed writer in #1105 is already supporting.

dhomeier commented 5 months ago

Documenting some thoughts as followup from the spectro workshop hackfest last week — IIUC @kelle identified that the problem with tabular-fits dropping the metadata when writing + reading was that the writer was putting the metadata in HDU 1 along with the bintable, while the reader was loading the metadata from HDU 0. There was discussion about whether this should be fixed on the writer or the reader side.

This is actually opposite of the way the current tabular-fits loader is doing it – the reader is loading metadata from HDU 1 (or whichever one is specified as data HDU). There might be some confusion as some table formats can be read with this loader, but by default are read by a specialised one like MUSCLES SED, which does read the header from the primary HDU. Bringing this up again as the proposal in #1113 is currently to make reading metadata (only) from HDU 0 the default for tabular-fits as well.