desihub / desiutil

General DESI utilities, shell scripts, desiInstall, etc.
BSD 3-Clause "New" or "Revised" License
3 stars 9 forks source link

utilities to add units to tables #191

Closed sbailey closed 1 year ago

sbailey commented 1 year ago

One of our standard VAC recommendations is to add units to table columns. It would be helpful if we had an easy tool to do this to help users do the right thing. Below are suggested interfaces to implement to streamline adding units to a FITS table post-facto.

function interface

def add_table_units(filename, extname, colunits, strict=True):
    """
    Add units to table in ``filename`` HDU ``extname``

    Args:
        filename (str): filename of FITS file to modify
        extname (str or int): HDU extension name or number
        colunits (dict): dict with key,value = colname,colunits to update

    Options:
        strict (bool): if True (default), require all columns to exists and units to be valid FITS-format units

    If ``strict`` is True, Raises ValueError if a ``colname`` does not appear in that table
    or if a ``colunits`` isn't a valid FITS unit.
    If an input results in a ValueError, the output file remains unchanged.

    Sets the BUNITnn keyswords in ``filename`` extension ``extname``, modifying the file in-place.
    Other extensions are unchanged.
    """

Or maybe split out strictunits vs. strictcolnames instead of using a single strict for both

command line

Wrap the above function in a command line script to post-facto update the units in a FITS file on disk:

update_table_units --input/-i FILENAME [--extname/-e EXTNAME] --units COLNAME1=UNITS1 COLNAME2=UNITS2 ...

--extname is optional only if the file contains a single table in HDU 1.

doing it right the first time

It would also be helpful if we had cookbook documentation about how to add units in the original code writing the tables when the source data is an astropy Table, a numpy structured array, or a Pandas dataframe. The spectro pipeline does this with desispec.io.util.write_bintable, which is not particularly simple and not ideal as a dependency for science codes writing tables if they don't otherwise have a dependency on desispec. Do we have simpler recipes to recommend?

Bonus points: comments

It would also be nice if the same utilities could easily add comments to table columns (like write_bintable) so that the files are more naturally self-documenting.

geordie666 commented 1 year ago

An additional idea: desitarget.io.write_with_units() automatically looks up the units for common column names from a yaml file stored in a data directory.

desitarget.io.write_with_units() has some frailties. It was only designed to work with fitsio and, consequently, some of the units stored in the yaml file aren't strictly FITS compliant. So, I certainly wouldn't copy it blindly.

But, I found the idea of having a database of units pretty slick (if I'd actually been smart enough to get the units correct!) because I didn't have to remember the unit for a common column name like PMRA or PARALLAX_IVAR, or include the unit every time I wrote RA/DEC/etc.

moustakas commented 1 year ago

I really like @geordie666's solution, and if I had thought of it (or known about it) I would have implemented something like that for FastSpecFit!

weaverba137 commented 1 year ago

This seems reasonably straightforward, but there is a technical detail related to comments that should be mentioned. There are two possible ways to add a comment to a column in a binary table. There's the way everyone probably thinks of immediately:

TTYPE01 = 'Z    '       / Redshift

and then there is:

TTYPE01 = 'Z    '
TCOMM01 = 'Redshift'

The latter is probably what we actually want, because there is a known issue with fitsio related to comments on TTYPExx header keywords.

weaverba137 commented 1 year ago

In terms of strictness, I think it would be more reasonable to emit a warning if a column doesn't exist in the FITS file, rather than raise an error. After all, if there is no column to add units or comments to, then the function can simply do nothing, and that's probably OK.

aureliocarnero commented 1 year ago

In terms of where to read the units and descriptions from, I think that with the ongoing effort to create the column_descriptions file in the desidatamodel repository, maybe it would make sense to read these informations from there, having a centralized document for all the DESI columns.

aureliocarnero commented 1 year ago

Also, a comment about the function interface, in astropy.io.fits, the column in the header associated with units is TUNITnn, not BUNITnn

moustakas commented 1 year ago

In terms of where to read the units and descriptions from, I think that with the ongoing effort to create the column_descriptions file in the desidatamodel repository, maybe it would make sense to read these informations from there, having a centralized document for all the DESI columns.

This shouldn't be the final word, but as the developer of a code (FastSpecFit) which depends on some but not all of the desihub software, I would be more in favor of putting the unit definitions in desiutil, which is a critical dependency (unlike desidatamodel, which I really don't want to have to bring in). And if the decision is made to put these definitions in desiutil, I'd be happy to contribute by including a wider range of non-standard units like equivalent widths, etc.

geordie666 commented 1 year ago

I agree with @moustakas. The dependency should be the other-way-around. i.e., if anything, desidatamodel would have a dependency on the unit tables in desiutil.

I think it would be fine to copy across the unit tables from desitarget as a first step, fixing the non-FITS-compliant units. A possible test for how to fix those non-compliant units is noted at the bottom of the desitarget data model page.

ashleyjross commented 1 year ago

How difficult would it be to move the column descriptions work to desiutil and work to merge everything there?

aureliocarnero commented 1 year ago

Cool! Sounds readonable to have it in desiutil. Actually, the effort in the datamodel was because from the beggining units and descriptions were not present in the fits files and filling those by hand was very tiresome, but if we develop tools in utils that add/check those directly to the fits files, datamodel would directly inherit those infos from the fits files.

What Im in favour is to have just one centralized description file for this, and at the moment, the one in datamodel is more complete, so we should merge both.

El mar, 28 mar 2023, 17:01, ashleyjross @.***> escribió:

How difficult would it be to move the column descriptions work to desiutil and work to merge everything there?

— Reply to this email directly, view it on GitHub https://github.com/desihub/desiutil/issues/191#issuecomment-1487200782, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADXSAV35M7RZFRVP7USEX2LW6MDM5ANCNFSM6AAAAAAWF3RQMQ . You are receiving this because you commented.Message ID: @.***>

sbailey commented 1 year ago

It feels like "the better is the enemy of the good" has impacted this ticket. i.e. it would quite helpful to have utility scripts/functions to add units to FITS tables (the good), even if we don't yet have fully automated support for propagating units of standard column names (the better). And even if/when we implement "the better", we still need "the good" for cases that aren't covered by the default list of column:units.

weaverba137 commented 1 year ago

I recommend that in desiutil we concentrate purely on the low-level adding units functionality. Desidatamodel can inherit from that and convert the column:units definitions inheriting the low level functionality from desiutil.

sbailey commented 1 year ago

We need this for post-facto patching DR1 redshift catalogs, so I'm going to add it to the DR1 Patching project.

weaverba137 commented 1 year ago

Corner cases to consider:

weaverba137 commented 1 year ago

I've discovered two problems: astropy/astropy#15255 and astropy/astropy#15256. Although these involve different types of FITS HDUs, the effect is the same: it may not be possible to create a byte-for-byte exact copy of a FITS file when copying HDUs from one file to another.

Why is this so serious? Presumably, when we are modifying the units in a HDU, we absolutely do not want to modify any other HDU. But these bugs indicate that other HDUs may be modified as a side effect.

I have a workaround for astropy/astropy#15256. It may be possible to develop a workaround for the other bug, but I just want to prepare the ground in case I cannot, or it takes longer than a few days.

sbailey commented 1 year ago

OK, thanks for identifying those gotchas. For the moment, let's shift focus from the original "add units to a pre-existing file" to making it easier to create the unit-ful file in the first place with a function for "add units to a pre-existing table" that would be used before writing the file in the first place. That second case (adding units to a Table based on column names) is also what we most immediately need for updating desi_zcatalog to re-add units while writing the summary catalogs.

i.e. add the glue to parse an input file like column_descriptions.csv and use that to add units to table columns, while providing options for what to do if a given column already has units.

Bonus points if you can also propagate the descriptions. I don't know if/how those are internally handled by astropy Tables for whether that is an easy or hard task.

weaverba137 commented 1 year ago

OK, that should be relatively easy, at least for units.

I was already developing the functionality to add comments (i.e. the descriptions) to FITS files. I'm not sure how that works for Tables.

In other news, it turns out that even though the TCOMMxx header card is used, it is not (yet) part of the FITS standard, so we might have to fall back on using header card style comments:

TTYPE01 = 'Z    '       / Redshift
weaverba137 commented 1 year ago

Further technical details on the relationship between FITS binary tables and Astropy Tables:

When Table.read() reads a FITS binary table (BinTableHDU), the keywords for the HDU are preserved in the .meta attribute, along with their values. However, all FITS header comments are lost.

This means that round-tripping Table.read() --> add units --> somehow add comments --> Table.write() will not add comments. One would have to add units to the table, then convert the table to a BinTableHDU, then add comments.

moustakas commented 1 year ago

I haven't tested this with the latest versions of astropy, but another issue I tripped up on with FastSpecFit is that calling astropy.table.vstack on two astropy.table.QTables which have units does not preserve the units.

In the end what I do is to add the units right before writing out.

Please ignore if this comment isn't helpful.

weaverba137 commented 1 year ago

Have you reported that issue to astropy? If so, what is the issue number?

But good point, the intent of this function is to do exactly what you do, add the units back in after creating or modifying the table.

sbailey commented 1 year ago

Two gotchas:

  1. Many of the comments in desidatamodel column_descriptions.csv are longer than fit in FITS due to the 80 column limit per line resulting in truncated comments like

    TTYPE3  = 'SUBPRIORITY'        / Random subpriority [0-1) to break assignment ti

    In some cases we could reasonably trim down the comment to fit without losing info; in other cases the description needs enough caveats that we may need to retain the longer comment for the desidatamodel documentation and support a "ShortDescription" column with an abbreviated version, e.g. "Declination" instead of "Barycentric ICRS Declination tied to Gaia DR2 blah blah"

  2. FITS doesn't formally support nanomaggies, and astropy.io.fits.convenience.table_to_hdu is picky about that:

    
    from desiutil.annotate import annotate_table
    from astropy.table import Table
    from astropy.io import fits

t = Table() t['RA'] = np.arange(5) t['FLUX'] = np.arange(5)

annotate_table(t, units=dict(RA='deg', FLUX='nanomaggy'), inplace=True) print(t)

hdu = fits.convenience.table_to_hdu(t) print(hdu.header)

Results:

RA FLUX deg nmgy


0 0 1 1 ... WARNING: The unit 'nmgy' could not be saved in native FITS format and cannot be recovered in reading. It can roundtrip within astropy by using QTable both to write and read back, though one has to enable the unit before reading. [astropy.io.fits.convenience] XTENSION= 'BINTABLE' / binary table extension
BITPIX = 8 / array data type
NAXIS = 2 / number of array dimensions
NAXIS1 = 16 / length of dimension 1
NAXIS2 = 5 / length of dimension 2
PCOUNT = 0 / number of group parameters
GCOUNT = 1 / number of groups
TFIELDS = 2 / number of table fields
TTYPE1 = 'RA '
TFORM1 = 'K '
TUNIT1 = 'deg '
TTYPE2 = 'FLUX '
TFORM2 = 'K '
END



Note that `annotate_table` successfully added nanomaggy -> nmgy to the FLUX column, but `astropy.io.fits.convenience.table_to_hdu` stripped it back out and the resulting HDU does not have a TUNIT2 for the TTYPE1='RA' column.

One the plus side, `desispec.io.util.write_bintable(filename, table, units=dict_of_units)` works since it adds the TUNITn keywords "by hand" based upon dict_of_units[colname] without checking that they are valid FITS units (i.e. let the user violate the standard if they want to or don't otherwise know, which could be considered a feature or a bug...)
weaverba137 commented 1 year ago

Right, I already made a note about long comments.

It looks like we have a convenient clue in the warning. We use QTable instead of Table, and pre-define the unit, as suggested in https://docs.astropy.org/en/stable/units/combining_and_defining.html.

weaverba137 commented 1 year ago

Some notes on defining units:

>>> import astropy.units as u
>>> maggy = u.def_unit('maggy', 3631.0 * u.Jy)
>>> maggy
Unit("maggy")
>>> 1.0 * maggy 
<Quantity 1. maggy>
>>> (1.0 * maggy).to(u.Jy)
<Quantity 3631. Jy>
>>> nanomaggy = u.def_unit('nanomaggy', 1.0e-9 * maggy)
>>> (1.0 * nanomaggy).to(u.Jy)
<Quantity 3.631e-06 Jy>
>>> u.Magnitude(1.0 * maggy)
<Magnitude -0. mag(maggy)>
>>> u.Magnitude(1.0 * nanomaggy)
<Magnitude -0. mag(nanomaggy)>

One would need to similarly account for all the spelling variations: maggie, maggies, nmgy, nanomaggies, etc.

sbailey commented 1 year ago

When reading a redshift catalog with nanomaggy units, astropy.table.Table complains with:

WARNING: UnitsWarning: 'nanomaggy' did not parse as fits unit:
At col 0, Unit 'nanomaggy' not supported by the FITS standard.
If this is meant to be a custom unit, define it with 'u.def_unit'.
To have it recognized inside a file reader or other code, enable
it with 'u.add_enabled_units'. For details, see
https://docs.astropy.org/en/latest/units/combining_and_defining.html 
[astropy.units.core]

I have not succeeded in following those instructions to eliminate the message, however.

Attempt 1, define "nanomaggy":

import astropy.units as u
nanomaggy = u.def_unit('nanomaggy', 1.0e-9 * 3631*u.Jy)
u.add_enabled_units( [nanomaggy,] )

Result:

ValueError: Object with name 'nanomaggy' already exists in namespace.
Filter the set of units to avoid name clashes before enabling them.

Attempt 2, use the "nanomaggy" already defined in astropy:

import astropy.units as u
u.add_enabled_units( [u.Unit('nanomaggy'),] )
from astropy.table import Table
zcat = Table.read('/pscratch/sd/s/sjbailey/desi/dev/zcat_options/ztile-sv2-dark-cumulative-main.fits')

Result, same warning message as before:

WARNING: UnitsWarning: 'nanomaggy' did not parse as fits unit:
At col 0, Unit 'nanomaggy' not supported by the FITS standard.
If this is meant to be a custom unit, define it with 'u.def_unit'.
To have it recognized inside a file reader or other code, enable
it with 'u.add_enabled_units'. For details, see
https://docs.astropy.org/en/latest/units/combining_and_defining.html 
[astropy.units.core]

i.e. astropy.units knows about nanomaggy, but how do we get the astropy.io.fits reader to acknowledge that?

weaverba137 commented 1 year ago

Indeed, maggy is defined in astropy.units.photometric. However, it is nevertheless not part of the FITS standard. That is what the first two lines of the warning say.

I don't know why the second part of the warning is printed though, and we may have to create an issue on astropy to explore that.

sbailey commented 1 year ago

Yeah, (nano)maggy is not part of the FITS standard, but the second part of the error message implies that there is some way of registering the unit to have it recognized as valid anyway, which I thought would silence the warning.

As a brute force alternative, elsewhere in desispec.io we've used code like warnings.filterwarnings('ignore', message="'.*nanomaggies.* did not parse as fits unit.*") to silence warnings that we didn't care about.

weaverba137 commented 1 year ago

Here's the problem, it's built into the mechanism to parse a unit as a FITS unit, and therefore the message is actually misleading.

In other words, since maggy is not in the FITS standard, it can't be parsed as a FITS unit, even though it is defined in other unit systems. But the error message doesn't know about that.

weaverba137 commented 1 year ago

So I think this should be reported upstream, I'll try to take care of that later today. Filtering the warning, since it is clearly misleading, may be the best option.

weaverba137 commented 1 year ago

@sbailey, regarding your code snippet above. If I try to reproduce it, I do get the same warning, but the table is read and the column does have units of nanomaggy. So is there something I'm missing, or do you just not want to see the warning?

sbailey commented 1 year ago

Yes, the table is read and the units are fine. The only problem is the extraneous verbose warning message.