Closed: weaverba137 closed this issue 7 months ago.
To speed up the discussion a little bit, I know that units are dropped internally for performance reasons, which explains why units are not present in the redrock files. However, it seems like the assembly of `zpix` and `ztile` files would be I/O-bound, not CPU-bound, so there would be no performance penalty in adding units to those files.
A better-phrased warning might be: "the units described in the data model for DESI_SPECTRO_REDUX/SPECPROD are the units of the values in those files; however, the files do not necessarily contain the unit metadata themselves (`TUNITxxx` header cards)."
That is how I have always interpreted the meaning of the datamodel — describing the actual units of the data, not the existence (or not) of TUNIT and BUNIT keywords. I don't think that requires a specific caveat in the text of the data model.
We can separately address if there are any unintended side effects from adding TUNIT keywords to the data files in the future. I agree that in general it is preferable for us to propagate those and it was an oversight (not purposeful) that they are not included in the final redshift catalogs.
> That is how I have always interpreted the meaning of the datamodel.
But will the public interpret it that way? Especially since some files do have units and some don't.
Adding units to the files is a post-EDR activity for future specprods / releases.
For the EDR Countdown dashboard I'm moving this into the optional category for adding caveats to the datamodel and/or the known issues section of desidatadocs.
I have removed this ticket from EDR Countdown and opened desihub/desidatamodel#178 to document the specific issue of the presence or absence of units in existing files.
The units are being dropped far earlier than redrock — even the original fibermap generated by `assemble_fibermap` doesn't have units.
Short term fix for DR1:

Long term fix for Y3 reprocessing: update `assemble_fibermap` so that it outputs units in the first place.

The lowest-level place to fix this is in `desispec.io.table.read_table`, which uses fitsio under the hood and loses units. The original motivation for `read_table` was to work around the astropy 5.0 "feature" of masking NaNs instead of propagating them (fixed in later astropy versions). However, note that `Table.read(filename, 'FIBERMAP')` is ~2x slower than `Table(fitsio.read(filename, 'FIBERMAP'))` on compressed files, so even if NaNs weren't an issue, it would be preferable to use fitsio and then re-parse the header to add units, maintaining I/O efficiency.
@sbailey, in terms of closing out the 'DR1 Patching' project, what is the status of this ticket?
This is done for the purposes of DR1, but not yet for future productions, so I'll remove it from "DR1 Patching" and move it to "Jura". For DR1, units were added to the redshift catalogs at the end as part of running desi_zcatalog; for future productions we want this units propagation to be automatic through all intermediate files. Steps to check/fix:
- `--add-units` (which is intended for when the input files don't have units in the first place, like iron)

Contributions to fix any of those steps would be valuable, even if someone else needs to dig deeper to fix another step.
Note that anywhere that we use fitsio, it is for a specific purposeful reason, so we shouldn't fix this issue just by replacing fitsio with astropy.table.Table stuff.
Additional note: I originally thought that we'd have assemble_fibermap output the correct units in the first place, and then have all subsequent read+write steps propagate those units. That might be technically messy due to astropy Tables vs. astropy BinTableHDUs vs. fitsio etc. We additionally need to support pipeline steps adding new columns including their units.
An alternative to propagate+add would be to have every writer use desispec.io.fibermap.fibermap_columns to re-assert what the units should be and add those while writing, since that is likely what assemble_fibermap would be using in the first place for the original columns. This would also keep the units for all columns in a single place instead of spread over the N>1 steps that add columns.
As belt-and-suspenders, readers could also complain if they read in a fibermap column with a claimed unit different from `fibermap_columns`.
Linking to relevant issue #2171.
Since we're working on `assemble_fibermap` anyway, look into taking care of #2149.
The DESI data model claims that columns in `redrock`, `zpix`, `ztile`, and `zall` files have units, but in fact they do not. There are no `TUNIT` keywords set in any of the files in the fuji zcatalog directory, and a spot check shows that they are not set in individual redrock files either. This also makes it much more difficult to propagate units even further downstream to VACs derived from these files.
For the purposes of EDR, I think the response has to be documentation: "even though the data model claims these columns have units, in fact, the files do not contain unit metadata."
All along we've been strongly encouraging VAC authors to put units in data files, but the pipeline is not actually doing that for some of the most critical outputs, so I think for the future, we should really be more careful about propagating units.