desihub / desisim

DESI simulations

quickbrick #39

Closed sbailey closed 8 years ago

sbailey commented 9 years ago

We propose "quickbrick", similar to what the Lyman-alpha forest group did for QSOs at the Argonne 2015 workshop: go directly from templates to brick files, skipping the intermediate steps. This is similar to what could be accomplished with newexp-desi + quickgen + desi_make_bricks, but streamlines the process. Its primary use would be outputting brick files for redshift fitting and spectrum-based analysis development. This would enable a quick trick script stack to generate a quick trick brick set. :)

Outputs

Irritating detail: the brick files are still exposure-based; if an object has more than one exposure, it gets more than one entry per file, but the final output redshifts have only one row per object. Should the truez file be row-matched to the brick files (and thus possibly have repeated entries), or should it have only one row per object? Should we require that future truez and zbest files list their objects in TARGETID-sorted order so that they can be row-matched?
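
For reference, row matching is cheap to recover even when the order is not guaranteed. A minimal numpy sketch (the array names are hypothetical):

import numpy as np

# Hypothetical TARGETID arrays read from a truez file and a zbest file.
truth_ids = np.array([30, 10, 20])
zbest_ids = np.array([10, 20, 30])

# For each zbest row, find the matching truth row; if both files were
# guaranteed to be in TARGETID-sorted order, this would reduce to a
# direct row-for-row correspondence.
order = np.argsort(truth_ids)
match = order[np.searchsorted(truth_ids, zbest_ids, sorter=order)]
assert np.all(truth_ids[match] == zbest_ids)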

Notes:

- It would use the desisim.templates.*.make_templates() infrastructure to generate template realizations based upon the files in $DESI_BASIS_TEMPLATES. New or updated templates should come in via that infrastructure.

- It would use specsim, which uses the parameters in desimodel, to model the throughput, resolution, and noise.
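
Roughly, the front half of that flow would look something like this (a sketch only; the make_templates signature shown is an assumption, and the specsim step is elided):

import desisim.templates

# Generate noiseless template realizations from the basis templates in
# $DESI_BASIS_TEMPLATES (keyword values are illustrative).
flux, wave, meta = desisim.templates.ELG().make_templates(nmodel=100, seed=1)

# flux/wave would then go through specsim to apply throughput,
# resolution, and noise from the desimodel parameters, and the noisy
# spectra plus meta would be written out as brick files.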

Interface

Usage: quickbrick [options]

Options:
  -h, --help            show this help message and exit
  -b BRICKNAME, --brickname=BRICKNAME
                        brickname
  -f FLAVOR, --flavor=FLAVOR
                        comma separated list of elg,lrg,qso,star,sky,bgs
  -n NSPEC, --nspec=NSPEC
                        number of spectra to simulate
  -o OUTDIR, --outdir=OUTDIR
                        output data directory; default .
  --outdir_truth=OUTDIR_TRUTH
                        optional alternative outdir for truth files
julienguy commented 9 years ago

The flavor option of quickbrick should also include a 'mix' (or 'science') flavor to allow testing of target class identification.

sbailey commented 9 years ago

I was embarrassed to tell the Milky Way Survey people that "newexp-desi --flavor science" was only for the dark time survey and was not very inclusive of their science. How very Berkeley of me. We should avoid using "science" to mean only the dark time survey, and instead use keys like "dark" and "bright" for the various dark time or bright time survey mixes of object classes.

"flavor" was borrowed from SDSS exposure flavors (arc, flat, science...). Perhaps this option is better named "type" or "class" rather than "flavor"?

dkirkby commented 9 years ago

How about saving the true flux as an optional HDU in the standard brick format, to avoid adding three extra files? The same strategy could be used for the truth, but it would then be duplicated for each camera, so a separate file makes more sense there.

Note that the truth consists of more than TRUEZ. For example, which template was used for the simulation, was a DLA added, etc. I suggest defining a few required binary table columns and then allowing arbitrary additional columns.
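
For example, something along these lines (a hedged sketch; the column names other than TRUEZ are illustrative, not a proposed spec):

from astropy.table import Table

# A truth table with a couple of required columns plus arbitrary
# class-specific optional columns.
truth = Table()
truth['TARGETID'] = [100, 101]         # required
truth['TRUEZ'] = [1.13, 0.72]          # required
truth['TEMPLATEID'] = [17, 42]         # optional: which template was used
truth['OIIFLUX'] = [8.0e-17, 7.2e-17]  # optional: ELG-specific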

sbailey commented 9 years ago

We had considered packing the truth into extra HDUs or columns of the original brick files, but we also wanted the ability to temporarily hide the truth on a validation sample while still having it available for later comparison. It seemed odd to define that sometimes the extra truth information would be in the original files while other times it would be in a different file. That tipped the scales for keeping it in a separate file from the start.

Working through this does make me wonder if splitting the brick files into b vs. r vs. z is unnecessary. Grouping them together would mean extnames like FLUX_B, IVAR_R, RESOLUTION_Z, etc. and 15 HDUs per file instead of 3 files of 5 HDUs each, but that doesn't seem too crazy. Or we could even use the more flexible hierarchy of hdf5 (gasp) ... but that is beyond the scope of this particular ticket. For the most part the I/O of this should be an implementation detail at the boundaries of this code rather than baked into how it is fundamentally done.

Agreed about the required columns + optional columns. I'm not sure how to express that cleanly for truth files that contain a mix of target classes: OIIFLUX makes sense only for ELGs and DLA only for QSOs, while REDSHIFT applies to both. Maybe pad with NaNs or default values when a column doesn't apply to that target type?

dkirkby commented 9 years ago

It sounds like there are two use cases, which nudge the design in different directions:

- simulations for pipeline and algorithm development, where having the truth right alongside the data is most convenient;
- blind validation tests, where the truth must be hidden from the analysis.

I recommend designing for whichever of these we expect to be used most heavily, to maximize the responsiveness of the design for most users at the cost of extra overhead for the edge case. If neither scenario is an edge case, they might need different solutions.

sbailey commented 9 years ago

Julien made similar comments. Simulations for pipeline development is the primary use case so it seems that I should give up on the purity of a single data model for both cases and allow the truth to be optionally in the original data file or in a separate file.

moustakas commented 8 years ago

This is a great suggestion. A few comments:

dkirkby commented 8 years ago

We could easily end up with tens of truth parameters associated with each template class, and the set of parameters (and possibly also template classes) will likely evolve. I think this argues for some sort of (key, value) dictionary data structure, with some required keys (TRUEZ, CLASS, ...) and minimal restrictions on additional optional keys.

The dictionary would naturally map onto a FITS HDU header, either in the brick file or an associated file, but other choices are also possible. I believe this covers the important use cases:

Are there any other use cases we should be considering?

sbailey commented 8 years ago

I might be missing something about @dkirkby's suggestion, but I don't think it maps onto a FITS HDU header. For a given truth parameter, there will be one value per object for the objects to which it applies. HDU headers only naturally accommodate one value per parameter per HDU.

However, metadata tables for different template classes with different truth columns can still be combined using astropy.table.vstack([elgmeta, lrgmeta, ...], join_type='outer'), e.g.:

In [26]: x = Table(dict(id=[1,2,3], a=[1,2,3], b=[4,3,2]))

In [27]: y = Table(dict(id=[4,5,6], a=[3,4,5], c=[2,3,4]))

In [28]: vstack([x, y], join_type='outer')
Out[28]: 
<Table masked=True length=6>
  a     b     id    c  
int64 int64 int64 int64
----- ----- ----- -----
    1     4     1    --
    2     3     2    --
    3     2     3    --
    3    --     4     2
    4    --     5     3
    5    --     6     4

Detail: join_type='outer' here is different from what you get with astropy.table.join(), which is more DB-like but doesn't handle the common columns the way we want for this case.

The masked values unfortunately don't survive a roundtrip to a FITS file -- they end up as zeros instead of masked values. We might be able to get them to be NaNs instead.
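
One possibility (an untested sketch) is to give the masked floating-point columns an explicit NaN fill value before writing:

import numpy as np
from astropy.table import Table, vstack

x = Table(dict(id=[1, 2, 3], a=[1, 2, 3], b=[4., 3., 2.]))
y = Table(dict(id=[4, 5, 6], a=[3, 4, 5], c=[2., 3., 4.]))
merged = vstack([x, y], join_type='outer')

# Give masked float columns a NaN fill value so "not applicable"
# survives the trip through FITS; integer columns would still need a
# sentinel value instead.
for col in merged.columns.values():
    if getattr(col, 'mask', None) is not None and col.dtype.kind == 'f':
        col.fill_value = np.nan

merged.filled().write('truth.fits', overwrite=True)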

There is a prototype implementation of bin/quickbrick on the quickbrick branch of desispec. For the truth table, it uses the metadata table returned by desisim.templates.XYZ.make_templates(), to which it adds OBJTYPE and TARGETID columns, and then writes this as an extname=_TRUTH HDU in the brick file. This v0 supports only one objtype per file, but multiple objtypes could be supported by stacking the metadata tables as described above.
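
Schematically, the truth HDU step is something like the following (a sketch of the idea rather than the actual prototype code; the filename and values are made up):

from astropy.io import fits
from astropy.table import Table

# Truth metadata table as returned by make_templates(), with OBJTYPE
# and TARGETID columns added.
truth = Table()
truth['TARGETID'] = [100, 101]
truth['OBJTYPE'] = ['ELG', 'ELG']
truth['TRUEZ'] = [1.13, 0.72]

# Append it to an existing brick file as an extra HDU named _TRUTH.
hdu = fits.table_to_hdu(truth)
hdu.header['EXTNAME'] = '_TRUTH'
fits.append('brick-b-bricktest.fits', hdu.data, hdu.header)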

To do before it will be ready for a pull request:

Even though it may be handy to write the truth table to the original brick files, I do want to retain the option to write it elsewhere to ensure that for final tests the code can't cheat, e.g. by using a truth value to make a quick photo-z prior.

dkirkby commented 8 years ago

I agree this doesn't fit the model of an HDU header and like Stephen's suggestion.

I wasn't thinking of the case where we merge different simulated template classes into the same output file. I suspect we won't be doing that often, but it's useful to have the capability. In that case we definitely need a required 'CLASS' string (?) column, in addition to 'TRUEZ', to specify which optional columns are expected, and perhaps also a 'CLASSVERSION' column to support schema evolution.

weaverba137 commented 8 years ago

I'm running quickbrick at NERSC right now, so I'm guessing this issue has been resolved?