desihub / desisim

DESI simulations

quickbrick #39

Closed sbailey closed 8 years ago

sbailey commented 9 years ago

We propose "quickbrick", similar to what the Lyman-alpha forest group did for QSOs at the Argonne 2015 workshop: go directly from templates to brick files, skipping the intermediate steps. This is similar to what could be accomplished with newexp-desi + quickgen + desi_make_bricks, but streamlines the process. Its primary use would be outputting brick files for redshift fitting and spectrum-based analysis development. This would enable a quick trick script stack to generate a quick trick brick set. :)

Outputs

Irritating detail: the brick files are still exposure-based; if an object has more than one exposure, it gets more than one entry per file, but the final output redshifts have only one row per object. Should the truez file be row-matched to the brick files (and thus possibly have repeated entries), or should it have only one row per object? Should we require that future truez and zbest files list their objects in TARGETID-sorted order so that they can be row-matched?
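
For reference, row matching is cheap to recover even when the order is not guaranteed. A minimal numpy sketch (the array names are hypothetical):

import numpy as np

# Hypothetical TARGETID arrays read from a truez file and a zbest file.
truth_ids = np.array([30, 10, 20])
zbest_ids = np.array([10, 20, 30])

# For each zbest row, find the matching truth row; if both files were
# guaranteed to be in TARGETID-sorted order, this would reduce to a
# direct row-for-row correspondence.
order = np.argsort(truth_ids)
match = order[np.searchsorted(truth_ids, zbest_ids, sorter=order)]
assert np.all(truth_ids[match] == zbest_ids)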

Notes:

- It would use the desisim.templates.*.make_templates() infrastructure to generate template realizations based upon the files in $DESI_BASIS_TEMPLATES. New or updated templates should come in via that infrastructure.

- It would use specsim, which uses the parameters in desimodel, to model the throughput, resolution, and noise.
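
Roughly, the front half of that flow would look something like this (a sketch only; the make_templates signature shown is an assumption, and the specsim step is elided):

import desisim.templates

# Generate noiseless template realizations from the basis templates in
# $DESI_BASIS_TEMPLATES (keyword values are illustrative).
flux, wave, meta = desisim.templates.ELG().make_templates(nmodel=100, seed=1)

# flux/wave would then go through specsim to apply throughput,
# resolution, and noise from the desimodel parameters, and the noisy
# spectra plus meta would be written out as brick files.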

Interface

Usage: quickbrick [options]

Options:
  -h, --help            show this help message and exit
  -b BRICKNAME, --brickname=BRICKNAME
                        brickname
  -f FLAVOR, --flavor=FLAVOR
                        comma separated list of elg,lrg,qso,star,sky,bgs
  -n NSPEC, --nspec=NSPEC
                        number of spectra to simulate
  -o OUTDIR, --outdir=OUTDIR
                        output data directory; default .
  --outdir_truth=OUTDIR_TRUTH
                        optional alternative outdir for truth files
julienguy commented 9 years ago

The flavor option of quickbrick should also include a 'mix' (or 'science') flavor to allow testing of target class identification.

sbailey commented 9 years ago

I was embarrassed to tell the Milky Way Survey people that "newexp-desi --flavor science" was only for the dark time survey and was not very inclusive of their science. How very Berkeley of me. We should avoid using "science" to mean only the dark time survey, and instead use keys like "dark" and "bright" for the various dark time or bright time survey mixes of object classes.

"flavor" was borrowed from SDSS exposure flavors (arc, flat, science...). Perhaps this option is better named "type" or "class" rather than "flavor"?

dkirkby commented 9 years ago

How about saving the true flux as an optional HDU in the standard brick format, to avoid adding three extra files? The same strategy could be used for the truth, but it would then be duplicated for each camera, so a separate file makes more sense there.

Note that the truth consists of more than TRUEZ. For example, which template was used for the simulation, was a DLA added, etc. I suggest defining a few required binary table columns and then allowing arbitrary additional columns.
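
For example, something along these lines (a hedged sketch; the column names other than TRUEZ are illustrative, not a proposed spec):

from astropy.table import Table

# A truth table with a couple of required columns plus arbitrary
# class-specific optional columns.
truth = Table()
truth['TARGETID'] = [100, 101]         # required
truth['TRUEZ'] = [1.13, 0.72]          # required
truth['TEMPLATEID'] = [17, 42]         # optional: which template was used
truth['OIIFLUX'] = [8.0e-17, 7.2e-17]  # optional: ELG-specific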

sbailey commented 9 years ago

We had considered packing the truth into extra HDUs or columns of the original brick files, but we also wanted the ability to temporarily hide the truth on a validation sample while still having it available for later comparison. It seemed odd to define that sometimes the extra truth information would be in the original files while other times it would be in a different file. That tipped the scales for keeping it in a separate file from the start.

Working through this does make me wonder if splitting the brick files into b vs. r vs. z is unnecessary. Grouping them together would mean extnames like FLUX_B, IVAR_R, RESOLUTION_Z, etc. and 15 HDUs per file instead of 3 files of 5 HDUs each, but that doesn't seem too crazy. Or we could even use the more flexible hierarchy of hdf5 (gasp) ... but that is beyond the scope of this particular ticket. For the most part the I/O of this should be an implementation detail at the boundaries of this code rather than baked into how it is fundamentally done.

Agreed about the required columns + optional columns. I'm not sure how to express that cleanly for truth files that contain a mix of target classes: OIIFLUX makes sense only for ELGs and DLA only for QSOs, while REDSHIFT applies to both. Maybe pad with NaNs or default values when a column doesn't apply to that target type?

dkirkby commented 9 years ago

It sounds like there are two use cases, which nudge the design in different directions:

- simulations for pipeline and algorithm development, where having the truth right alongside the data is most convenient;
- blind validation tests, where the truth must be hidden from the analysis.

I recommend designing for whichever of these we expect to be used most heavily, to maximize the responsiveness of the design for most users at the cost of extra overhead for the edge case. If neither scenario is an edge case, they might need different solutions.

sbailey commented 9 years ago

Julien made similar comments. Simulations for pipeline development is the primary use case so it seems that I should give up on the purity of a single data model for both cases and allow the truth to be optionally in the original data file or in a separate file.

moustakas commented 8 years ago

This is a great suggestion. A few comments:

dkirkby commented 8 years ago

We could easily end up with tens of truth parameters associated with each template class, and the set of parameters (and possibly also template classes) will likely evolve. I think this argues for some sort of (key, value) dictionary data structure, with some required keys (TRUEZ, CLASS, ...) and minimal restrictions on additional optional keys.

The dictionary would naturally map onto a FITS HDU header, either in the brick file or an associated file, but other choices are also possible. I believe this covers the important use cases:

Are there any other use cases we should be considering?

sbailey commented 8 years ago

I might be missing something about @dkirkby's suggestion, but I don't think it maps onto a FITS HDU header. For a given truth parameter, there will be one value per object for the objects to which it applies. HDU headers only naturally accommodate one value per parameter per HDU.

However, metadata tables for different template classes with different truth columns can still be combined using astropy.table.vstack([elgmeta, lrgmeta, ...], join_type='outer'), e.g.:

In [26]: x = Table(dict(id=[1,2,3], a=[1,2,3], b=[4,3,2]))

In [27]: y = Table(dict(id=[4,5,6], a=[3,4,5], c=[2,3,4]))

In [28]: vstack([x, y], join_type='outer')
Out[28]: 
<Table masked=True length=6>
  a     b     id    c  
int64 int64 int64 int64
----- ----- ----- -----
    1     4     1    --
    2     3     2    --
    3     2     3    --
    3    --     4     2
    4    --     5     3
    5    --     6     4

Detail: join_type='outer' here is different from what you get with astropy.table.join(), which is more DB-like but doesn't handle the common columns the way we want for this case.

The masked values unfortunately don't survive a roundtrip to a FITS file -- they end up as zeros instead of masked values. We might be able to get them to be NaNs instead.
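
One possibility (an untested sketch) is to give the masked floating-point columns an explicit NaN fill value before writing:

import numpy as np
from astropy.table import Table, vstack

x = Table(dict(id=[1, 2, 3], a=[1, 2, 3], b=[4., 3., 2.]))
y = Table(dict(id=[4, 5, 6], a=[3, 4, 5], c=[2., 3., 4.]))
merged = vstack([x, y], join_type='outer')

# Give masked float columns a NaN fill value so "not applicable"
# survives the trip through FITS; integer columns would still need a
# sentinel value instead.
for col in merged.columns.values():
    if getattr(col, 'mask', None) is not None and col.dtype.kind == 'f':
        col.fill_value = np.nan

merged.filled().write('truth.fits', overwrite=True)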

There is a prototype implementation of bin/quickbrick on the quickbrick branch of desispec. For the truth table, it uses the metadata table returned by desisim.templates.XYZ.make_templates(), to which it adds OBJTYPE and TARGETID columns, and then writes this as an extname=_TRUTH HDU in the brick file. This v0 supports only one objtype per file, but multiple objtypes could be supported by stacking the metadata tables as described above.
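
Schematically, the truth HDU step is something like the following (a sketch of the idea rather than the actual prototype code; the filename and values are made up):

from astropy.io import fits
from astropy.table import Table

# Truth metadata table as returned by make_templates(), with OBJTYPE
# and TARGETID columns added.
truth = Table()
truth['TARGETID'] = [100, 101]
truth['OBJTYPE'] = ['ELG', 'ELG']
truth['TRUEZ'] = [1.13, 0.72]

# Append it to an existing brick file as an extra HDU named _TRUTH.
hdu = fits.table_to_hdu(truth)
hdu.header['EXTNAME'] = '_TRUTH'
fits.append('brick-b-bricktest.fits', hdu.data, hdu.header)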

To do before it will be ready for a pull request:

Even though it may be handy to write the truth table to the original brick files, I do want to retain the option to write it elsewhere to ensure that for final tests the code can't cheat, e.g. by using a truth value to make a quick photo-z prior.

dkirkby commented 8 years ago

I agree this doesn't fit the model of an HDU header and like Stephen's suggestion.

I wasn't thinking of the case where we merge different simulated template classes into the same output file. I suspect we won't be doing that often, but it's useful to have the capability. In that case we definitely need a required 'CLASS' string (?) column, in addition to 'TRUEZ', to specify which optional columns are expected, and perhaps also a 'CLASSVERSION' column to support schema evolution.

weaverba137 commented 8 years ago

I'm running quickbrick at NERSC right now, so I'm guessing this issue has been resolved?