LSSTDESC / SSim_DC1

Configuration, production, validation specifications and tools for the DC1 Data Set.

Prepare subset (single patch?) of data to transfer to lsst-dev for QA setup #42

Closed cwwalter closed 6 years ago

cwwalter commented 7 years ago

@laurenam from DM at Princeton has a set of QA scripts that make diagnostic plots to check the quality of the DM processing. Look here to see examples used on HSC:

https://jira.lsstcorp.org/browse/DM-10044

If we make a subset of the data (perhaps one patch?) and get it to lsst-dev, Lauren has offered to make the tweaks to get her scripts to run on our output. Thanks Lauren! Then, we can install and run them ourselves at NERSC on our full output and later runs.

So the first thing is understanding what part of the output is actually needed by Lauren and if it is easy to just copy a sub-directory from the output directories. Then, someone with lsst-dev access will need to copy them to the appropriate area.

laurenam commented 7 years ago

I basically need the outputs of a processing run in a properly butlerized repo (I usually think of a given processing as a rerun). For the visit-level analysis, a single visit will suffice to test out the code, and at the coadd level, a single patch should do (I don't need all the visit-level inputs that went into the coadd). We have a colorAnalysis script to look at the stellar locus on color-color plots, but I'm under the impression you only have one band in this DC, so I won't be able to test those scripts yet.

I should also add that, in order to do this work, I will first need to talk with my pointy-haired superiors to see if they can prioritize this and get it into a sprint...but I am willing!

cwwalter commented 7 years ago

Thanks!

> I basically need the outputs of a processing run in a properly butlerized repo (I usually think of a given processing as a rerun). For the visit-level analysis, a single visit will suffice to test out the code, and at the coadd level, a single patch should do (I don't need all the visit-level inputs that went into the coadd).

The thing that I am a bit confused by (and it is probably trivial) is that we do have a properly butlerized repo. But it has 75 TB in it! So, I'm not sure how to copy just the pieces you need out of it. Is that obvious?

> We have a colorAnalysis script to look at the stellar locus on color-color plots, but I'm under the impression you only have one band in this DC, so I won't be able to test those scripts yet.

Yes, DC1 is 40 sq degrees in r-band only. DC2 will likely be 300 sq degrees in all 6 bands.

laurenam commented 7 years ago

Sorry for the confusion. The QA scripts are run as CommandLineTasks, so running them is very similar to running any other CommandLineTask (e.g. processCcd.py). As an example, on lsst-dev, I used the command:

```
hscVisitAnalysis.py /datasets/hsc/repo/ --rerun RC/w_2017_28/DM-11184/:private/lauren/DM-11090/w_2017_28/ --id visit=1166 ccd=0..8^10..103 --tract=9813 --config doApplyUberCal=False
```

to produce the plots for the processed single-frame visit data that lives in /datasets/hsc/repo/rerun/RC/w_2017_28/DM-11184 (and the tract information is actually required for the visit-level analysis, so your skymap is required here as well). The output goes to /datasets/hsc/repo/rerun/private/lauren/DM-11090/w_2017_28/ in this case. While that repo contains ~8 TB, to run the above I only need the outputs relevant to that particular visit (including the schema, config, metadata, and, of course, the calexps and catalogs).
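As an aside, the `ccd=0..8^10..103` part of the `--id` argument uses the DM command-line dataId syntax, where `^` separates alternatives and `a..b` is an inclusive range. A minimal sketch of that expansion (`expand_id_values` is a hypothetical helper for illustration, not part of the stack; the real parser supports more features, such as strides):

```python
def expand_id_values(spec):
    """Expand a DM-style dataId value like '0..8^10..103' into a list of ints.

    '^' separates alternatives; 'a..b' is an inclusive range.
    """
    values = []
    for part in spec.split("^"):
        if ".." in part:
            lo, hi = part.split("..")
            values.extend(range(int(lo), int(hi) + 1))
        else:
            values.append(int(part))
    return values

# ccd=0..8^10..103 selects every CCD on the focal plane except ccd 9
ccds = expand_id_values("0..8^10..103")
print(len(ccds))  # 103 CCDs (0-8 and 10-103)
```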

The same goes for coadd, where the data is identified by tract/patch/filter, e.g. --id tract=9813 patch=4,5 filter=HSC-I. So there I would only need all the deepCoadd output relevant to patch 4,5.

So you would only need to copy over the content of the repo relevant to a single visit and a single patch...does that make sense?

I will need to add some datasets to obs_lsstSim since they are not part of the common list in obs_base.

cwwalter commented 7 years ago

> So you would only need to copy over the content of the repo relevant to a single visit and a single patch...does that make sense?

Right. So this is the part I am asking for help on.

We have 75 TB of stuff in these directories.

```
cori08:DC1-imsim-dithered % ls
_mapper             deep_assembleCoadd_metadata/     ref_cats@
background_values/  deep_makeCoaddTempExp_metadata/  ref_cats_orig/
calexp/             eimage/                          registry.sqlite3
config/             icExp/                           schema/
deepCoadd/          icSrc/                           src/
deepCoadd-results/  processEimage_metadata/          srcMatch/
```

Each of the directories has all of the visits in them. So, is there a straightforward way for us to extract what you need?

laurenam commented 7 years ago

Ok, looking at LsstSimMapper.yaml, I think I need:

Also, I will need to be able to setup and use the same reference catalogs you used...how does that work for sims?

cwwalter commented 7 years ago

> Also, I will need to be able to setup and use the same reference catalogs you used...how does that work for sims?

I'm going to let @SimonKrughoff answer or point you to the correct person (either @danielsf or @jchiang87 may also know).

Is there someone interested in working with Lauren to copy these files for her and then later using her DM QA scripts to run on our output? @jamesp-epcc Would you be interested in this as a way to start to get familiar with the output?

jamesp-epcc commented 7 years ago

Yes, I am happy to give this a go. It sounds like a good way to get familiar with the data.

jamesp-epcc commented 7 years ago

Sorry for asking such a basic question, but what is the full path to the "DC1-imsim-dithered" directory on Cori? I've had a look in a few places but can't find it. It may be in a location that I don't currently have permission to access...

cwwalter commented 7 years ago

/global/cscratch1/sd/descdm/DC1/DC1-imsim-dithered

danielsf commented 7 years ago

You can find the DC1 reference catalog at

/project/projectdirs/lsst/danielsf/dc1_reference_catalog_8deg_radius.txt

jamesp-epcc commented 7 years ago

Thanks. I am able to access the data now. I have written a script that extracts the files Lauren listed above for a single visit and single patch. This comes to about 5 or 6GB for the ones I have tried so far. What would be the best way to transfer this data to Lauren?
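The extraction script itself isn't posted in the thread, but the approach can be sketched roughly as below. This is a hypothetical illustration: `extract_visit` and its substring-based visit matching are stand-ins for proper butler queries, and the actual obs_lsstSim path conventions may differ.

```python
import os
import shutil

# Top-level repo entries to copy regardless of visit (a guess at the
# minimum set needed: schema, config, mapper file, and registry).
ALWAYS_COPY = ("schema", "config", "_mapper", "registry.sqlite3")

def extract_visit(repo_root, dest_root, visit):
    """Copy the subset of a butlerized repo relevant to one visit.

    Hypothetical sketch: obs_lsstSim encodes the visit number in the
    file paths, so a substring match on the visit id stands in for
    real butler dataset queries.
    """
    visit_tag = str(visit)
    for dirpath, _dirnames, filenames in os.walk(repo_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, repo_root)
            top = rel.split(os.sep)[0]
            if top in ALWAYS_COPY or visit_tag in rel:
                dst = os.path.join(dest_root, rel)
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                shutil.copy2(src, dst)
```

Patch-level coadd products (e.g. under deepCoadd-results/) would need an analogous match on the patch id such as 18,13.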

cwwalter commented 7 years ago

Lauren needs this data on lsst-dev. This is a machine used by the LSST developers at the NCSA center in Illinois. I don't have access to the machine but there are many who do. We need to find someone who can copy the data there (since Lauren doesn't have access to NERSC). Simon is out for a few days since he is in the process of moving to Tucson.

Who is a good candidate for this?

jamesp-epcc commented 7 years ago

I don't have access to lsst-dev myself, but I have put the extracted data in /project/projectdirs/lsst/jamesp/extracted on Cori, where I believe someone else on the project should be able to read it. (This data is for visit 1993939 and patch 18,13 but I can easily rerun the script if a different visit or patch is preferable).

jchiang87 commented 7 years ago

I can do it.

cwwalter commented 7 years ago

> I can do it.

Great.

jchiang87 commented 7 years ago

I've copied the data from /project/projectdirs/lsst/jamesp/extracted to /home/jchiang/DC1/extracted on lsst-dev. @laurenam let me know if you have problems accessing those files.

laurenam commented 7 years ago

```
lauren@lsst-dev01:~ $ ls /home/jchiang/DC1/extracted
ls: cannot access /home/jchiang/DC1/extracted: Permission denied
```

@jchiang87 I think you need to give me read permission on your home dir.

jchiang87 commented 7 years ago

@laurenam I've changed the permissions. Please try again.

laurenam commented 7 years ago

Have you run the forced measurements on the coadds as part of DC1? I don't see the forced source files in the directory @jchiang87 created on lsst-dev /home/jchiang/DC1/extracted. Of potential note, forcedPhotCoadd.py gets run when using the multibandDriver.py driver script in pipe_drivers, but not when running the non-driver multiband.py command-line version. My plotting scripts make great use of the forced output (it's an extremely important dataset for many science cases)...would it be possible for you to run that on (at least) tract=0 patch=18,13 and add the output to the above directory?

laurenam commented 7 years ago

I might be able to run it myself, but I would need the contents of the deepCoadd-results/merged/ directory for tract=0 patch=18,13.

jchiang87 commented 7 years ago

I'm fairly certain that we don't run forcedPhotCoadd.py in our version of the Level 2 pipeline, but @SimonKrughoff or @tony-johnson would be able to say more definitively. In case you want to try to run it yourself, I copied the contents of that directory from NERSC to /home/jchiang/DC1/extracted/deepCoadd-results/merged/0/18,13 on lsst-dev. Otherwise, if you send me the full command line, I could try to run it at NERSC.

laurenam commented 7 years ago

Thanks...I think I've got it running now. For reference, the command line looks something like:

```
forcedPhotCoadd.py /home/jchiang/DC1/extracted/ --output /datasets/hsc/repo/rerun/private/lauren/DM-11452/ --id tract=0 patch=18,13 filter=r
```

cwwalter commented 7 years ago

Just for my education:

I thought forced photometry was run on all of the warped exposures in order to get the flux we see for cmodel and psf_flux. Is that something else?

jchiang87 commented 7 years ago

We are running forcedPhotCcd.py on the warped images (?) for each visit for the lightcurves in Twinkles (but not for the DC1 data). I think the cmodel and psf_flux measurements for DC1 are obtained from the measureCoaddSources.py task. Not sure how those results would differ from the output of forcedPhotCoadd.py.

laurenam commented 7 years ago

Some info on forced measurements from Jim Bosch's HSC pipeline paper (https://arxiv.org/abs/1705.06766):

> The final step is another run of the source measurement suite, but this time in forced mode: we hold all position and shape parameters fixed to the values from the previous measurement in the reference band. This ensures that the forced measurements are consistent across bands and use a well-constrained position and shape, which is particularly important for computing colors from differences between magnitudes in different bands.
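The idea can be illustrated with a toy example (pure numpy, not the DM implementation): localize a source once in the reference band, then measure flux at that same fixed position in every band, so colors come from a consistent, well-constrained position.

```python
import numpy as np

# Toy illustration of forced measurement: the source position is measured
# once in the reference band and then held fixed while flux is re-measured
# at that same pixel position in every other band.

def aperture_flux(image, x, y, radius):
    """Sum pixel values within `radius` of the fixed position (x, y)."""
    yy, xx = np.mgrid[0:image.shape[0], 0:image.shape[1]]
    return image[(xx - x) ** 2 + (yy - y) ** 2 <= radius ** 2].sum()

# Position measured once in the (deepest) reference band...
ref_x, ref_y = 10, 10
bands = {"r": np.zeros((21, 21)), "i": np.zeros((21, 21))}
bands["r"][10, 10] = 100.0
bands["i"][10, 10] = 60.0

# ...then forced onto all bands, giving colors at a consistent position.
fluxes = {b: aperture_flux(img, ref_x, ref_y, radius=3) for b, img in bands.items()}
```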


cwwalter commented 7 years ago

Ah, so if this is forced across bands for the merged exposures, is it relevant for us since we are only using r-band in DC1?

laurenam commented 7 years ago

Good point! Take the above as useful information for future, multiband, releases :)