LSSTDESC / DC2-production

Configuration, production, validation specifications and tools for the DC2 Data Set.
BSD 3-Clause "New" or "Revised" License

DR2 (and DR3) processing #398

Closed jchiang87 closed 3 years ago

jchiang87 commented 3 years ago

We have plans to produce a Data Release 2 (DR2) covering the entire 300 square degrees of DC2 using the first year (Y1) of data. I think the nominal plan is to use the Run2.2i data, and process all patches as-is, including those in the DDF at full depth.

After discussing some of the needs of the strong lensing (SL) studies with @jiwoncpark, the option arose of combining the Run2.2i and Run3.1i data for this release. In the DDF, we'd use the Run3.1i sensor-visits instead of the Run2.2i versions, and furthermore, we'd down-select the DC2 visits that overlap with the DDF so that we obtain a uniform WFD-like cadence across the entire DC2 region. The SL group could then use the coadd/multiband results in the DDF for machine learning studies to find strongly lensed systems in WFD regions. I don't think this change would adversely affect the usefulness of DR2 for other Science Working Groups, but they should comment here if this poses problems.

In addition, since SL would like at least 2-year depth at a WFD cadence, I propose that we do a DR3 just for the Run3.1i DDF data, i.e., make warps for Y2 Run3.1i data and combine them with the Y1 data to have two-year-depth coadds and multiband results in the DDF.

We can itemize the various to-do steps in this issue, but I'd like to document the work (e.g., visit down-selection) in the DC2-production wiki.

heather999 commented 3 years ago

Starting a to-do list and adding some questions.

jchiang87 commented 3 years ago

I don't think we need Eli's definitions. We just replace the Run2.2i raw files with the Run3.1i versions for the cases where they both exist. The Run3.1i sensor-visits were already selected to just cover the DDF.

Given that we're mixing Run2.2i and Run3.1i data, I think it would be easiest to make a new repo and ingest the Run3.1i raw files first; then, if there's an option to ingest the Run2.2i files into the same repo while skipping existing raw files, we'd get the dataset we want.

We'd have a single visit list and ingest the raw files for each visit. The processing pipeline wouldn't need to do any special filtering since it would just process the data in the repo.
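As a minimal sketch of that ingest order (an illustration only, not the actual ingest tooling): the directories below are placeholders, and the filename parsing assumes the lsst_a_&lt;visit&gt;_&lt;raft&gt;_&lt;sensor&gt;_&lt;band&gt;.fits pattern seen later in this thread.

```python
import glob
import os

def sensor_visit_key(path):
    """Extract (visit, raft, sensor) from an lsst_a_* raw file name,
    e.g. lsst_a_193861_R14_S12_r.fits -> ('193861', 'R14', 'S12')."""
    _, _, visit, raft, sensor, _ = os.path.basename(path).split("_")
    return visit, raft, sensor

# Placeholder paths, not the actual Run3.1i/Run2.2i raw file locations.
run31_files = glob.glob("/path/to/Run3.1i/raw/lsst_a_*.fits")
run22_files = glob.glob("/path/to/Run2.2i/raw/lsst_a_*.fits")

# Ingest all Run3.1i files first, then only the Run2.2i files whose
# sensor-visit is not already covered by a Run3.1i version.
already_ingested = {sensor_visit_key(f) for f in run31_files}
to_ingest = [f for f in run22_files
             if sensor_visit_key(f) not in already_ingested]
print(len(to_ingest), "Run2.2i raw files to ingest")
```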

heather999 commented 3 years ago

OK, that helped. I've updated the above to-do list to hopefully capture the steps more accurately.

jchiang87 commented 3 years ago

It turns out the visit selection is rather easy: the minion_1016 opsim db identifies visits associated with the WFD and DDF observations by propID. For the Y1 DC2 visits, here is a plot of the pointing directions for the WFD (propID==54) and DDF (propID==56) visits:

[Figure: DR2_pointings]

So we just need to select propID==54 from minion_1016 to identify the desired visits. I'll document this at the wiki.
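For reference, a minimal sketch of that selection, assuming the standard OpSim v3 Summary table schema (obsHistID, propID, night, ...) in minion_1016; the database path and the night &lt;= 365 definition of Y1 are assumptions.

```python
import sqlite3

conn = sqlite3.connect("minion_1016_sqlite.db")  # placeholder path
cursor = conn.execute(
    "SELECT DISTINCT obsHistID FROM Summary "
    "WHERE propID = 54 AND night <= 365"  # WFD visits in Y1 (assumed cut)
)
wfd_y1_visits = sorted(row[0] for row in cursor)
print(len(wfd_y1_visits), "Y1 WFD visits")
```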

jchiang87 commented 3 years ago

Here are visit depth maps for DC2 Y1 and Y1+Y2 using propID==54 visits. The DDF is indicated in the upper right corner.

[Figure: DC2_Y1_WFD_visit_depth_w_DDF]

[Figure: DC2_Y1Y2_WFD_visit_depth]

heather999 commented 3 years ago

Copying from Slack: https://lsstc.slack.com/archives/C978LTJGN/p1598896345006200

New DR2 repo containing the Y1 Run3.1i raw files for the visits with propID==54: /global/cscratch1/sd/descdm/DC2/DR2/repo

jchiang87 commented 3 years ago

Let's postpone the Run3.1i Y2 items until after the DR2 coadd/multiband processing is done so that we don't need any special handling of the data in the DDF for producing DR2.

jchiang87 commented 3 years ago

I've run processCcd.py on the Y1 Run3.1i raw files using the shifter image lsstdesc/desc-drp-stack:v19-dc2-run2.2-v5. I compared the as-run configs against ones for the CC-IN2P3 processing of the Run2.2i data, and they are identical.

Out of 3427 sensor-visits, there were 41 processing failures. 39 of those sensor-visits have no raw file counterpart in the Run2.2i data. Looking at one of them, it appears to have been generated without an initial checkpoint file, which is consistent with these sensor-visits not having been generated for Run2.2i. I think it's safe to ignore these 39.

The two remaining sensor-visits failed with error messages:

lsst_a_193861_R14_S12_r.fits: TaskError: Fit failed: median scatter on sky = 14.279 arcsec > 10.000 config.maxScatterArcsec

lsst_a_203610_R30_S01_i.fits: RuntimeError: No matches to use for photocal

Since it's just these two out of 3388, I'm not inclined to follow up on them, so we should skip them as well.

To compare the visit-level results for the Run3.1i and Run2.2i versions, I ran the single frame processing validation script in sims_ci_pipe on the DR2/Run3.1i outputs and on the same visits for Run2.2i. The results look consistent with each other for the photometric and astrometric accuracy and for the PSF size and m5 values. (I'll post a plot in the DR2 wiki entry.)

Based on those results, I think we're good to go for adding the Run2.2i visits to the DR2 data repo registry and sym-linking the CC-IN2P3-generated processCcd.py data products. We will just need to omit the data products for lsst_a_193861_R14_S12_r.fits and lsst_a_203610_R30_S01_i.fits.
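A minimal sketch of that symlinking step follows. The rerun paths are the real ones from this thread, but the per-visit directory layout and the exclusion of the two failed sensor-visits by filename substring are assumptions about the naming convention.

```python
import os

def link_tree(src_root, dst_root, skip_substrings=()):
    """Recreate src_root's directory tree under dst_root as symlinks,
    skipping any file whose name contains one of skip_substrings."""
    for dirpath, _, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        os.makedirs(os.path.join(dst_root, rel), exist_ok=True)
        for name in filenames:
            if any(s in name for s in skip_substrings):
                continue
            link = os.path.join(dst_root, rel, name)
            if not os.path.lexists(link):
                os.symlink(os.path.join(dirpath, name), link)

# The two failed sensor-visits, identified by visit-band-raft-sensor
# substrings (assumed to match the data product file names):
link_tree(
    "/global/cscratch1/sd/descdm/DC2/Run2.2i-parsl/v19.0.0-v1/rerun/run2.2i-calexp-v1-copy",
    "/global/cscratch1/sd/descdm/DC2/DR2/repo/rerun/dr2-calexp",
    skip_substrings=("00193861-r-R14-S12", "00203610-i-R30-S01"),
)
```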

I've added some items related to sky correction to the task list above.

heather999 commented 3 years ago

Just out of curiosity - were these processCcd runs on Haswell or KNL and are there some average run times?

jchiang87 commented 3 years ago

On Haswell. Most runtimes are between 2 and 3 minutes. The logs can be grepped:

grep ^real /global/cscratch1/sd/descdm/DC2/DR2/logging/processCcd*.log
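To turn those log entries into an average runtime, here is a small sketch that parses the `real` lines, assuming the logs record bash time output (e.g. `real 2m31.402s`).

```python
import glob
import re

# Matches bash `time` output lines such as "real    2m31.402s".
pattern = re.compile(r"^real\s+(\d+)m([\d.]+)s", re.MULTILINE)

times = []
for log in glob.glob("/global/cscratch1/sd/descdm/DC2/DR2/logging/processCcd*.log"):
    with open(log) as f:
        for m in pattern.finditer(f.read()):
            times.append(60 * int(m.group(1)) + float(m.group(2)))

print(f"{len(times)} runs, mean {sum(times) / len(times) / 60:.1f} min")
```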
heather999 commented 3 years ago

Concerning the Run2.2i Y1 warps, I have a list of all expected warps (though some were removed at CC). I'm working on transferring the existing ones to NERSC and hope that will be completed in the next day or so.

jchiang87 commented 3 years ago

The Run2.2i processCcd.py outputs have been symlinked from

/global/cscratch1/sd/descdm/DC2/Run2.2i-parsl/v19.0.0-v1/rerun/run2.2i-calexp-v1-copy

into the DR2 repo.

I've run skyCorrection.py on a visit in each band that contains Run3.1i data and differenced the resulting images with the corresponding Run2.2i skyCorr images. They differ in pixel values by less than 0.2 ADU at most, with the mean and median pixel values of the differenced images typically much less than 0.02 ADU. Here are histograms showing the distribution of minimum, maximum, mean, and median values of the pixels in those per-CCD difference images:

[Figure: skyCorr_diff_image_pixel_stats_Run3.1i_vs_Run2.2i]

Since the Run3.1i data just have a relatively small number of point sources added versus the Run2.2i versions, we'd expect the Run3.1i skyCorr image to be essentially the same as for Run2.2i. So for DR2, we should simply use the Run2.2i skyCorr data.
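For reference, a minimal sketch of the per-CCD difference check described above, done with astropy rather than the DM stack; the file paths are placeholders, and which HDU holds the skyCorr image data is an assumption.

```python
import numpy as np
from astropy.io import fits

def diff_stats(run31_path, run22_path, hdu=0):
    """Min/max/mean/median of the pixel-by-pixel difference (ADU)."""
    with fits.open(run31_path) as f1, fits.open(run22_path) as f2:
        diff = f1[hdu].data.astype(float) - f2[hdu].data.astype(float)
    return diff.min(), diff.max(), diff.mean(), np.median(diff)

# Placeholder paths for the same sensor-visit in the two reruns.
stats = diff_stats(
    "dr2-calexp/skyCorr/00188998-y/R22/skyCorr_00188998-y-R22-S00-det090.fits",
    "run2.2i-calexp/skyCorr/00188998-y/R22/skyCorr_00188998-y-R22-S00-det090.fits",
)
print("min/max/mean/median [ADU]:", stats)
```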

heather999 commented 3 years ago

The skyMap has been symlinked from /global/cscratch1/sd/descdm/DC2/Run2.2i-parsl/v19.0.0-v1/rerun/run2.2i-calexp-v1-copy/deepCoadd to /global/cscratch1/sd/descdm/DC2/DR2/repo/rerun/dr2-calexp/deepCoadd. I needed the skyMap to be able to run tract2visit_mapper.py to produce the tract2visit sqlite3 database, as discussed at this week's DESC DM meeting. The tract2visit_mapper is running now, and I'll add the sqlite DB to the /global/cscratch1/sd/descdm/DC2/DR2/repo/rerun/dr2-calexp/ directory when it's finished.

heather999 commented 3 years ago

Slack discussions concerning configuration parameters and handling the pre-existing warps in the workflow.

heather999 commented 3 years ago

As discussed on Slack, there are 3 visits which are not in the simulated Run2.2i data and include just a couple of sensors in Run3.1i. These are:

find . -xtype l
./00191341-z
./00183810-g
./00207760-y

I'm removing these symlinks in /global/cscratch1/sd/descdm/DC2/DR2/repo/rerun/dr2-calexp/.

heather999 commented 3 years ago

Ran ImageProcessingPipelines/python/util/tract2visit_mapper.py from the dr2/run2.2 branch to produce the tracts_mapping.sqlite3 DB that will be used for the coadd processing. The input visit list was constructed using the list of visits in the dr2-calexp/calexp directory. The resulting sqlite3 DB contains 3018 visits, which matches the number of calexp visits in /global/cscratch1/sd/descdm/DC2/DR2/repo/rerun/dr2-calexp/. The sqlite3 file has been copied to this directory as well as to /global/cscratch1/sd/descdm/DC2/DR2/repo/rerun/dr2-coadd, which is exactly what CC does for its processing.
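A quick consistency check one could run against that DB; note the table and column names below ("overlaps", "visit") are assumptions about the tract2visit_mapper.py output schema, not verified against the code.

```python
import sqlite3

conn = sqlite3.connect(
    "/global/cscratch1/sd/descdm/DC2/DR2/repo/rerun/dr2-coadd/tracts_mapping.sqlite3"
)
# Count distinct visits recorded in the tract/visit overlap table
# (assumed schema) and compare to the calexp visit count.
(n_visits,) = conn.execute(
    "SELECT COUNT(DISTINCT visit) FROM overlaps"
).fetchone()
assert n_visits == 3018, n_visits
```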

heather999 commented 3 years ago

I moved the copy of CC's Run2.2i warps into /global/cscratch1/sd/descdm/DC2/DR2/repo/rerun/dr2-coadd/deepCoadd. Due to the magic of Globus, these files are owned by desc, but I set the ACLs to allow descdm full access to all the files/directories. Just let me know if you see any problems.

There was already a /global/cscratch1/sd/descdm/DC2/DR2/repo/rerun/dr2-coadd/deepCoadd directory, which I moved aside and renamed to /global/cscratch1/sd/descdm/DC2/DR2/repo/rerun/dr2-coadd/deepCoadd-testing. Not that we need it, but it did contain what looked like a test warp for a visit for a particular patch in tract 5064, i-band.

jchiang87 commented 3 years ago

As I noted in slack, we can just delete that warp data for tract 5064.

TomGlanzman commented 3 years ago

Initial conditions prior to beginning the Parsl-based DR2 processing:

- Run account: descdm
- Butler (Gen2) repo: /global/cscratch1/sd/descdm/DC2/DR2/repo
- /rerun naming: dr2-{calexp,coadd,multiband,metadata}
- Workflow code: /global/cscratch1/sd/descdm/ParslRun/ImageProcessingPipelines
- git repo: https://github.com/LSSTDESC/ImageProcessingPipelines/tree/dc2/run2.1
- Run directory: /global/cscratch1/sd/descdm/ParslRun/dr2

Within the Butler repo:

- Participating visits (Y1 WFD): 3018
- Pre-existing warp*.fits files: 448,966
- Space occupied by pre-existing warp files: 73 TB

State of $SCRATCH space:

```
FILESYSTEM   SPACE_USED   SPACE_QUOTA   SPACE_PCT   INODE_USED   INODE_QUOTA   INODE_PCT
cscratch1    5.33TiB      250.00TiB     2.1%        7.74M        20.00M        38.7%
```
TomGlanzman commented 3 years ago

Saturday morning report.

An initial test of the Parsl DR2 workflow, processing tract 5063 (in the DDF region), started last night and continues to run, but it has already revealed a number of failures. One failing makeCoaddTempExp.py invocation was:

makeCoaddTempExp.py /global/cscratch1/sd/descdm/DC2/DR2/repo/rerun/dr2-calexp --output /global/cscratch1/sd/descdm/DC2/DR2/repo/rerun/dr2-coadd --id tract=5063 patch=3,1 filter=y --selectId visit=188998 --configfile /opt/lsst/software/stack/obs_lsst/config//makeCoaddTempExp.py --calib /global/cscratch1/sd/descdm/DC2/DR2/repo/CALIB

which generated this error:

makeCoaddTempExp FATAL: Failed on dataId=DataId(initialdata={'tract': 5063, 'patch': '3,1', 'filter': 'y'}, tag=set()): NoResults: No locations for get: datasetType:skyCorr dataId:DataId(initialdata={'visit': 188998, 'filter': 'y', 'raftName': 'R22', 'detectorName': 'S02', 'detector': 92, 'tract': 5063}, tag=set())

Ref: /global/cscratch1/sd/descdm/ParslRun/dr2/runinfo/000/dm-logs/coadd_for_tract_5063_patch_4-1_filter_y-visit-188998.{stdout,stderr}

Error list for makeCoaddTempExp:

tract_5063_patch_3-1_filter_y-visit-188998
tract_5063_patch_3-2_filter_y-visit-188998
tract_5063_patch_4-1_filter_y-visit-188998
tract_5063_patch_4-2_filter_y-visit-188998
tract_5063_patch_6-0_filter_i-visit-196476

There were also forcedPhotCoadd.py failures. An example:

forcedPhotCoadd.py /global/cscratch1/sd/descdm/DC2/DR2/repo/rerun/dr2-multiband --output /global/cscratch1/sd/descdm/DC2/DR2/repo/rerun/dr2-multiband --id tract=5063 patch=6,6 filter=u --configfile /opt/lsst/software/stack/obs_lsst/config//forcedPhotCoadd.py

which generated this error:

forcedPhotCoadd FATAL: Failed on dataId=DataId(initialdata={'tract': 5063, 'patch': '6,6', 'filter': 'u'}, tag=set()): NoResults: No locations for get: datasetType:deepCoadd_meas dataId:DataId(initialdata={'tract': 5063, 'patch': '6,6', 'filter': 'u'}, tag=set())

Ref: /global/cscratch1/sd/descdm/ParslRun/dr2/runinfo/000/dm-logs/multiband_for_tract_5063_patch_6-6-filter-u-forced_phot_coadd.{stdout,stderr}

heather999 commented 3 years ago

Just looking at the first error message:

makeCoaddTempExp FATAL: Failed on dataId=DataId(initialdata={'tract': 5063, 'patch': '3,1', 'filter': 'y'}, tag=set()): NoResults: No locations for get: datasetType:skyCorr dataId:DataId(initialdata={'visit': 188998, 'filter': 'y', 'raftName': 'R22', 'detectorName': 'S02', 'detector': 92, 'tract': 5063}, tag=set())

visit 188998 is one of the visits from Run3.1i that was also in Run2.2i, but there is no skyCorr data for R22 S02:

/global/cscratch1/sd/descdm/DC2/DR2/repo/rerun/dr2-calexp/skyCorr> ls 00188998-y/R22/
skyCorr_00188998-y-R22-S00-det090.fits  skyCorr_00188998-y-R22-S12-det095.fits
skyCorr_00188998-y-R22-S01-det091.fits  skyCorr_00188998-y-R22-S20-det096.fits
skyCorr_00188998-y-R22-S10-det093.fits  skyCorr_00188998-y-R22-S21-det097.fits
skyCorr_00188998-y-R22-S11-det094.fits  skyCorr_00188998-y-R22-S22-det098.fits

so we may need to go back and check which Run3.1i visits have raft/sensor combinations that still need skyCorr data generated; see the sketch below.
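A sketch of such a check, assuming parallel calexp/ and skyCorr/ trees under the rerun with the naming convention shown in the ls output above (e.g. skyCorr_00188998-y-R22-S02-det092.fits):

```python
import glob
import os

rerun = "/global/cscratch1/sd/descdm/DC2/DR2/repo/rerun/dr2-calexp"
missing = []
# For every calexp file, verify that the matching skyCorr file exists.
for calexp in glob.glob(os.path.join(rerun, "calexp/*/*/calexp_*.fits")):
    skycorr = calexp.replace("/calexp/", "/skyCorr/").replace(
        "calexp_", "skyCorr_")
    if not os.path.exists(skycorr):
        missing.append(os.path.basename(skycorr))

print(len(missing), "missing skyCorr files")
```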

jchiang87 commented 3 years ago

Checking against all of the Run3.1i calexps, that skyCorr file (the one for 00188998-y-R22-S02) is the only one missing. However, when I run skyCorrection.py for that sensor-visit, it fails with

skyCorr FATAL: Failed on dataId={'visit': 188998, 'raftName': 'R22', 'detectorName': 'S02', 'filter': 'y', 'detector': 92}: InvalidParameterError: 
  File "src/math/LeastSquares.cc", line 421, in void lsst::afw::math::LeastSquares::_factor(bool)
    Number of columns of design matrix (1) must be smaller than number of data points (0) {0}
lsst::pex::exceptions::InvalidParameterError: 'Number of columns of design matrix (1) must be smaller than number of data points (0)'

Traceback (most recent call last):
  File "/opt/lsst/software/stack/stack/miniconda3-4.7.10-4d7b902/Linux64/pipe_base/19.0.0/python/lsst/pipe/base/cmdLineTask.py", line 388, in __call__
    result = self.runTask(task, dataRef, kwargs)
  File "/opt/lsst/software/stack/stack/miniconda3-4.7.10-4d7b902/Linux64/pipe_base/19.0.0/python/lsst/pipe/base/cmdLineTask.py", line 447, in runTask
    return task.runDataRef(dataRef, **kwargs)
  File "/opt/lsst/software/stack/stack/miniconda3-4.7.10-4d7b902/Linux64/pipe_drivers/19.0.0+2/python/lsst/pipe/drivers/skyCorrection.py", line 229, in runDataRef
    scale = self.sky.solveScales(measScales)
  File "/opt/lsst/software/stack/stack/miniconda3-4.7.10-4d7b902/Linux64/pipe_drivers/19.0.0+2/python/lsst/pipe/drivers/background.py", line 346, in solveScales
    return solve(mask)
  File "/opt/lsst/software/stack/stack/miniconda3-4.7.10-4d7b902/Linux64/pipe_drivers/19.0.0+2/python/lsst/pipe/drivers/background.py", line 334, in solve
    afwMath.LeastSquares.DIRECT_SVD).getSolution()
lsst.pex.exceptions.wrappers.InvalidParameterError: 
  File "src/math/LeastSquares.cc", line 421, in void lsst::afw::math::LeastSquares::_factor(bool)
    Number of columns of design matrix (1) must be smaller than number of data points (0) {0}
lsst::pex::exceptions::InvalidParameterError: 'Number of columns of design matrix (1) must be smaller than number of data points (0)'

which would be consistent with the skyCorr file also being missing in the Run2.2i data. There are probably other missing skyCorr files like this in the Run2.2i data, so I think we'll have to deal with them in a similar fashion to what the SRS pipeline does; see the sketch below.
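As an illustration of that kind of filtering (a sketch of the approach, not the SRS pipeline's actual code): only feed makeCoaddTempExp.py the sensor-visits whose skyCorr dataset actually exists, using the Gen2 Butler. The candidate dataId list here is a placeholder.

```python
from lsst.daf.persistence import Butler

butler = Butler("/global/cscratch1/sd/descdm/DC2/DR2/repo/rerun/dr2-calexp")

def has_skycorr(data_id):
    """True if the skyCorr dataset exists for this sensor-visit."""
    return butler.datasetExists("skyCorr", dataId=data_id)

# Placeholder candidate list; in practice this would come from the
# warp inputs for a given tract/patch.
candidates = [
    {"visit": 188998, "filter": "y", "raftName": "R22", "detectorName": "S02"},
]
usable = [d for d in candidates if has_skycorr(d)]
print(len(usable), "of", len(candidates), "sensor-visits have skyCorr")
```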

heather999 commented 3 years ago

I was trying to see how the SRS pipeline deals with this. What I did find is this Slack conversation, and more recently this one, which describes Johann's implementation. Here's the link to the code in IPP's setup_coaddDriver.

johannct commented 3 years ago

This has been discussed several times, in several channels, including in #desc-dc2-workflows : https://lsstc.slack.com/archives/CFL9N02MR/p1593503722274800

jchiang87 commented 3 years ago

In finishing off the last patch that failed in the multiband processing for DR2, Tom encountered the following error:

measureCoaddSources.propagateFlags INFO: Propagating flags dict_keys(['calib_psf_candidate', 'calib_psf_used', 'calib_psf_reserved', 'calib_astrometry_used', 'calib_photometry_used', 'calib_photometry_reserved']) from inputs
measureCoaddSources FATAL: Failed on dataId=DataId(initialdata={'tract': 5062, 'patch': '0,2', 'filter': 'u'}, tag=set()):
 NoResults: No locations for get: datasetType:src dataId:DataId(initialdata={'visit': 217577, 'detector': 132}, tag=set())
Tue Nov 17 15:53:40 PST 2020 wrap-shifter: executable finished with return code 1

Here the measureCoaddSources.py task is looking for flags from the src catalog outputs from the single frame processing (sfp) of visit 217577, detector 132 (R31_S20), but that file is not present in the DR2 repo. This sensor-visit is in one of the rafts that straddle the DDF boundary in that visit, so the sfp output folders for that raft would contain data products for both Run3.1i and Run2.2i sensor-visits. The Run3.1i data were ingested into the DR2 repo directly and sfp was run on them, so those sfp outputs have physical locations in the DR2 repo. For the Run2.2i data, rather than ingesting everything from scratch and re-doing all the processing, we planned to symlink the existing sfp data products into the desired folders. This script was run to make those symlinks.

Unfortunately, after looking at other raft-visits that straddle DDF boundary, I found a number of other cases where there are missing Run2.2i sensors. Here is a plot of all of the missing sensor-visits (shown in red) for raft-visits that overlap with the DDF:

[Figure: DR2_missing_Run2.2i_sensor_visits]

The two missing sensor-visits within the DDF boundary are the ones I noted here and so are expected to be missing.

The reason that those other missing sensor-visits didn't trigger the same error as the one noted above is that the warps for those patches were based on the data in the DR2 repo, so the warps (and associated coadds) didn't expect to find those sensor-visits. The coadd that triggered the error included a pre-existing warp file that was copied over from CC-IN2P3, where that sensor-visit was present.

The next step is to make those missing symlinks, regenerate the warps for the affected patches, and then redo the coadd and multiband processing for those patches.

I've attached a file with the list of the 137 affected patches.
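For the warp regeneration, one could script the makeCoaddTempExp.py invocations over the affected patches, reusing the command form shown earlier in this thread; the patch-list file name and its one-"tract patch filter"-triple-per-line format are assumptions about the attached file.

```python
REPO = "/global/cscratch1/sd/descdm/DC2/DR2/repo"

# Emit one makeCoaddTempExp.py command per affected patch; without a
# --selectId restriction, all overlapping visits in the repo are used.
with open("affected_patches.txt") as f:  # hypothetical file name
    for line in f:
        tract, patch, band = line.split()
        print(
            f"makeCoaddTempExp.py {REPO}/rerun/dr2-calexp "
            f"--output {REPO}/rerun/dr2-coadd "
            f"--id tract={tract} patch={patch} filter={band} "
            f"--configfile /opt/lsst/software/stack/obs_lsst/config/makeCoaddTempExp.py "
            f"--calib {REPO}/CALIB"
        )
```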