LSSTDESC / DC2-production

Configuration, production, validation specifications and tools for the DC2 Data Set.
BSD 3-Clause "New" or "Revised" License

Planning a move to using DM Science Pipelines writeObjectTask.py -> Parquet #380

Closed wmwv closed 2 years ago

wmwv commented 4 years ago

In November 2019 the following ticket was merged and the DM Science Pipelines now have the ability to generate Parquet files that are closer to the planned Science Data Model (SDM) output (i.e., following the DPDD).

https://jira.lsstcorp.org/browse/DM-16234

This adds the ability to

  1. Merge the individual coadd files into an Object Table that just combines the individual measurements. This is currently what DESC's DC2-production merge_object_table.py does.

  2. Transform that Object Table into a DPDD format. This is currently what DESC's DC2-production write_gcr_to_parquet.py does.

  3. Put together the patches. This is currently what DESC's DC2-production merge_parquet_files.py does and we do it before step 2.

In the DM Science Pipelines, these steps are implemented as follows:

  1. The merging is done by WriteObjectTableTask: https://github.com/lsst/pipe_tasks/blob/master/python/lsst/pipe/tasks/postprocess.py#L78

  2. The column names are then transformed by transformObjectCatalog.py: https://github.com/lsst/pipe_tasks/blob/master/python/lsst/pipe/tasks/postprocess.py#L498

  3. Finally, the patch-level object tables are merged to tract level by consolidateObjectTable.py: https://github.com/lsst/pipe_tasks/blob/master/python/lsst/pipe/tasks/postprocess.py#L589

(The Python command-line tasks above are simple wrappers that call the Tasks linked above, which is why I link to the real code in postprocess.py instead of the trivial command-line scripts.)
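To make the data flow concrete, the three stages (merge per-band measurements, rename columns, concatenate patches) can be mimicked with plain Python dictionaries. This is only a toy sketch: the function names, the `g_flux`-style column names, and the dict layout are invented for illustration and are not the pipe_tasks API.

```python
# Toy stand-in for the three stages; plain dicts replace the real catalogs and Tasks.

def merge_bands(per_band):
    """Stage 1: combine per-band measurements of one patch into one row per object.

    per_band maps band -> {object_id: {column: value}}.
    Returns {object_id: {"<band>_<column>": value}}.
    """
    merged = {}
    for band, objects in per_band.items():
        for object_id, columns in objects.items():
            row = merged.setdefault(object_id, {})
            for name, value in columns.items():
                row[f"{band}_{name}"] = value
    return merged


def transform(rows, renames):
    """Stage 2: rename columns via a mapping; in this toy, unmapped columns are dropped."""
    return {
        object_id: {renames[k]: v for k, v in row.items() if k in renames}
        for object_id, row in rows.items()
    }


def consolidate(patches):
    """Stage 3: combine patch-level tables into one tract-level table."""
    tract = {}
    for patch_rows in patches:
        tract.update(patch_rows)
    return tract


# Two hypothetical patches, each with one object measured in g and r.
patch_a = merge_bands({"g": {1: {"flux": 10.0}}, "r": {1: {"flux": 12.0}}})
patch_b = merge_bands({"g": {2: {"flux": 20.0}}, "r": {2: {"flux": 21.0}}})
renames = {"g_flux": "gPsFlux", "r_flux": "rPsFlux"}  # hypothetical output names
table = consolidate(transform(p, renames) for p in (patch_a, patch_b))
```

Note that this toy applies the transform before consolidating, which is the DM ordering; the current DESC scripts consolidate first.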

The translations applied in step 2 are specified by a YAML file. E.g., for HSC this is

https://github.com/lsst/obs_subaru/blob/master/policy/Object.yaml
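One benefit of driving the translation from a specification file is that the per-band repetition can be generated rather than typed out. The sketch below illustrates that idea with plain Python dicts. It does not reproduce the actual Object.yaml schema; the input column names, the `<band>_<column>` input convention, and the helper name are all assumptions made for illustration.

```python
# Hypothetical compact specification; this is NOT the Object.yaml schema.
BANDS = ("u", "g", "r", "i", "z", "y")
SHARED = {"coord_ra": "Ra", "coord_dec": "Dec"}  # copied once per object
PER_BAND = {"psf_flux": "PsFlux", "psf_flux_err": "PsFluxErr"}  # repeated per band


def expand_translations(shared, per_band, bands):
    """Expand the compact spec into an explicit {input_column: output_column} map.

    Per-band inputs are assumed to be named "<band>_<column>"; per-band outputs
    get the band prepended with no separator.
    """
    mapping = dict(shared)
    for band in bands:
        for source, target in per_band.items():
            mapping[f"{band}_{source}"] = f"{band}{target}"
    return mapping


translations = expand_translations(SHARED, PER_BAND, BANDS)
```

With six bands and two per-band quantities, the explicit map has 14 entries, which gives a feel for how quickly a hand-written list grows.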

An example use on HSC data is given in the JIRA ticket:

writeObjectTable.py /datasets/hsc/repo --rerun RC/w_2019_34/DM-21091:private/<user>/sdm_output --id tract=9697 patch=3,3^3,4^3,5 filter=HSC-G^HSC-R^HSC-I^HSC-Z^HSC-Y -j 3
transformObjectCatalog.py /datasets/hsc/repo --rerun private/<user>/sdm_output --id tract=9697 patch=3,3^3,4^3,5
consolidateObjectTable.py /datasets/hsc/repo --rerun private/<user>/sdm_output --id tract=9697 patch=3,3^3,4^3,5

Here <user> should be replaced with an actual user name; more specifically, private/<user>/sdm_output should be the actual name of a prospective rerun.
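When running these three calls over many tracts, it may help to generate the command lines rather than edit them by hand. The helper below is a hypothetical sketch that only formats strings in the same shape as the example above (values joined with `^`, input and output reruns joined with `:`); the function names and the `someuser` rerun are placeholders, and nothing here invokes the pipeline.

```python
def data_id(tract, patches, filters=None):
    """Format the --id arguments; multiple values are joined with '^'."""
    parts = [f"tract={tract}", "patch=" + "^".join(patches)]
    if filters:
        parts.append("filter=" + "^".join(filters))
    return " ".join(parts)


def build_commands(repo, input_rerun, output_rerun, tract, patches, filters, jobs=1):
    """Return the three command lines, in execution order, as strings."""
    ids = data_id(tract, patches)
    return [
        f"writeObjectTable.py {repo} --rerun {input_rerun}:{output_rerun} "
        f"--id {data_id(tract, patches, filters)} -j {jobs}",
        f"transformObjectCatalog.py {repo} --rerun {output_rerun} --id {ids}",
        f"consolidateObjectTable.py {repo} --rerun {output_rerun} --id {ids}",
    ]


commands = build_commands(
    repo="/datasets/hsc/repo",
    input_rerun="RC/w_2019_34/DM-21091",
    output_rerun="private/someuser/sdm_output",  # placeholder rerun name
    tract=9697,
    patches=["3,3", "3,4", "3,5"],
    filters=["HSC-G", "HSC-R", "HSC-I", "HSC-Z", "HSC-Y"],
    jobs=3,
)
```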

To use this

  1. [ ] We should implement an Object.yaml in obs_lsst. This should be done in coordination with DM. There's a mildly tedious level of manual specification of filter names, but in practice it would probably take more time to complain about how tedious it is than to just do it.
  2. [ ] We should test it.
  3. [ ] consolidateObjectTable.py currently concatenates the patch tables in a pandas DataFrame and then saves the result. @johannct ran into out-of-memory errors when trying to do this for DC2. We should ask whether the butler put operation can directly support pyarrow's appending functionality.
  4. [ ] For providing these files more broadly to DESC, we should then lift these files out of the Butler repo. This should be relatively trivial. They'll all be in a simple directory with rational names.
  5. [ ] We should check the Object.yaml against our additional identified convenience column names or additional unique content names desired in DESC.

I anticipate that this process will take several months. We should first process Run 2.2i with our current merge_*.py + GCR infrastructure. Then we should test the DM Science Pipelines methods and compare with our results. This will likely result in catching silly bugs and identifying key additional columns we'd like to see added. We can then have discussions about whether these should be added in the obs_lsst Object.yaml file or if there will be an additional layer of processing later done by DESC.

wmwv commented 4 years ago

Note that there is not yet equivalent pipe_tasks functionality for Source or Forced Source tables (or the DIA versions of either).

johannct commented 4 years ago

the example looks like gen2 syntax.....

wmwv commented 4 years ago

Yes, it is.

johannct commented 4 years ago

I just tried this on my tract 3828 testbench with w47 and the patches that @hsinfang put in https://github.com/lsst/obs_lsst/compare/u/hfc/DM-21821. An interactive run on 9 cores took less than 10 minutes to execute the 3 calls and produced a DPDD Parquet file at /sps/lsst/dataproducts/desc/DC2/Run2.2i/w_2019_47-v5/rerun/t3828_testdpdd_dm/deepCoadd-results/merged/3828/objectTable-3828.parq. The column names are of course very different from what the DC2-production merge script outputs; I list them here for convenience:

ApFlux_flag ApFlux_flag_apertureTruncated ApFlux_flag_sincCoeffsTruncated ApFlux_instFlux ApFlux_instFluxErr CalibFlux_flag CalibFlux_flag_apertureTruncated CalibFlux_flag_sincCoeffsTruncated CalibFlux_instFlux CalibFlux_instFluxErr Centroid_flag_almostNoSecondDerivative Centroid_flag_edge Centroid_flag_noSecondDerivative Centroid_flag_notAtMaximum Centroid_flag_resetToPeak Dec Extendedness_flag InputCount_flag InputCount_flag_noInputs KronFlux_apCorr KronFlux_apCorrErr KronFlux_flag KronFlux_flag_apCorr KronFlux_flag_bad_radius KronFlux_flag_bad_shape KronFlux_flag_bad_shape_no_psf KronFlux_flag_edge KronFlux_flag_no_fallback_radius KronFlux_flag_no_minimum_radius KronFlux_flag_small_radius KronFlux_flag_used_minimum_radius KronFlux_flag_used_psf_radius PixelFlags PixelFlags_bad PixelFlags_bright_object PixelFlags_bright_objectCenter PixelFlags_clipped PixelFlags_clippedCenter PixelFlags_cr PixelFlags_crCenter PixelFlags_edge PixelFlags_inexact_psf PixelFlags_inexact_psfCenter PixelFlags_interpolated PixelFlags_interpolatedCenter PixelFlags_offimage PixelFlags_saturated PixelFlags_saturatedCenter PixelFlags_sensor_edge PixelFlags_sensor_edgeCenter PixelFlags_suspect PixelFlags_suspectCenter PsfFlux_apCorr PsfFlux_apCorrErr PsfFlux_flag PsfFlux_flag_apCorr PsfFlux_flag_edge PsfFlux_flag_noGoodPixels PsfShape_flag PsfShape_flag_no_pixels PsfShape_flag_not_contained PsfShape_flag_parent_source Ra ShapeRound_Flux ShapeRound_flag ShapeRound_flag_no_pixels ShapeRound_flag_not_contained ShapeRound_flag_parent_source ShapeRound_x ShapeRound_xx ShapeRound_xy ShapeRound_y ShapeRound_yy Shape_flag Shape_flag_no_pixels Shape_flag_not_contained Shape_flag_parent_source calib_astrometry_used calib_photometry_reserved calib_photometry_used calib_psf_candidate calib_psf_reserved calib_psf_used coord_dec coord_ra detect_isPatchInner detect_isPrimary detect_isTractInner grStd izStd lsst_g_smearedBdChi2 lsst_g_smearedBdE1 lsst_g_smearedBdE2 lsst_g_smearedBdFluxB 
lsst_g_smearedBdFluxBErr lsst_g_smearedBdFluxD lsst_g_smearedBdFluxDErr lsst_g_smearedBdReB lsst_g_smearedBdReD lsst_g_smearedCModelFlux lsst_g_smearedCModelFluxErr lsst_g_smearedExtendedness lsst_g_smearedFwhm lsst_g_smearedHsmShapeRegauss_e1 lsst_g_smearedHsmShapeRegauss_e2 lsst_g_smearedHsmShapeRegauss_flag lsst_g_smearedInputCount lsst_g_smearedIxx lsst_g_smearedIxxPsf lsst_g_smearedIxy lsst_g_smearedIxyPsf lsst_g_smearedIyy lsst_g_smearedIyyPsf lsst_g_smearedKronFlux lsst_g_smearedKronFluxErr lsst_g_smearedKronRad lsst_g_smearedPsFlux lsst_g_smearedPsFluxErr lsst_i_smearedBdChi2 lsst_i_smearedBdE1 lsst_i_smearedBdE2 lsst_i_smearedBdFluxB lsst_i_smearedBdFluxBErr lsst_i_smearedBdFluxD lsst_i_smearedBdFluxDErr lsst_i_smearedBdReB lsst_i_smearedBdReD lsst_i_smearedCModelFlux lsst_i_smearedCModelFluxErr lsst_i_smearedExtendedness lsst_i_smearedFwhm lsst_i_smearedHsmShapeRegauss_e1 lsst_i_smearedHsmShapeRegauss_e2 lsst_i_smearedHsmShapeRegauss_flag lsst_i_smearedInputCount lsst_i_smearedIxx lsst_i_smearedIxxPsf lsst_i_smearedIxy lsst_i_smearedIxyPsf lsst_i_smearedIyy lsst_i_smearedIyyPsf lsst_i_smearedKronFlux lsst_i_smearedKronFluxErr lsst_i_smearedKronRad lsst_i_smearedPsFlux lsst_i_smearedPsFluxErr lsst_r_smearedBdChi2 lsst_r_smearedBdE1 lsst_r_smearedBdE2 lsst_r_smearedBdFluxB lsst_r_smearedBdFluxBErr lsst_r_smearedBdFluxD lsst_r_smearedBdFluxDErr lsst_r_smearedBdReB lsst_r_smearedBdReD lsst_r_smearedCModelFlux lsst_r_smearedCModelFluxErr lsst_r_smearedExtendedness lsst_r_smearedFwhm lsst_r_smearedHsmShapeRegauss_e1 lsst_r_smearedHsmShapeRegauss_e2 lsst_r_smearedHsmShapeRegauss_flag lsst_r_smearedInputCount lsst_r_smearedIxx lsst_r_smearedIxxPsf lsst_r_smearedIxy lsst_r_smearedIxyPsf lsst_r_smearedIyy lsst_r_smearedIyyPsf lsst_r_smearedKronFlux lsst_r_smearedKronFluxErr lsst_r_smearedKronRad lsst_r_smearedPsFlux lsst_r_smearedPsFluxErr lsst_u_smearedBdChi2 lsst_u_smearedBdE1 lsst_u_smearedBdE2 lsst_u_smearedBdFluxB lsst_u_smearedBdFluxBErr lsst_u_smearedBdFluxD 
lsst_u_smearedBdFluxDErr lsst_u_smearedBdReB lsst_u_smearedBdReD lsst_u_smearedCModelFlux lsst_u_smearedCModelFluxErr lsst_u_smearedExtendedness lsst_u_smearedFwhm lsst_u_smearedHsmShapeRegauss_e1 lsst_u_smearedHsmShapeRegauss_e2 lsst_u_smearedHsmShapeRegauss_flag lsst_u_smearedInputCount lsst_u_smearedIxx lsst_u_smearedIxxPsf lsst_u_smearedIxy lsst_u_smearedIxyPsf lsst_u_smearedIyy lsst_u_smearedIyyPsf lsst_u_smearedKronFlux lsst_u_smearedKronFluxErr lsst_u_smearedKronRad lsst_u_smearedPsFlux lsst_u_smearedPsFluxErr lsst_y_smearedBdChi2 lsst_y_smearedBdE1 lsst_y_smearedBdE2 lsst_y_smearedBdFluxB lsst_y_smearedBdFluxBErr lsst_y_smearedBdFluxD lsst_y_smearedBdFluxDErr lsst_y_smearedBdReB lsst_y_smearedBdReD lsst_y_smearedCModelFlux lsst_y_smearedCModelFluxErr lsst_y_smearedExtendedness lsst_y_smearedFwhm lsst_y_smearedHsmShapeRegauss_e1 lsst_y_smearedHsmShapeRegauss_e2 lsst_y_smearedHsmShapeRegauss_flag lsst_y_smearedInputCount lsst_y_smearedIxx lsst_y_smearedIxxPsf lsst_y_smearedIxy lsst_y_smearedIxyPsf lsst_y_smearedIyy lsst_y_smearedIyyPsf lsst_y_smearedKronFlux lsst_y_smearedKronFluxErr lsst_y_smearedKronRad lsst_y_smearedPsFlux lsst_y_smearedPsFluxErr lsst_z_smearedBdChi2 lsst_z_smearedBdE1 lsst_z_smearedBdE2 lsst_z_smearedBdFluxB lsst_z_smearedBdFluxBErr lsst_z_smearedBdFluxD lsst_z_smearedBdFluxDErr lsst_z_smearedBdReB lsst_z_smearedBdReD lsst_z_smearedCModelFlux lsst_z_smearedCModelFluxErr lsst_z_smearedExtendedness lsst_z_smearedFwhm lsst_z_smearedHsmShapeRegauss_e1 lsst_z_smearedHsmShapeRegauss_e2 lsst_z_smearedHsmShapeRegauss_flag lsst_z_smearedInputCount lsst_z_smearedIxx lsst_z_smearedIxxPsf lsst_z_smearedIxy lsst_z_smearedIxyPsf lsst_z_smearedIyy lsst_z_smearedIyyPsf lsst_z_smearedKronFlux lsst_z_smearedKronFluxErr lsst_z_smearedKronRad lsst_z_smearedPsFlux lsst_z_smearedPsFluxErr objectId parentObjectId patch patchId refBand refExtendedness refFwhm refIxx refIxxPsf refIxy refIxyPsf refIyy refIyyPsf riStd tract tractId x xErr xy_flag y yErr zyStd

johannct commented 4 years ago

Running write_gcr_to_parquet.py on the same test tract production, after commenting out the offending lines described in https://github.com/LSSTDESC/DC2-production/issues/381, I get the following columns:

magerr_r psf_fwhm_z psFlux_i IxxPSF_g psFlux_flag_u psf_fwhm_g cModelFluxErr_z cModelFluxErr_g psf_fwhm_u I_flag_g mag_y_cModel mag_u magerr_u_cModel psFlux_u cModelFlux_flag_i cModelFlux_z IxxPSF_z mag_r_cModel magerr_y_cModel IxxPSF_y psFluxErr_r psFluxErr_g psFluxErr_i Ixy_i psFlux_z Ixx_u IyyPSF_z I_flag_y Ixx_r cModelFluxErr_r IyyPSF_r psFlux_flag_y psFluxErr_u Ixy_r Ixy_g xy_flag Ixy_u mag_z psFlux_flag_r Ixy IxxPSF Ixx_y I_flag Iyy_g cModelFlux_flag_y parentObjectId magerr_z_cModel cModelFlux_flag_g Iyy Ixx psFlux_y Ixy_y magerr_i_cModel Iyy_i psFlux_flag_g IxxPSF_i mag_z_cModel Iyy_r yErr psFluxErr_y I_flag_z Iyy_u IyyPSF_u blendedness psFlux_r mag_i y cModelFlux_flag_u IxyPSF_u objectId cModelFlux_i x IxyPSF_y good IxyPSF_r Ixy_z Ixx_g cModelFlux_flag_r mag_i_cModel psFlux_flag_z Ixx_i IxyPSF_z magerr_g IyyPSF cModelFlux_y cModelFlux_g mag_u_cModel mag_y I_flag_u cModelFluxErr_y IxyPSF_g psf_fwhm_i I_flag_r magerr_y mag_g_cModel mag_g magerr_u IxyPSF_i ra IyyPSF_y psFluxErr_z magerr_g_cModel cModelFluxErr_i Iyy_y dec Iyy_z cModelFluxErr_u psNdata xErr psFlux_flag_i IyyPSF_i magerr_r_cModel cModelFlux_u psFlux_g cModelFlux_flag_z mag_r I_flag_i cModelFlux_r magerr_i IxyPSF psf_fwhm_y clean IyyPSF_g IxxPSF_r magerr_z psf_fwhm_r extendedness IxxPSF_u Ixx_z
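A quick way to quantify how different the two column sets are is to compare them as sets, treating names that differ only in case (such as `Ra` and `ra`) separately. The helper below is a hypothetical sketch; the short lists are a hand-picked subset of the two column listings above, not the full sets.

```python
def compare_columns(dm_columns, desc_columns):
    """Classify column names as shared, differing only in case, or unique to one side."""
    dm, desc = set(dm_columns), set(desc_columns)
    exact = dm & desc
    # Index the unmatched DM names by lowercase form for case-insensitive lookup.
    dm_by_lower = {name.lower(): name for name in dm - exact}
    case_only = {
        name: dm_by_lower[name.lower()]
        for name in desc - exact
        if name.lower() in dm_by_lower
    }
    return {
        "exact": sorted(exact),
        "case_only": case_only,  # DESC name -> DM name
        "dm_only": sorted(dm - exact - set(case_only.values())),
        "desc_only": sorted(desc - exact - set(case_only)),
    }


# Subsets of the two listings above.
dm_subset = ["Dec", "Ra", "objectId", "parentObjectId", "refExtendedness",
             "tract", "x", "xErr", "xy_flag", "y", "yErr"]
desc_subset = ["blendedness", "dec", "extendedness", "objectId",
               "parentObjectId", "ra", "x", "xErr", "xy_flag", "y", "yErr"]
report = compare_columns(dm_subset, desc_subset)
```

A report like this only matches names; whether two columns with the same name carry the same quantity and units still has to be checked separately.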

katrinheitmann commented 2 years ago

This is now obsolete.