LSSTDESC / DC2-production

Configuration, production, validation specifications and tools for the DC2 Data Set.
BSD 3-Clause "New" or "Revised" License

Some skysim5000 hdf5 files have an inconsistent schema that prevents their conversion to parquet #426

Open boutigny opened 2 years ago

boutigny commented 2 years ago

A small fraction of the skysim5000 healpixels (52 out of 1568) in hdf5 format have an inconsistent schema for some native quantities, which prevents their conversion to parquet format. The following fields have been identified as possibly problematic:

- lightcone_replication: int64
- lightcone_rotation: int64
- baseDC2/source_halo_mvir

The inconsistency is between the files corresponding to the 3 redshift intervals.
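For anyone who wants to reproduce the check, here is a minimal sketch of how such a mismatch could be detected. It assumes one HDF5 file per redshift interval for a given healpixel; the function names, the file layout and the dtypes in the demo are illustrative assumptions, not taken from the production code.

```python
from collections import defaultdict


def find_dtype_mismatches(block_dtypes):
    """Return {column: {dtype: [blocks]}} for columns whose dtype differs.

    block_dtypes maps a block label (for example one redshift interval)
    to a {column_name: dtype_string} mapping.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for block, columns in block_dtypes.items():
        for column, dtype in columns.items():
            seen[column][str(dtype)].append(block)
    # Keep only the columns that were seen with more than one dtype.
    return {
        column: dict(by_dtype)
        for column, by_dtype in seen.items()
        if len(by_dtype) > 1
    }


def read_block_dtypes(paths):
    """Collect the dtype of every dataset in each HDF5 file (needs h5py)."""
    import h5py  # imported lazily so the comparison above runs without it

    result = {}
    for path in paths:
        dtypes = {}
        with h5py.File(path, "r") as handle:
            def record(name, obj):
                # visititems walks groups and datasets; keep datasets only.
                if isinstance(obj, h5py.Dataset):
                    dtypes[name] = str(obj.dtype)
            handle.visititems(record)
        result[path] = dtypes
    return result


# Invented dtypes, only to show the shape of the report.
demo = {
    "z_0_1": {"lightcone_rotation": "int64", "ra": "float64"},
    "z_1_2": {"lightcone_rotation": "int32", "ra": "float64"},
}
print(find_dtype_mismatches(demo))
```

Calling `find_dtype_mismatches(read_block_dtypes(paths))` on the three files of one healpixel would then list only the columns that disagree, together with the files that carry each dtype.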

While it would be better to fix this problem upstream, it is also possible to hack the conversion script as in https://github.com/LSSTDESC/DC2-production/tree/u/boutigny/fix_schema_parquet_skysim
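The general idea of such a workaround can be sketched as follows. This is not the code from the linked branch; it only illustrates casting the affected columns to one agreed dtype before each chunk is written. The helper names and the choice of target dtypes are assumptions.

```python
def plan_casts(block_dtypes, target_dtypes):
    """List the (block, column, current, target) casts needed.

    block_dtypes: {block_label: {column_name: dtype_string}}
    target_dtypes: {column_name: dtype_string} agreed for the output.
    """
    casts = []
    for block, columns in block_dtypes.items():
        for column, target in target_dtypes.items():
            current = columns.get(column)
            # Skip columns that are absent or already have the target dtype.
            if current is not None and current != target:
                casts.append((block, column, current, target))
    return casts


def apply_casts(frame, target_dtypes):
    """Cast the listed columns of a pandas DataFrame; others are untouched."""
    present = {c: t for c, t in target_dtypes.items() if c in frame.columns}
    return frame.astype(present)
```

With a pandas DataFrame `df` holding one chunk, `apply_casts(df, targets).to_parquet(path)` would write it with the agreed dtypes. Picking the wider type as the target (for example int64 over int32) avoids losing information in the cast.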

evevkovacs commented 2 years ago

@patricialarsen @yymao The fact that this inconsistency is occurring in just a few healpixels is mysterious. Can you provide a list of the healpixels which have a problem so that we can investigate further? The first 2 variables are copied from the input files used for the production pipeline, so it is possible that those input files have an issue. The last variable is copied from UniverseMachine inputs. What exactly is the problem with baseDC2/source_halo_mvir? None of the above variables are actually produced by the production code; they have been included in the catalog for completeness and provenance. Once we have tracked down the cause, it would be possible to regenerate the subset of affected healpixels.

boutigny commented 2 years ago

@evevkovacs Here is the list of problematic healpixels:

skysim5000_v1.1.1_healpix6093.parquet
skysim5000_v1.1.1_healpix6087.parquet
skysim5000_v1.1.1_healpix6107.parquet
skysim5000_v1.1.1_healpix6220.parquet
skysim5000_v1.1.1_healpix6087.parquet
skysim5000_v1.1.1_healpix6093.parquet
skysim5000_v1.1.1_healpix6107.parquet
skysim5000_v1.1.1_healpix6220.parquet
skysim5000_v1.1.1_healpix6465.parquet
skysim5000_v1.1.1_healpix6483.parquet
skysim5000_v1.1.1_healpix6747.parquet
skysim5000_v1.1.1_healpix6848.parquet
skysim5000_v1.1.1_healpix6870.parquet
skysim5000_v1.1.1_healpix7491.parquet
skysim5000_v1.1.1_healpix7504.parquet
skysim5000_v1.1.1_healpix7641.parquet
skysim5000_v1.1.1_healpix7755.parquet
skysim5000_v1.1.1_healpix7756.parquet
skysim5000_v1.1.1_healpix7895.parquet
skysim5000_v1.1.1_healpix7897.parquet
skysim5000_v1.1.1_healpix8287.parquet
skysim5000_v1.1.1_healpix9813.parquet
skysim5000_v1.1.1_healpix9036.parquet
skysim5000_v1.1.1_healpix9284.parquet
skysim5000_v1.1.1_healpix9809.parquet
skysim5000_v1.1.1_healpix10176.parquet
skysim5000_v1.1.1_healpix10675.parquet
skysim5000_v1.1.1_healpix11296.parquet
skysim5000_v1.1.1_healpix11297.parquet
skysim5000_v1.1.1_healpix11377.parquet
skysim5000_v1.1.1_healpix11456.parquet
skysim5000_v1.1.1_healpix11457.parquet
skysim5000_v1.1.1_healpix11458.parquet
skysim5000_v1.1.1_healpix11459.parquet
skysim5000_v1.1.1_healpix8395.parquet
skysim5000_v1.1.1_healpix9161.parquet
skysim5000_v1.1.1_healpix9164.parquet
skysim5000_v1.1.1_healpix9291.parquet
skysim5000_v1.1.1_healpix9553.parquet
skysim5000_v1.1.1_healpix9679.parquet
skysim5000_v1.1.1_healpix9937.parquet
skysim5000_v1.1.1_healpix10665.parquet
skysim5000_v1.1.1_healpix8032.parquet
skysim5000_v1.1.1_healpix8288.parquet
skysim5000_v1.1.1_healpix9551.parquet
skysim5000_v1.1.1_healpix9935.parquet
skysim5000_v1.1.1_healpix10076.parquet
skysim5000_v1.1.1_healpix10674.parquet
skysim5000_v1.1.1_healpix10884.parquet
skysim5000_v1.1.1_healpix10904.parquet
skysim5000_v1.1.1_healpix11374.parquet
skysim5000_v1.1.1_healpix11376.parquet
skysim5000_v1.1.1_healpix8667.parquet
skysim5000_v1.1.1_healpix9792.parquet
skysim5000_v1.1.1_healpix11382.parquet
skysim5000_v1.1.1_healpix12179.parquet
skysim5000_v1.1.1_healpix8543.parquet
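Some names in this list (for example healpix6087) appear more than once, so it may help to reduce it to unique pixel numbers before investigating. A small sketch, assuming the names always end in `healpix<number>.parquet`; the helper names are made up for illustration.

```python
import re
from collections import Counter

# Matches the trailing "healpix<digits>.parquet" part of a file name.
HEALPIX_NAME = re.compile(r"healpix(\d+)\.parquet$")


def healpix_ids(names):
    """Extract the integer healpix number from each matching file name."""
    ids = []
    for name in names:
        match = HEALPIX_NAME.search(name)
        if match:
            ids.append(int(match.group(1)))
    return ids


def summarize(names):
    """Return (sorted unique pixel numbers, sorted numbers listed repeatedly)."""
    counts = Counter(healpix_ids(names))
    unique = sorted(counts)
    repeated = sorted(pixel for pixel, n in counts.items() if n > 1)
    return unique, repeated


# A short excerpt of the list, pasted as whitespace-separated text.
sample = """
skysim5000_v1.1.1_healpix6093.parquet skysim5000_v1.1.1_healpix6087.parquet
skysim5000_v1.1.1_healpix6087.parquet skysim5000_v1.1.1_healpix12179.parquet
""".split()

unique, repeated = summarize(sample)
print(len(unique), "unique pixels; repeated:", repeated)
```

Pasting the full list into `sample` gives the set of pixels to inspect and shows which entries were listed twice.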

Regarding baseDC2/source_halo_mvir, this is also a dtype mismatch between the files corresponding to the 3 redshift intervals.