LSSTDESC / DC2-production

Configuration, production, validation specifications and tools for the DC2 Data Set.

Ensure columns in merged_cat are in same order to take advantage of native merge #378

Closed: yymao closed this 4 years ago

yymao commented 4 years ago

Currently, make_object_catalog.py generates one parquet file for each patch to allow easy, embarrassingly parallel processing. After all the patches are processed, one has to run merge_parquet_files.py to merge all patches in the same tract into a single parquet file.

In merge_parquet_files.py, the merge is done by loading all parquet files into memory as pandas DataFrames and then using pd.concat to join them. This is necessary because the columns in different patches are not necessarily in the same order.
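Roughly, the current approach looks like this (a minimal sketch, not the exact script code; `merge_with_pandas` is a hypothetical name):

```python
import pandas as pd

def merge_with_pandas(paths, output_path):
    # Load every patch into memory; pd.concat aligns columns by name,
    # tolerating mismatched column orders at the cost of memory and speed.
    frames = [pd.read_parquet(p) for p in paths]
    pd.concat(frames, ignore_index=True, sort=False).to_parquet(output_path)
```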

So why do different patches not have their columns in the same order? Because different patches have different sets of available filters, and the current implementation appends the missing filters at the end of the columns.
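A toy illustration of the problem (hypothetical column names, not the real schema):

```python
import numpy as np
import pandas as pd

all_columns = ['u_flux', 'g_flux', 'r_flux']           # canonical order over all filters

patch_b = pd.DataFrame(columns=['g_flux', 'r_flux'])   # the 'u' filter is missing...
patch_b['u_flux'] = np.nan                             # ...so its column gets appended last
print(list(patch_b.columns))                           # ['g_flux', 'r_flux', 'u_flux']

# Reindexing against a fixed column list restores a consistent order:
patch_b = patch_b.reindex(columns=all_columns)
print(list(patch_b.columns))                           # ['u_flux', 'g_flux', 'r_flux']
```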

To address these issues, this PR makes two changes:

  1. Ensure the column order in different patches is always the same. This change is implemented in make_object_catalog.py.
  2. Use the pyarrow-native method for appending parquet files when we know the schema is consistent. This change is implemented in merge_parquet_files.py.

Note that for the second change, I've only implemented the pyarrow case. There may be an equivalent method for fastparquet, but I haven't looked into it.
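The pyarrow-native path is roughly the following (a sketch under the assumption that all input files share one schema; `merge_with_pyarrow` is a hypothetical name):

```python
import pyarrow.parquet as pq

def merge_with_pyarrow(paths, output_path):
    # Stream each table into a single ParquetWriter instead of
    # concatenating in pandas; assumes all inputs share one schema.
    first = pq.read_table(paths[0])
    with pq.ParquetWriter(output_path, first.schema) as writer:
        writer.write_table(first)
        for path in paths[1:]:
            # write_table raises ValueError if a table's schema differs
            # from the schema the writer was created with.
            writer.write_table(pq.read_table(path))
```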

I have also not tested these two updated scripts because I don't have the machinery set up yet -- @johannct I'd very much appreciate it if you could help with testing them!

yymao commented 4 years ago

Thanks for testing it out @johannct. Did you use --assume-consistent-schema when running the merge script on the command line? If yes, that's great (it means the new merge code works). If not, that's also great (it means that aligning the columns will speed up pandas.concat as well).

As for my comment about fastparquet --- that was just a factual statement that I didn't implement a fastparquet counterpart, not an opinion on whether we should implement it. Personally, I think removing fastparquet support is fine. As long as our own production is OK with pyarrow, reducing unnecessary code is a good idea, as it reduces the maintenance burden.

johannct commented 4 years ago

> Thanks for testing it out @johannct. Did you use --assume-consistent-schema when running the merge script on the command line? If yes, that's great (it means the new merge code works). If not, that's also great (it means that aligning the columns will speed up pandas.concat as well).

I did not use the --assume-consistent-schema option.

> As for my comment about fastparquet --- that was just a factual statement that I didn't implement a fastparquet counterpart, not an opinion on whether we should implement it. Personally, I think removing fastparquet support is fine. As long as our own production is OK with pyarrow, reducing unnecessary code is a good idea, as it reduces the maintenance burden.

+1 then

johannct commented 4 years ago

While we are at it, is there any news on the chunk issue raised by some linear combination of Joe, Francois, and Sam?

johannct commented 4 years ago

@yymao there is an error when I switch on the --assume-consistent-schema option :+1:

```
(lsst-scipipe-4d7b902) /sps/lsst/dataproducts/desc/DC2/Run2.2i/v18.1.0-dev-v5(0)>python ${DC2PROD}/merge_parquet_files.py ${DPDDDIR}/object${TRACT}_*.parquet -o ${DPDDDIR}/object${TRACT}_2.parquet --assume-consistent-schema
Traceback (most recent call last):
  File "/pbs/throng/lsst/software/desc/DC2/Run2.2i/v5/DC2-production/scripts/merge_parquet_files.py", line 82, in <module>
    run(**vars(args))
  File "/pbs/throng/lsst/software/desc/DC2/Run2.2i/v5/DC2-production/scripts/merge_parquet_files.py", line 39, in run
    pqwriter.write_table(t)
  File "/cvmfs/sw.lsst.eu/linux-x86_64/lsst_distrib/w_2019_44/python/miniconda3-4.5.12/envs/lsst-scipipe-4d7b902/lib/python3.7/site-packages/pyarrow/parquet.py", line 407, in write_table
    raise ValueError(msg)
ValueError: Table schema does not match schema used to create file: ...
```

JulienPeloton commented 4 years ago

@johannct this can happen if the metadata from the files are different (for old versions of pyarrow). What if you use `t.schema.remove_metadata()` instead of `t.schema` on line 35 (in `with pq.ParquetWriter(...)`)?
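That is, something like the following in the writer setup (a sketch of the suggestion; `parquet_files` and `output_path` stand in for the script's actual variables):

```python
import pyarrow.parquet as pq

t = pq.read_table(parquet_files[0])
# Strip the file-level metadata so that only field names and types
# remain in the writer schema; differing metadata alone then cannot
# trigger a schema mismatch.
with pq.ParquetWriter(output_path, t.schema.remove_metadata()) as pqwriter:
    pqwriter.write_table(t)
```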

Other question: which version of pyarrow is used here?

yymao commented 4 years ago

@johannct Thanks for the report! Are you running merge_parquet_files.py with --assume-consistent-schema on the new patch files that were generated with the updated make_object_catalog.py in this PR, or on some older patch files that were generated with make_object_catalog.py in the current master?

If the latter, then the error was expected (since the columns are not in the same order). If the former, then we'll need to look into it...

johannct commented 4 years ago

> @johannct Thanks for the report! Are you running merge_parquet_files.py with --assume-consistent-schema on the new patch files that were generated with the updated make_object_catalog.py in this PR, or on some older patch files that were generated with make_object_catalog.py in the current master?
>
> If the latter, then the error was expected (since the columns are not in the same order). If the former, then we'll need to look into it...

I ran the merge on the recently produced patch files, which were generated with the new code, and for which I checked that some files are indeed different from those of a previous run produced before your pull request.

johannct commented 4 years ago

> @johannct this can happen if the metadata from the files are different (for old versions of pyarrow). What if you use `t.schema.remove_metadata()` instead of `t.schema` on line 35 (in `with pq.ParquetWriter(...)`)?
>
> Other question: which version of pyarrow is used here?

Nope, that still bombs with this modification.

johannct commented 4 years ago

@JulienPeloton files location: /sps/lsst/dataproducts/desc/DC2/Run2.2i/w_2019_44-v5/dpdd/t3828_calexp-v1:t3828_coadd-v1/object_table_summary/

JulienPeloton commented 4 years ago

Thanks @johannct. I compared each file's schema against the schema of the first file (True means the schema matches the reference file, False means it does not):

```
reference file: object_3828_00.parquet
object_3828_01.parquet True
object_3828_02.parquet False
object_3828_03.parquet False
object_3828_04.parquet True
object_3828_05.parquet True
object_3828_06.parquet True
object_3828_10.parquet True
object_3828_11.parquet True
object_3828_12.parquet False
object_3828_13.parquet False
object_3828_14.parquet True
object_3828_15.parquet True
object_3828_16.parquet True
object_3828_2.parquet False
object_3828_20.parquet True
object_3828_21.parquet True
object_3828_22.parquet True
object_3828_23.parquet True
object_3828_24.parquet True
object_3828_25.parquet True
object_3828_26.parquet True
object_3828_30.parquet True
object_3828_31.parquet True
object_3828_32.parquet True
object_3828_33.parquet True
object_3828_34.parquet True
object_3828_35.parquet True
object_3828_36.parquet True
object_3828_40.parquet True
object_3828_41.parquet True
object_3828_42.parquet True
object_3828_43.parquet True
object_3828_44.parquet True
object_3828_45.parquet True
object_3828_46.parquet True
object_3828_50.parquet False
object_3828_51.parquet False
object_3828_52.parquet False
object_3828_53.parquet False
object_3828_54.parquet False
object_3828_55.parquet True
object_3828_56.parquet False
object_3828_60.parquet False
object_3828_61.parquet False
object_3828_62.parquet False
object_3828_63.parquet False
object_3828_64.parquet False
object_3828_65.parquet False
object_3828_66.parquet False
```
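(For the record, a check like this can be reproduced with a short pyarrow snippet; a sketch, with the glob pattern assumed:)

```python
import glob
import pyarrow.parquet as pq

files = sorted(glob.glob('object_3828_*.parquet'))
ref_schema = pq.read_schema(files[0])  # schema of the reference file
for f in files[1:]:
    # Schema.equals compares field names, types, and (by default) metadata.
    print(f, pq.read_schema(f).equals(ref_schema))
```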

So some files have the same schema and others don't. If I now take two files with different schemas and inspect all the fields (there are 4512 fields...), all but 21 are the same:

```python
In [41]: import pyarrow.parquet as pq

In [42]: t1 = pq.read_table(os.environ['DPDD_DIR'] + '/object_3828_00.parquet')

In [43]: t2 = pq.read_table(os.environ['DPDD_DIR'] + '/object_3828_02.parquet')

In [44]: noeq = [[t1.schema[index], t2.schema[index]] for index in range(len(t1.schema)) if t1.schema[index] != t2.schema[index]]

In [45]: noeq
Out[45]:
[[pyarrow.Field<u_deblend_nChild: int32>,
  pyarrow.Field<u_deblend_nChild: int64>],
 [pyarrow.Field<u_base_SdssCentroid_xErr: float>,
  pyarrow.Field<u_base_SdssCentroid_xErr: double>],
 [pyarrow.Field<u_base_SdssCentroid_yErr: float>,
  pyarrow.Field<u_base_SdssCentroid_yErr: double>],
 [pyarrow.Field<u_base_InputCount_value: int32>,
  pyarrow.Field<u_base_InputCount_value: int64>],
 [pyarrow.Field<u_base_SdssShape_xxErr: float>,
  pyarrow.Field<u_base_SdssShape_xxErr: double>],
 [pyarrow.Field<u_base_SdssShape_yyErr: float>,
  pyarrow.Field<u_base_SdssShape_yyErr: double>],
 [pyarrow.Field<u_base_SdssShape_xyErr: float>,
  pyarrow.Field<u_base_SdssShape_xyErr: double>],
 [pyarrow.Field<u_base_SdssShape_instFlux_xx_Cov: float>,
  pyarrow.Field<u_base_SdssShape_instFlux_xx_Cov: double>],
 [pyarrow.Field<u_base_SdssShape_instFlux_yy_Cov: float>,
  pyarrow.Field<u_base_SdssShape_instFlux_yy_Cov: double>],
 [pyarrow.Field<u_base_SdssShape_instFlux_xy_Cov: float>,
  pyarrow.Field<u_base_SdssShape_instFlux_xy_Cov: double>],
 [pyarrow.Field<u_base_PsfFlux_area: float>,
  pyarrow.Field<u_base_PsfFlux_area: double>],
 [pyarrow.Field<u_slot_PsfFlux_area: float>,
  pyarrow.Field<u_slot_PsfFlux_area: double>],
 [pyarrow.Field<u_ext_photometryKron_KronFlux_radius: float>,
  pyarrow.Field<u_ext_photometryKron_KronFlux_radius: double>],
 [pyarrow.Field<u_ext_photometryKron_KronFlux_radius_for_radius: float>,
  pyarrow.Field<u_ext_photometryKron_KronFlux_radius_for_radius: double>],
 [pyarrow.Field<u_ext_photometryKron_KronFlux_psf_radius: float>,
  pyarrow.Field<u_ext_photometryKron_KronFlux_psf_radius: double>],
 [pyarrow.Field<u_ext_convolved_ConvolvedFlux_seeing: float>,
  pyarrow.Field<u_ext_convolved_ConvolvedFlux_seeing: double>],
 [pyarrow.Field<u_undeblended_base_PsfFlux_area: float>,
  pyarrow.Field<u_undeblended_base_PsfFlux_area: double>],
 [pyarrow.Field<u_undeblended_ext_photometryKron_KronFlux_radius: float>,
  pyarrow.Field<u_undeblended_ext_photometryKron_KronFlux_radius: double>],
 [pyarrow.Field<u_undeblended_ext_photometryKron_KronFlux_radius_for_radius: float>,
  pyarrow.Field<u_undeblended_ext_photometryKron_KronFlux_radius_for_radius: double>],
 [pyarrow.Field<u_undeblended_ext_photometryKron_KronFlux_psf_radius: float>,
  pyarrow.Field<u_undeblended_ext_photometryKron_KronFlux_psf_radius: double>],
 [pyarrow.Field<u_undeblended_ext_convolved_ConvolvedFlux_seeing: float>,
  pyarrow.Field<u_undeblended_ext_convolved_ConvolvedFlux_seeing: double>]]
```

We can see our friends int32 vs int64 and float vs double... So I guess, @johannct @yymao @wmwv, the problem comes from the way the input files are generated. I'll let you judge!
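One common way such type drift can creep in (an illustration only, not necessarily the cause here): when pandas creates a placeholder column for missing data, the new column defaults to a 64-bit type, while a column read from real data can be 32-bit.

```python
import numpy as np
import pandas as pd

# A patch with real u-band data keeps the native 32-bit type...
with_u = pd.DataFrame({'u_deblend_nChild': np.array([0, 2], dtype='int32')})
print(with_u['u_deblend_nChild'].dtype)     # int32

# ...but a patch missing the u filter gets a placeholder column,
# which pandas creates with a default 64-bit type.
without_u = pd.DataFrame(index=[0, 1]).reindex(columns=['u_deblend_nChild'])
print(without_u['u_deblend_nChild'].dtype)  # float64
```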

johannct commented 4 years ago

Thanks @JulienPeloton. The code is scripts/make_object_catalog.py in DC2-production, in case you see something we haven't caught yet :)

yymao commented 4 years ago

Thanks @JulienPeloton, this is really helpful. I have a guess at what might be wrong. Will look into it...


JulienPeloton commented 4 years ago

Thanks @yymao for inspecting

@johannct I ran the code at CC with the same error as you ;-) Edit: thanks for the script to generate the files...!

johannct commented 4 years ago

Trying out a patch from @yymao. Time to remake the patch files.

johannct commented 4 years ago

The last commit from @yymao did it: the crash with --assume-consistent-schema disappears and the merge completes. @JulienPeloton: /sps/lsst/dataproducts/desc/DC2/Run2.2i/w_2019_44-v5/dpdd/t3828_calexp-v1\:t3828_coadd-v1/object_table_summary (the former one has been moved to object_table_summary_almost instead :) )

johannct commented 4 years ago

I just verified that the schemas of all the new files are identical. I favor the removal of fastparquet. @wmwv, can you comment on that and bless the pull request, please?

wmwv commented 4 years ago

Yes, we should drop support for fastparquet. The Project will be using pyarrow.