add script to generate object catalog in parquet format

LSSTDESC / DC2-production

Configuration, production, validation specifications and tools for the DC2 Data Set.

BSD 3-Clause "New" or "Revised" License

add script to generate object catalog in parquet format #358

Closed yymao closed 5 years ago

yymao commented 5 years ago

This PR adds a script to generate object catalog in parquet format. This new script is called generate_object_catalog.py and replaces the functionality of merge_tract_cat.py.

This is now a draft PR. The function merge_coadd_forced_src has been tested, but the CLI interface has not been tested. Also, the magnitude extraction part still needs to be fixed.

This PR will fix #342.

yymao commented 5 years ago

OK, this is now working and ready for review! Here's an example of how to run several patches:

python generate_object_catalog.py <REPO> 4850 --patches=00,01,02,03 --output_dir=<OUTPUT_DIR>

Here's a real case on NERSC:

python generate_object_catalog.py /global/cscratch1/sd/desc/DC2/data/Run2.1i/rerun/coadd-v1 4850 --patches=00,01 --output_dir=$SCRATCH --verbose

cc @johannct

wmwv commented 5 years ago

Could we call this script make_object_catalog.py instead? It's shorter and means the same thing.

johannct commented 5 years ago

Sorry @yymao for having been unclear when we chatted. The standard patch syntax is X,Y, and when there are several patches the stack nomenclature may be in order: in your example above, that would be 0,0^0,1^0,2^0,3 rather than 00,01,02,03.

johannct commented 5 years ago

If this goes all the way to parquet files, is it ok to run it several times with the same tract but disjoint sets of patches in order to build the full catalogue?

johannct commented 5 years ago

[tanugi@cca001 scripts]$ git diff make_object_catalog.py
diff --git a/scripts/make_object_catalog.py b/scripts/make_object_catalog.py
index 6cb5c1c..edb02f3 100644
--- a/scripts/make_object_catalog.py
+++ b/scripts/make_object_catalog.py
@@ -57,13 +57,12 @@ def generate_object_catalog(output_dir, butler, tract, patches=None,
         patches = ['%d,%d' % patch.getIndex() for patch in skymap[tract]]
     else:
         try:
-            patches = patches.split(',')
+            patches = patches.split('^')
         except AttributeError:
             pass
         else:
-            if not all(len(p) == 2 for p in patches):
+            if not all(len(p) == 3 for p in patches):
                 raise ValueError('patches should be a list or a string in "11,22,33" format')
-            patches = ['{},{}'.format(*p) for p in patches]
 
     for patch in patches:
         if verbose:

yymao commented 5 years ago

Thanks @johannct, I've updated the format for specifying patches as suggested.
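For readers following along, here is a minimal sketch of how the caret-separated patch syntax could be parsed. The helper name parse_patches is hypothetical and this is not the code in make_object_catalog.py; it only illustrates the X,Y^X,Y convention and, as an assumption, accepts multi-digit indices rather than checking for a fixed string length.

```python
def parse_patches(patches):
    """Return a list of 'X,Y' patch names.

    Accepts either a list of 'X,Y' strings or a single string in the
    stack nomenclature, e.g. '0,0^0,1^0,2'. (Hypothetical helper.)
    """
    if isinstance(patches, str):
        patches = patches.split('^')

    parsed = []
    for patch in patches:
        x, sep, y = patch.partition(',')
        # Reject anything that is not two non-negative integers joined by a comma.
        if not (sep and x.isdigit() and y.isdigit()):
            raise ValueError('bad patch {!r}; expected "X,Y"'.format(patch))
        parsed.append('{},{}'.format(int(x), int(y)))
    return parsed


print(parse_patches('0,0^0,1^0,2^0,3'))
# prints ['0,0', '0,1', '0,2', '0,3']
```

With this sketch, the older 00,01,02,03 form raises a ValueError because no caret-separated item has the X,Y shape.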

You asked:

If this goes all the way to parquet files, is it ok to run it several times with the same tract but disjoint sets of patches in order to build the full catalogue?

Yes, each patch will have its own output, so you can run disjoint sets of patches in parallel. Once everything is done, we need to run another script to join the patches in each tract.

johannct commented 5 years ago

ok then for now I can also run on tracts only, that will do away with the second script. Is there another step? I tested the script successfully by the way.

yymao commented 5 years ago

Right, so there will be two steps:

First, generate per-patch files:

REPO=/path/to/butler/repo
TRACT=4850
OUTPUTDIR=/path/to/output_dir

python make_object_catalog.py $REPO $TRACT --patches='0,0^0,1^0,2' --output-dir=$OUTPUTDIR
python make_object_catalog.py $REPO $TRACT --patches='1,0^1,1^1,2' --output-dir=$OUTPUTDIR

After all patches in this tract are done, run:

python merge_parquet_files.py $OUTPUTDIR/object_${TRACT}_*.parquet -o=$OUTPUTDIR/object_tract_$TRACT.parquet --sort-input-files
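The first step can also be scripted when a tract has many patches. The following is a minimal sketch that splits a patch grid into disjoint batches and builds one make_object_catalog.py command line per batch. The helper names patch_batches and build_commands are hypothetical, and the grid size is an assumption passed in by the caller; only the command-line shape is taken from the example above.

```python
import shlex


def patch_batches(n_x, n_y, batch_size):
    """Yield disjoint 'X,Y^X,Y^...' strings covering an n_x by n_y patch grid."""
    patches = ['{},{}'.format(x, y) for x in range(n_x) for y in range(n_y)]
    for start in range(0, len(patches), batch_size):
        yield '^'.join(patches[start:start + batch_size])


def build_commands(repo, tract, output_dir, n_x, n_y, batch_size):
    """Return one shell command per batch of patches (hypothetical helper)."""
    return [
        ' '.join([
            'python', 'make_object_catalog.py',
            shlex.quote(repo), str(tract),
            # shlex.quote adds the single quotes needed around the '^' characters
            '--patches=' + shlex.quote(batch),
            '--output-dir=' + shlex.quote(output_dir),
        ])
        for batch in patch_batches(n_x, n_y, batch_size)
    ]


# A 2 x 3 grid in batches of 3 reproduces the two commands shown above.
for command in build_commands('/path/to/butler/repo', 4850,
                              '/path/to/output_dir', n_x=2, n_y=3, batch_size=3):
    print(command)
```

Each printed line can be submitted as its own job, since every patch writes its own output file; the merge step then runs once all jobs have finished.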

EiffL commented 5 years ago

??? Why do you have to generate per-patch object catalogs?? Michael's scripts were already automatically finding and extracting the object catalogs from the patches using butler magic.

yymao commented 5 years ago

@EiffL because we may want to parallelize the patches

johannct commented 5 years ago

To keep homogeneity with DM: use --patch instead of --patches.

wmwv commented 5 years ago

@johannct Because Python's option parsing accepts any unambiguous prefix of a long option, you can still use --patch instead of --patches.

We use --patches in merge_dia_object.py. We can change both to --patch in some future PR if this ends up being confusing.
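As an illustration of the prefix behaviour, here is a minimal sketch that assumes the script builds its command line with argparse. The parser below is a stand-in with hypothetical arguments, not the actual one from make_object_catalog.py.

```python
import argparse


def parse(argv):
    """Parse argv with a stand-in parser that only defines --patches."""
    # allow_abbrev is True by default, so unambiguous prefixes of long
    # options are accepted.
    parser = argparse.ArgumentParser()
    parser.add_argument('tract', type=int)
    parser.add_argument('--patches')
    parser.add_argument('--output-dir')
    return parser.parse_args(argv)


args = parse(['4850', '--patch=0,0^0,1'])
print(args.patches)
# prints 0,0^0,1
```

The abbreviation only works while it stays unambiguous: if a second option beginning with --patch were added later, argparse would reject --patch as ambiguous.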