Closed yymao closed 3 years ago
In the last few commits, I implemented a checkpoint feature for scripts/write_gcr_to_parquet.py. If a user calls this script with --checkpoint-dir=<path>, then the script will write checkpoint files in
The updates to scripts/write_gcr_to_parquet.py have been tested. This PR is ready for review.
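As a rough illustration of what a checkpoint feature of this kind can look like, here is a minimal sketch. It is not the code in scripts/write_gcr_to_parquet.py: the marker-file scheme and the names checkpoint_path and process_with_checkpoints are assumptions made for this example only.

```python
import os


def checkpoint_path(checkpoint_dir, input_path):
    """Return the marker-file path for one input (hypothetical naming scheme)."""
    return os.path.join(checkpoint_dir, os.path.basename(input_path) + ".done")


def process_with_checkpoints(input_paths, checkpoint_dir, process_one):
    """Run process_one on each input, skipping inputs that already have a marker.

    Returns the list of inputs that were actually processed in this call.
    """
    os.makedirs(checkpoint_dir, exist_ok=True)
    processed = []
    for path in input_paths:
        marker = checkpoint_path(checkpoint_dir, path)
        if os.path.exists(marker):
            continue  # this input was finished in an earlier run
        process_one(path)
        # Write the marker only after the work succeeded, so an interrupted
        # run repeats the unfinished input on restart.
        with open(marker, "w") as f:
            f.write("done\n")
        processed.append(path)
    return processed
```

With this layout, rerunning the same command after an interruption only redoes the inputs that have no marker file yet.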
Thanks @johannct. I've addressed your comments. Do you plan to further review the other two scripts or do you think this is ready for merge?
I reviewed them. Is tqdm really necessary? If it is already included in the DESC environment, I guess it is fine anyway. I could not follow the checkpoint design exactly, but I trust you tested it.
Yes, tqdm is in desc-python -- it is useful to see the progress bar when running these scripts in an interactive session.
re: checkpoint, yep I have tested it. In fact I have used these scripts for generating several data sets, but a formal review is nonetheless helpful.
I'll be merging this. Thanks again, @johannct!
This PR enables three scripts to process multiple files at once:
- scripts/repartition_into_tracts.py: Multiple input files are processed in serial, significantly reducing the overhead of starting Python and loading the skymap. Note that for each file, multiple cores are already used in the calculation of tract and patch.
- scripts/merge_truth_per_tract.py: Multiple input files can be processed in serial or in parallel, depending on --n-cores.
- scripts/write_gcr_to_parquet.py: Multiple input files are processed in serial, reducing the overhead of starting Python and loading catalog metadata. Python multiprocessing is not used here because some catalogs in GCRCatalogs are not picklable.

I have already tested all three scripts.
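The serial-or-parallel choice and the pickling constraint can be sketched as follows. This is an illustration under stated assumptions, not the code in these scripts: merge_one, run_all, and is_picklable are hypothetical names, and the n_cores argument only mirrors the --n-cores option mentioned above.

```python
import pickle
from multiprocessing import Pool


def merge_one(path):
    """Hypothetical stand-in for the per-file work; returns a label for the file."""
    return f"merged:{path}"


def run_all(paths, n_cores=1, worker=merge_one):
    """Process paths serially when n_cores <= 1, otherwise with a process pool."""
    if n_cores <= 1:
        return [worker(p) for p in paths]
    # The worker and each argument must be picklable to reach the child processes.
    with Pool(n_cores) as pool:
        return pool.map(worker, paths)


def is_picklable(obj):
    """Return True if obj can be serialized with pickle, False otherwise."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True
```

An object for which is_picklable returns False cannot be passed to pool.map as an argument, which is the kind of limitation that makes a plain serial loop the simpler choice.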