LSSTDESC / DC2-production

Configuration, production, validation specifications and tools for the DC2 Data Set.
BSD 3-Clause "New" or "Revised" License

Enable processing multiple files at once #407

Closed yymao closed 3 years ago

yymao commented 3 years ago

This PR enables three scripts to process multiple files at once:

I have already tested all three scripts.

yymao commented 3 years ago

In the last few commits, I implemented a checkpoint feature for scripts/write_gcr_to_parquet.py. If a user calls this script with --checkpoint-dir=<path>, the script will write checkpoint files in <path>, allowing the user to see which file is currently being generated and which files have been completed. This feature enables running this script on the flex queue.
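
For readers unfamiliar with the pattern, here is a minimal sketch of how per-file checkpointing of this kind can work. It is not the code in scripts/write_gcr_to_parquet.py: the function name `process_with_checkpoints`, the `.inprogress` and `.done` marker suffixes, and the `process_one` callback are all assumptions made for illustration.

```python
from pathlib import Path


def process_with_checkpoints(input_files, checkpoint_dir, process_one):
    """Process files one by one, recording progress as marker files.

    Hypothetical helper: marker names and layout are illustrative only.
    Returns the list of files processed during this call.
    """
    checkpoint_dir = Path(checkpoint_dir)
    checkpoint_dir.mkdir(parents=True, exist_ok=True)

    processed = []
    for input_file in input_files:
        name = Path(input_file).name
        done_marker = checkpoint_dir / (name + ".done")
        running_marker = checkpoint_dir / (name + ".inprogress")

        # A ".done" marker means an earlier run already finished this file.
        if done_marker.exists():
            continue

        # Mark the file as being generated, do the work, then mark it complete.
        running_marker.touch()
        process_one(input_file)
        running_marker.replace(done_marker)
        processed.append(input_file)

    return processed
```

With markers like these, a job that is preempted and restarted can skip files that already have a `.done` marker, and a leftover `.inprogress` marker shows which file was interrupted.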

yymao commented 3 years ago

The updates to scripts/write_gcr_to_parquet.py have been tested. This PR is ready for review.

yymao commented 3 years ago

Thanks @johannct. I've addressed your comments. Do you plan to further review the other two scripts or do you think this is ready for merge?

johannct commented 3 years ago

I reviewed them. Is tqdm really necessary? If it is already included in the DESC environment, I guess it is fine anyway. I could not follow the checkpoint design exactly, but I trust you tested it.

yymao commented 3 years ago

Yes, tqdm is in desc-python -- it is useful to see the progress bar when running these scripts in an interactive session.

Re: checkpoint, yep, I have tested it. In fact, I have used these scripts to generate several data sets, but a formal review is nonetheless helpful.

I'll be merging this. Thanks again, @johannct!