astronomy-commons / hipscat-import

HiPSCat import - generate HiPSCat-partitioned catalogs
https://hipscat-import.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Catalog import: Option to run a single stage #361

Closed. delucchi-cmu closed this issue 1 month ago

delucchi-cmu commented 1 month ago

Feature request

Add a new argument to specify which stages of the pipeline to run. If not specified, run the whole pipeline.

e.g. run_stages=['mapping', 'splitting']
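A rough sketch of how this might be used, assuming the existing `ImportArguments` / `pipeline` entry points; the `run_stages` keyword is the feature being requested (it does not exist yet), and the other keyword arguments are illustrative placeholders rather than the exact current signature:

```python
# Sketch only: `run_stages` is the proposed (not yet existing) argument;
# paths and catalog names below are illustrative placeholders.
from hipscat_import.catalog.arguments import ImportArguments
from hipscat_import.pipeline import pipeline

args = ImportArguments(
    input_path="/data/my_survey/new_batch",
    output_path="/catalogs",
    output_catalog_name="my_survey",
    run_stages=["mapping", "splitting"],  # run only these stages; omit to run everything
)
pipeline(args)
```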


johnny-gossick commented 1 month ago

For static datasets, a hipscat import job may only need to run once, so its total runtime is not a major concern. However, some datasets receive regular updates, so we may need to run hipscat import on either the full dataset or just the new data (if possible) on a regular schedule. For frequently updated catalog data, it becomes very important to reduce the runtime of an import job by increasing parallelism and defining the ideal worker configuration for each stage of the job. It will also be important to automate these regular import jobs to reduce manual effort.

Different stages of the pipeline could benefit from different amounts of memory, different numbers of CPU cores, and different numbers of Dask workers. For example, the optimal worker configuration for the mapping stage of an import job with 32 input files might be 32 workers with one core each and X GB of memory, because only 32 tasks would be created. Later stages may create many more tasks and could benefit from a larger number of Dask workers. In general, some stages may require different amounts of CPU cores or memory.
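
For illustration, those per-stage requirements could be written down as a simple mapping like the one below; the numbers are made up, and stage names other than "mapping" and "splitting" are assumptions rather than taken from this thread:

```python
# Illustrative numbers only, not recommendations. Stage names beyond
# "mapping" and "splitting" are assumptions for the sake of the example.
STAGE_WORKER_CONFIG = {
    "mapping":   {"workers": 32,  "cores": 1, "memory_gb": 8},   # roughly one task per input file
    "splitting": {"workers": 128, "cores": 1, "memory_gb": 16},  # task count grows well past the file count
    "reducing":  {"workers": 256, "cores": 1, "memory_gb": 4},
}
```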

If hipscat import had the option to choose which stage(s) of the import pipeline to run, then we may be able to use Dask Gateway in an autoscaling Kubernetes cluster to dynamically spin up different types and numbers of Dask workers (up to a set limit) for each stage of the pipeline. This could dramatically decrease the total runtime of each import job, which would allow us to automate these jobs and run them on a regular schedule.
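
As a concrete sketch of what that could look like with Dask Gateway (run_stages is the proposed argument; the worker counts, memory sizes, and exposed cluster option names such as worker_memory depend on the particular deployment):

```python
# Sketch, assuming a Dask Gateway deployment on an autoscaling Kubernetes cluster.
# `run_stages` is the proposed argument; the worker counts, memory sizes, and
# option names (e.g. worker_memory) are placeholders for a real deployment.
from dask_gateway import Gateway
from hipscat_import.catalog.arguments import ImportArguments
from hipscat_import.pipeline import pipeline_with_client

gateway = Gateway()

for stage, n_workers, worker_memory_gb in [
    ("mapping", 32, 8),      # e.g. one small worker per input file
    ("splitting", 128, 16),  # more, larger workers for the bigger task count
]:
    options = gateway.cluster_options()
    options.worker_memory = worker_memory_gb  # placeholder option name
    cluster = gateway.new_cluster(options)
    cluster.scale(n_workers)
    with cluster.get_client() as client:
        args = ImportArguments(
            input_path="/data/my_survey/new_batch",  # illustrative placeholders
            output_path="/catalogs",
            output_catalog_name="my_survey",
            run_stages=[stage],                      # proposed argument
        )
        pipeline_with_client(args, client)
    cluster.shutdown()
```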

delucchi-cmu commented 1 month ago

@troyraen I think this would also have helped you in the past with performing manual verification between stages of the import pipeline.

troyraen commented 1 month ago

Yes ➕1 for this. I'm less concerned about manual verification between steps now, with your recent updates @delucchi-cmu (thank you!) and my growing familiarity with the pipeline (trusting it to fail appropriately). The big benefit for me would be the ability to use different cluster configurations for different stages, like @johnny-gossick said. In some cases, certain stages like "mapping" will overload our filesystem if there are too many workers, while other steps can use a lot more workers. I end up copying the main run function into a separate file so that I can scale the cluster between steps.
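
For reference, that workaround ends up looking roughly like the sketch below; run_mapping and run_splitting are hypothetical stand-ins for the copied pipeline code, not real hipscat-import functions:

```python
# Sketch of the current workaround: one copied driver that rescales the cluster
# between stages. `run_mapping` / `run_splitting` are hypothetical stand-ins for
# the copied pipeline code, not real hipscat-import functions.
from dask.distributed import Client, LocalCluster

def run_mapping(client):
    """Hypothetical stand-in for the copied mapping-stage code."""

def run_splitting(client):
    """Hypothetical stand-in for the copied splitting-stage code."""

cluster = LocalCluster(n_workers=8)   # few workers so mapping does not overload the filesystem
with Client(cluster) as client:
    run_mapping(client)
    cluster.scale(64)                 # later stages can use many more workers
    run_splitting(client)
cluster.close()
```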