OutThere-JWST / NISPureParallel

NIRISS Pure Parallel Processing Scripts
GNU General Public License v3.0

Automated Pipeline of NIRISS WFSS Pure Parallel Data

This repository contains the scripts to process all of the available JWST NIRISS WFSS pure parallel observations from PASSAGE (1571) and OutThere (3383 + 4681). It uses the environment manager pixi to install software, combined with the Pythonic workflow manager Snakemake to distribute the parallelizable steps of the workflow, and it can be scaled to cluster environments. To run the workflow yourself, you will need to do the following:

  1. Install Pixi
  2. Clone this Repository
  3. Configure Grizli and CRDS
  4. Set Up Pipeline
  5. Pipeline Description
  6. Snakemake!
  7. Summary

Install Pixi

This pipeline uses pixi as an environment manager. To learn more about pixi, check out their documentation; most users can install it with: curl -fsSL https://pixi.sh/install.sh | bash
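
Once installed (you may need to restart your shell so that pixi is on your PATH), you can verify the installation with: pixi --version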

Clone this Repository

To get started, the first step is of course to clone this repository, either over SSH or HTTPS:

git clone git@github.com:OutThere-JWST/NISPureParallel.git

git clone https://github.com/OutThere-JWST/NISPureParallel.git


Configure Grizli and CRDS

Those who have installed grizli and/or the JWST pipeline before may be familiar with the process of configuring the relevant environment variables and downloading the necessary configuration files. pixi makes this process easy: running pixi run setup will automatically install all necessary packages, configure the relevant environment variables, and download the grizli configuration files, including the latest NIRISS dispersion solutions and WFSS backgrounds.

Advanced users who have existing CRDS cache and grizli conf locations can change the defaults in resources/setupShell.sh. Otherwise, these files are downloaded to new directories inside this repository.
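
For reference, these are the kinds of variables the setup configures (a sketch only: the variable names are the ones CRDS and grizli expect, but the paths here are illustrative, and the actual defaults live in resources/setupShell.sh):

```sh
# Illustrative values only; the real defaults are set by resources/setupShell.sh
export CRDS_PATH="$PWD/crds_cache"                    # local CRDS reference-file cache
export CRDS_SERVER_URL="https://jwst-crds.stsci.edu"  # the JWST CRDS server
export GRIZLI="$PWD/grizli"                           # grizli home, holding its conf files
```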

Set Up Pipeline

This pipeline processes the data in fields. Before running the workflow, compute the field definitions with: pixi run computeFields

If you are running this on a cluster (or even on a powerful server), you will likely need to write a Snakemake configuration profile for the workflow. You can see examples in the profiles/ directory. Please get in touch if you need one made for your environment.

Essentially, this allows Snakemake to submit SLURM jobs on your behalf. As such, the profile requires basic SLURM information, such as the number of available CPUs per node and the number of jobs Snakemake can submit at once.
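
As an illustration, such a profile might look like the following (a sketch only, assuming Snakemake >= 8 with the SLURM executor plugin; the file path, partition name, and resource values are placeholders to adapt to your cluster):

```yaml
# profiles/example/config.yaml (hypothetical)
executor: slurm          # hand each job to SLURM instead of running locally
jobs: 50                 # how many jobs Snakemake may have submitted at once
default-resources:
  slurm_partition: "your-partition"
  cpus_per_task: 8
  mem_mb: 16000
  runtime: 240           # minutes
```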

In addition, if the packages necessary for Snakemake are updated, you will need to clean the Snakemake cache: pixi run clean

Pipeline Description

The following is a brief description of Snakemake and the pipeline developed in this repository. All of the relevant scripts are contained within the workflow directory. Snakemake begins by reading the Snakefile, which describes the rules necessary to create output files from input files. Snakemake then computes the DAG necessary to create all requested output files and distributes the jobs in parallel where possible. Each rule requires inputs, outputs, and the code necessary to produce the outputs from the inputs. Here I briefly describe the stages of our pipeline:
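
To make this concrete, a rule takes roughly the following shape (a schematic, not taken from this repository's Snakefile; the log paths and script name are illustrative, though they mirror the FIELDS/<field>/logs/ layout used below):

```python
# A schematic Snakemake rule; Snakemake builds the DAG by matching
# each rule's inputs against the outputs of other rules.
rule stage1:
    input:
        "FIELDS/{field}/logs/download.log"  # hypothetical output of the download rule
    output:
        "FIELDS/{field}/logs/stage1.log"    # marks this stage complete for a field
    shell:
        "python workflow/scripts/stage1.py {wildcards.field} &> {output}"
```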

0) Download Data. The UNCAL images for the field in question are downloaded in parallel. If the data already exist, they will not be re-downloaded.

1) Stage1 Processing. Here we convert the NIRISS UNCAL files into RATE files using the latest version of the JWST pipeline, with the additional ColumnJump step, which can offer improvements for NIRISS fields. In addition, we apply some 1/f correction in Stage 1 for imaging. (A schematic of this stage is sketched after this list.)

2) Preprocessing. Here the data are pre-processed using grizli: in a nutshell, a modified version of the Stage2 pipeline is applied, and dithered frames are drizzled together.

3) Mosaicking. Frames in the same field are mosaicked together, and a detection catalog is measured on an IR image made by combining all available direct filters.

4) Contamination Modelling. Based on the direct image and detection catalog, we create a model of the dispersed image that can be used to model the contamination from overlapping traces.

5) Spectral Extraction. 2D spectra are extracted from the dispersed images.

6) Redshift Fitting. Redshifts are fit to the extracted spectra.

7) FITSMap. In the final step, the products are collected and a FITSMap is created highlighting all of the results of the data analysis.
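
As an illustration of stage 1, the standard jwst package's call interface looks like the following (a sketch only: the filename and output directory are placeholders, and the repository-specific ColumnJump insertion and 1/f correction are omitted):

```python
# Schematic only; this is not the repository's actual stage-1 script.
from jwst.pipeline import Detector1Pipeline

# Convert a single UNCAL exposure into a RATE file (filename is illustrative).
Detector1Pipeline.call(
    "jw01571001001_02101_00001_nis_uncal.fits",
    save_results=True,           # write the *_rate.fits product
    output_dir="FIELDS/leo-00",  # hypothetical output location
)
```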

Snakemake!

Now we just have to type one command and watch as our workflow is distributed over the cluster: pixi run snakemake --workflow-profile profiles/your-profile And that's it! Snakemake will distribute the workflow over the cluster for you! Two additional tricks that you should be aware of:

If you just want to run a single field, simply provide the relevant field output file as the target of the command: snakemake FIELDS/leo-00/logs/fmap.log.

If you've edited a step of the pipeline and want to force just that step to re-run, you can delete the relevant logfile from the field's log directory.
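
For example, to re-run a hypothetical intermediate step of the leo-00 field (the deleted log name here is illustrative; check the field's logs/ directory for the actual names):

```sh
rm FIELDS/leo-00/logs/contam.log                # hypothetical intermediate logfile
pixi run snakemake FIELDS/leo-00/logs/fmap.log  # re-runs that step and everything downstream
```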

Note: Cluster users will need to make sure that the required environment modules are loaded in their bashrc/profile. There should be a way to do this with Snakemake, but I haven't managed to get it to work yet. Importantly, conda and TeX are needed to run this pipeline; both are available on the MPIA cluster as modules.

Note: If you get tired of prepending pixi run to every command, you can simply run pixi shell to launch a shell within the environment (much like conda activate).

Summary

With just four commands we can run the entire pipeline:

git clone git@github.com:OutThere-JWST/NISPureParallel.git
pixi run setup
pixi run computeFields
pixi run snakemake

In practice, you will likely want to customize the final step, either activating a cluster profile when running over the entire dataset or specifying a single field to process. For example:
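
```sh
# Entire dataset, distributed via a cluster profile (profile name is a placeholder):
pixi run snakemake --workflow-profile profiles/your-profile

# ...or just a single field:
pixi run snakemake FIELDS/leo-00/logs/fmap.log
```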