I'm sure there's a better way to do this, but right now it's easy to assign one task per template image. I'm not sure how to pass in the number of template images and set the number of tasks from that, so for now the number of tasks in the *.sh file is just the number of template images.
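One possible way to avoid hard-coding that (a sketch only; the helper script and the assumption of one *.fits file per template are mine, not something in the repo) would be to count the templates in a tiny Python helper and let the submission wrapper pass the result to sbatch, e.g. sbatch --ntasks=$(python count_templates.py /path/to/templates) job.sh:

# count_templates.py: hypothetical helper, not part of the repo.
# Prints the number of template FITS images so a wrapper can set --ntasks.
import sys
from pathlib import Path

def count_templates(template_dir):
    """Count template images, assuming one *.fits file per template."""
    return len(list(Path(template_dir).glob("*.fits")))

if __name__ == "__main__":
    print(count_templates(sys.argv[1]))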
Updated the .sh file for multiprocessing, in preparation for batch submission. It appears to be running, and images are subtracting successfully, but I'm getting a bunch of OSError: [Errno 116] Stale file handle errors. I also need to figure out why the output from get_templates() is printed more than once (ntasks times) per output log, because the multiprocessing pool isn't initialized until after get_templates() is called. Here's the stale file handle error:
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/hpc/group/cosmology/lna18/miniconda3/envs/repeatability/lib/python3.11/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
^^^^^^^^^^^^^^^^^^^
File "/hpc/group/cosmology/lna18/miniconda3/envs/repeatability/lib/python3.11/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
^^^^^^^^^^^^^^^^
File "/hpc/group/cosmology/lna18/rsim_photometry/sfft_and_animate.py", line 242, in run
skysubimgpath, ddimgpath = sfft(ra,dec,band,sci_pointing,sci_sca,t_pointing,t_sca,verbose=verbose)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/hpc/group/cosmology/lna18/rsim_photometry/sfft_and_animate.py", line 116, in sfft
sci_skysub_path = sky_subtract(band=band,pointing=sci_pointing,sca=sci_sca)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/hpc/group/cosmology/lna18/phrosty/phrosty/imagesubtraction.py", line 76, in sky_subtract
gz_and_ext(original_imgpath, decompressed_path)
File "/hpc/group/cosmology/lna18/phrosty/phrosty/imagesubtraction.py", line 56, in gz_and_ext
with gzip.open(in_path,'rb') as f_in, open(out_path,'wb') as f_out:
OSError: [Errno 116] Stale file handle
"""
It happened three times at the beginning of the log and hasn't appeared again since.
The log also prints the same science/template pairs multiple times at the beginning of the file. So something is being parallelized where it shouldn't be, and I think SLURM is confused.
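My best guess at the mechanism (not verified): with --ntasks set, srun launches ntasks independent copies of the whole script, and every copy runs everything outside the pool, including get_templates(), and then opens its own pool over the same files. A throwaway diagnostic along these lines (hypothetical, not in the repo) would make the duplication obvious:

# Hypothetical one-off check: if srun is launching several copies of the
# script, each copy prints a different SLURM_PROCID and PID, and anything
# outside the Pool runs once per copy.
import os
print("SLURM_PROCID =", os.environ.get("SLURM_PROCID"), "| PID =", os.getpid())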
Deleted ntasks as a SLURM input and am using cpus-per-task only; this fixed both of the above issues (the repeated output and the stale file handles).
Last night, all bands ran for ~9 hours and then failed with this error:
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/hpc/group/cosmology/lna18/miniconda3/envs/repeatability/lib/python3.11/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
^^^^^^^^^^^^^^^^^^^
File "/hpc/group/cosmology/lna18/miniconda3/envs/repeatability/lib/python3.11/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
^^^^^^^^^^^^^^^^
File "/hpc/group/cosmology/lna18/rsim_photometry/sfft_and_animate.py", line 242, in run
skysubimgpath, ddimgpath = sfft(ra,dec,band,sci_pointing,sci_sca,t_pointing,t_sca,verbose=verbose)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/hpc/group/cosmology/lna18/rsim_photometry/sfft_and_animate.py", line 116, in sfft
sci_skysub_path = sky_subtract(band=band,pointing=sci_pointing,sca=sci_sca)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/hpc/group/cosmology/lna18/phrosty/phrosty/imagesubtraction.py", line 76, in sky_subtract
gz_and_ext(original_imgpath, decompressed_path)
File "/hpc/group/cosmology/lna18/phrosty/phrosty/imagesubtraction.py", line 61, in gz_and_ext
newhdu.writeto(out_path, overwrite=True)
File "/hpc/group/cosmology/lna18/miniconda3/envs/repeatability/lib/python3.11/site-packages/astropy/io/fits/hdu/hdulist.py", line 1032, in writeto
fileobj = _File(fileobj, mode=mode, overwrite=overwrite)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/hpc/group/cosmology/lna18/miniconda3/envs/repeatability/lib/python3.11/site-packages/astropy/io/fits/file.py", line 218, in __init__
self._open_filename(fileobj, mode, overwrite)
File "/hpc/group/cosmology/lna18/miniconda3/envs/repeatability/lib/python3.11/site-packages/astropy/io/fits/file.py", line 630, in _open_filename
self._overwrite_existing(overwrite, None, True)
File "/hpc/group/cosmology/lna18/miniconda3/envs/repeatability/lib/python3.11/site-packages/astropy/io/fits/file.py", line 511, in _overwrite_existing
os.remove(self.name)
FileNotFoundError: [Errno 2] No such file or directory: '/work/lna18/imsub_out/unzip/Roman_TDS_simple_model_Y106_35193_8.fits'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/hpc/group/cosmology/lna18/rsim_photometry/sfft_and_animate.py", line 341, in <module>
parse_and_run()
File "/hpc/group/cosmology/lna18/rsim_photometry/sfft_and_animate.py", line 337, in parse_and_run
multiproc_run(args.oid, args.band, args.verbose)
File "/hpc/group/cosmology/lna18/rsim_photometry/sfft_and_animate.py", line 276, in multiproc_run
process = pool.map(partialfunc, templates)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/hpc/group/cosmology/lna18/miniconda3/envs/repeatability/lib/python3.11/multiprocessing/pool.py", line 367, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/hpc/group/cosmology/lna18/miniconda3/envs/repeatability/lib/python3.11/multiprocessing/pool.py", line 774, in get
raise self._value
FileNotFoundError: [Errno 2] No such file or directory: '/work/lna18/imsub_out/unzip/Roman_TDS_simple_model_Y106_35193_8.fits'
srun: error: dcc-core-33: task 0: Exited with exit code 1
This file does exist, and it's a science image. I think it's possible that two of the multiprocessing workers were trying to touch the same file at the same time?
I think I should take the sky-subtraction step out of the parallelization, or at least run it separately from the other multiprocessing pool. The science images are the only ones that get used multiple times, so we really shouldn't be re-running them through the pool along with the rest of the SFFT steps.
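A minimal sketch of that restructuring (the variable names, the placeholder worker, and the (band, pointing, sca) layout are assumptions; only sky_subtract and Pool.map come from the code in the traceback):

# Sketch only, not the actual sfft_and_animate.py: sky-subtract each unique
# science image exactly once, serially, before any workers start, so no two
# processes ever touch the same decompressed file.
from multiprocessing import Pool

from phrosty.imagesubtraction import sky_subtract

def process_template(template):
    # Placeholder for the real per-template work (cross-convolution, SFFT, ...).
    t_pointing, t_sca = template
    return (t_pointing, t_sca)

def run_all(science_images, templates, nprocs=4):
    # Serial pre-pass over the unique (band, pointing, sca) science images.
    for band, pointing, sca in science_images:
        sky_subtract(band=band, pointing=pointing, sca=sca)
    # Only the per-template SFFT steps run in the pool.
    with Pool(processes=nprocs) as pool:
        return pool.map(process_template, templates)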
Did the above (sky subtraction now runs separately from the template pool), and it seems to be running much better now. A few notes on the modifications I made:

- I changed how the skysub_dir keyword argument to sfft_and_animate.sfft() is handled. This is because I was trying to use functools.partial to create the process pool, and it was misbehaving with the combination of positional arguments and more than one keyword argument. I should figure this out eventually and fix it properly (see the sketch below).
- --cpus-per-task is no longer tied to the number of templates; the two can now be specified separately.

Okay. Now running into an issue where the call to Customized_Packet.CP in SFFT eats 150 GB of CPU memory per process. I have discussed and debugged this extensively with @rknop. Working to move to NERSC and use the GPU backend for SFFT. This will involve refactoring, probably into three parts: a CPU preprocessing stage, a GPU stage for the SFFT subtraction itself, and CPUs again for post-processing.
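On the functools.partial note above, here's a generic illustration of a pattern that avoids the positional/keyword mixing (the worker and all the values are made up; this is not the real sfft() signature): let pool.map supply the template as the worker's first positional argument and freeze everything else by keyword.

# Generic illustration with a hypothetical worker and made-up values.
# pool.map passes each template as the first positional argument; every other
# argument is frozen by keyword, so partial never mixes positionals with the
# mapped item.
from functools import partial
from multiprocessing import Pool

def worker(template, ra, dec, band, skysub_dir=None, verbose=False):
    t_pointing, t_sca = template
    return (t_pointing, t_sca, ra, dec, band, skysub_dir, verbose)

if __name__ == "__main__":
    templates = [(11111, 1), (22222, 2)]  # made-up (pointing, sca) pairs
    partialfunc = partial(worker, ra=7.5, dec=-44.8, band="Y106",
                          skysub_dir="/tmp/skysub", verbose=True)
    with Pool(2) as pool:
        print(pool.map(partialfunc, templates))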
Notes from Rob:
On Perlmutter, you get something like 60 CPUs with each GPU (if memory serves; might be 30, or 32). If you're on a shared GPU node, then you want to request 1 GPU, 32 (I think) CPUs, and m/4 memory (where m is the memory of one node). If you're going for an exclusive GPU node, then you'll get 4 GPUs, which could be good, but management of that is more complicated.

Per https://docs.nersc.gov/systems/perlmutter/architecture/, the GPU nodes have 4 GPUs and 1 CPU with 64 cores, so ask for 1 GPU and 16 CPUs (I think that's 16 tasks per node or some such). They have 256 GB of RAM, so ask for 64 GB of memory. (That will limit how many tasks you can run; you'll have to figure out how much memory the CPU parts of your pipeline use.) See also https://docs.nersc.gov/jobs/policy/.
The plan is changing. Refactoring into separate files:
# Step 1: Get all images the object is in.
python -u get_object_instances.py "$sn"
# Step 2: Sky subtract, align images to be in DIA.
# WAIT FOR COMPLETION.
# Step 3: Get, align, save PSFs; cross-convolve.
srun python -u preprocess.py [arguments]
# WAIT FOR COMPLETION.
# Step 4: Differencing (GPU).
srun python -u sfftdiff.py [arguments]
# WAIT FOR COMPLETION.
# Step 5: Generate decorrelation kernel, apply to diff. image and science image, make stamps.
srun python -u postprocess.py [arguments]
The loop can easily be divided into however many chunks you want so it runs faster. There's no reason for this code to take 6 hours.
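For example (a sketch under the assumption that the work is driven by a list of science/template pairs; the job-array chunking below is not in the current code), a SLURM job array could hand each task its own slice of the loop:

# Sketch only: split the full list of pairs across SLURM array tasks, e.g.
# submit with `sbatch --array=0-7 job.sh` and each task processes ~1/8 of the
# pairs instead of the whole loop.
import os

def my_chunk(pairs):
    """Return the slice of `pairs` belonging to this SLURM array task."""
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", 0))
    n_tasks = int(os.environ.get("SLURM_ARRAY_TASK_COUNT", 1))
    return pairs[task_id::n_tasks]  # simple round-robin split

if __name__ == "__main__":
    all_pairs = [(sci, tmpl) for sci in range(6) for tmpl in range(3)]  # dummy data
    for sci, tmpl in my_chunk(all_pairs):
        print(f"would process science image {sci} against template {tmpl}")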