I'm sure there's a better way to do this, but right now it's easy to assign one task per template image. I'm not sure how to pass in the number of template images and set the number of tasks from that, so for now the number of tasks in the *.sh file is just the number of template images.
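One possible way to avoid hard-coding that (a sketch only; the helper script and the assumption of one *.fits file per template are mine, not something in the repo) would be to count the templates in a tiny Python helper and let the submission wrapper pass the result to sbatch, e.g. sbatch --ntasks=$(python count_templates.py /path/to/templates) job.sh:

# count_templates.py: hypothetical helper, not part of the repo.
# Prints the number of template FITS images so a wrapper can set --ntasks.
import sys
from pathlib import Path

def count_templates(template_dir):
    """Count template images, assuming one *.fits file per template."""
    return len(list(Path(template_dir).glob("*.fits")))

if __name__ == "__main__":
    print(count_templates(sys.argv[1]))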
Updated the .sh file for multiprocessing, in preparation for batch submission. It appears to be running, and images are subtracting successfully, but I'm getting a bunch of OSError: [Errno 116] Stale file handle errors. I also need to figure out why the output from get_templates() is printed more than once (ntasks times) per output log, because the multiprocessing pool isn't initialized until after get_templates() is called. Here's the stale file handle error:
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/hpc/group/cosmology/lna18/miniconda3/envs/repeatability/lib/python3.11/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
^^^^^^^^^^^^^^^^^^^
File "/hpc/group/cosmology/lna18/miniconda3/envs/repeatability/lib/python3.11/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
^^^^^^^^^^^^^^^^
File "/hpc/group/cosmology/lna18/rsim_photometry/sfft_and_animate.py", line 242, in run
skysubimgpath, ddimgpath = sfft(ra,dec,band,sci_pointing,sci_sca,t_pointing,t_sca,verbose=verbose)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/hpc/group/cosmology/lna18/rsim_photometry/sfft_and_animate.py", line 116, in sfft
sci_skysub_path = sky_subtract(band=band,pointing=sci_pointing,sca=sci_sca)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/hpc/group/cosmology/lna18/phrosty/phrosty/imagesubtraction.py", line 76, in sky_subtract
gz_and_ext(original_imgpath, decompressed_path)
File "/hpc/group/cosmology/lna18/phrosty/phrosty/imagesubtraction.py", line 56, in gz_and_ext
with gzip.open(in_path,'rb') as f_in, open(out_path,'wb') as f_out:
OSError: [Errno 116] Stale file handle
"""
It happened three times at the beginning of the log and hasn't appeared again since.
The log also prints the same science/template pairs multiple times at the beginning of the file. So something is being parallelized where it shouldn't be, and I think SLURM is confused.
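My best guess at the mechanism (not verified): with --ntasks set, srun launches ntasks independent copies of the whole script, and every copy runs everything outside the pool, including get_templates(), and then opens its own pool over the same files. A throwaway diagnostic along these lines (hypothetical, not in the repo) would make the duplication obvious:

# Hypothetical one-off check: if srun is launching several copies of the
# script, each copy prints a different SLURM_PROCID and PID, and anything
# outside the Pool runs once per copy.
import os
print("SLURM_PROCID =", os.environ.get("SLURM_PROCID"), "| PID =", os.getpid())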
Deleted ntasks as a SLURM input and am using cpus-per-task only; this fixed both of the above issues (the repeated output and the stale file handles).
Last night, all bands ran for ~9 hours and then failed with this error:
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/hpc/group/cosmology/lna18/miniconda3/envs/repeatability/lib/python3.11/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
^^^^^^^^^^^^^^^^^^^
File "/hpc/group/cosmology/lna18/miniconda3/envs/repeatability/lib/python3.11/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
^^^^^^^^^^^^^^^^
File "/hpc/group/cosmology/lna18/rsim_photometry/sfft_and_animate.py", line 242, in run
skysubimgpath, ddimgpath = sfft(ra,dec,band,sci_pointing,sci_sca,t_pointing,t_sca,verbose=verbose)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/hpc/group/cosmology/lna18/rsim_photometry/sfft_and_animate.py", line 116, in sfft
sci_skysub_path = sky_subtract(band=band,pointing=sci_pointing,sca=sci_sca)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/hpc/group/cosmology/lna18/phrosty/phrosty/imagesubtraction.py", line 76, in sky_subtract
gz_and_ext(original_imgpath, decompressed_path)
File "/hpc/group/cosmology/lna18/phrosty/phrosty/imagesubtraction.py", line 61, in gz_and_ext
newhdu.writeto(out_path, overwrite=True)
File "/hpc/group/cosmology/lna18/miniconda3/envs/repeatability/lib/python3.11/site-packages/astropy/io/fits/hdu/hdulist.py", line 1032, in writeto
fileobj = _File(fileobj, mode=mode, overwrite=overwrite)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/hpc/group/cosmology/lna18/miniconda3/envs/repeatability/lib/python3.11/site-packages/astropy/io/fits/file.py", line 218, in __init__
self._open_filename(fileobj, mode, overwrite)
File "/hpc/group/cosmology/lna18/miniconda3/envs/repeatability/lib/python3.11/site-packages/astropy/io/fits/file.py", line 630, in _open_filename
self._overwrite_existing(overwrite, None, True)
File "/hpc/group/cosmology/lna18/miniconda3/envs/repeatability/lib/python3.11/site-packages/astropy/io/fits/file.py", line 511, in _overwrite_existing
os.remove(self.name)
FileNotFoundError: [Errno 2] No such file or directory: '/work/lna18/imsub_out/unzip/Roman_TDS_simple_model_Y106_35193_8.fits'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/hpc/group/cosmology/lna18/rsim_photometry/sfft_and_animate.py", line 341, in <module>
parse_and_run()
File "/hpc/group/cosmology/lna18/rsim_photometry/sfft_and_animate.py", line 337, in parse_and_run
multiproc_run(args.oid, args.band, args.verbose)
File "/hpc/group/cosmology/lna18/rsim_photometry/sfft_and_animate.py", line 276, in multiproc_run
process = pool.map(partialfunc, templates)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/hpc/group/cosmology/lna18/miniconda3/envs/repeatability/lib/python3.11/multiprocessing/pool.py", line 367, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/hpc/group/cosmology/lna18/miniconda3/envs/repeatability/lib/python3.11/multiprocessing/pool.py", line 774, in get
raise self._value
FileNotFoundError: [Errno 2] No such file or directory: '/work/lna18/imsub_out/unzip/Roman_TDS_simple_model_Y106_35193_8.fits'
srun: error: dcc-core-33: task 0: Exited with exit code 1
This file does exist, and it's a science image. I think it's possible that two of the multiprocessing workers were trying to touch the same file at the same time?
I think I should take the sky-subtraction step out of the parallelization, or at least run it separately from the other multiprocessing pool. The science images are the only ones that get used multiple times, so we really shouldn't be re-running them through the pool along with the rest of the SFFT steps.
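A minimal sketch of that restructuring (the variable names, the placeholder worker, and the (band, pointing, sca) layout are assumptions; only sky_subtract and Pool.map come from the code in the traceback):

# Sketch only, not the actual sfft_and_animate.py: sky-subtract each unique
# science image exactly once, serially, before any workers start, so no two
# processes ever touch the same decompressed file.
from multiprocessing import Pool

from phrosty.imagesubtraction import sky_subtract

def process_template(template):
    # Placeholder for the real per-template work (cross-convolution, SFFT, ...).
    t_pointing, t_sca = template
    return (t_pointing, t_sca)

def run_all(science_images, templates, nprocs=4):
    # Serial pre-pass over the unique (band, pointing, sca) science images.
    for band, pointing, sca in science_images:
        sky_subtract(band=band, pointing=pointing, sca=sca)
    # Only the per-template SFFT steps run in the pool.
    with Pool(processes=nprocs) as pool:
        return pool.map(process_template, templates)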
Did the above (sky subtraction now runs separately from the template pool), and it seems to be running much better now. A few notes on the modifications I made:

- I changed how the skysub_dir keyword argument to sfft_and_animate.sfft() is handled. This is because I was trying to use functools.partial to create the process pool, and it was misbehaving with the combination of positional arguments and more than one keyword argument. I should figure this out eventually and fix it properly (see the sketch below).
- --cpus-per-task is no longer tied to the number of templates; the two can now be specified separately.

Okay. Now running into an issue where the call to Customized_Packet.CP in SFFT eats 150 GB of CPU memory per process. I have discussed and debugged this extensively with @rknop. Working to move to NERSC and use the GPU backend for SFFT. This will involve refactoring, probably into three parts: a CPU preprocessing stage, a GPU stage for the SFFT subtraction itself, and CPUs again for post-processing.
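On the functools.partial note above, here's a generic illustration of a pattern that avoids the positional/keyword mixing (the worker and all the values are made up; this is not the real sfft() signature): let pool.map supply the template as the worker's first positional argument and freeze everything else by keyword.

# Generic illustration with a hypothetical worker and made-up values.
# pool.map passes each template as the first positional argument; every other
# argument is frozen by keyword, so partial never mixes positionals with the
# mapped item.
from functools import partial
from multiprocessing import Pool

def worker(template, ra, dec, band, skysub_dir=None, verbose=False):
    t_pointing, t_sca = template
    return (t_pointing, t_sca, ra, dec, band, skysub_dir, verbose)

if __name__ == "__main__":
    templates = [(11111, 1), (22222, 2)]  # made-up (pointing, sca) pairs
    partialfunc = partial(worker, ra=7.5, dec=-44.8, band="Y106",
                          skysub_dir="/tmp/skysub", verbose=True)
    with Pool(2) as pool:
        print(pool.map(partialfunc, templates))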
Notes from Rob:
On Perlmutter, you get something like 60 CPUs with each GPU (if memory serves; might be 30, or 32). If you're on a shared GPU node, then you want to request 1 GPU, 32 (I think) CPUs, and m/4 memory (where m is the memory of one node). If you're going for an exclusive GPU node, then you'll get 4 GPUs, which could be good, but management of that is more complicated.

Per https://docs.nersc.gov/systems/perlmutter/architecture/, the GPU nodes have 4 GPUs and 1 CPU with 64 cores, so ask for 1 GPU and 16 CPUs (I think that's 16 tasks per node or some such). They have 256 GB of RAM, so ask for 64 GB of memory. (That will limit how many tasks you can run; you'll have to figure out how much memory the CPU parts of your pipeline use.) See also https://docs.nersc.gov/jobs/policy/.
The plan is changing. Refactoring into separate files:
# Step 1: Get all images the object is in.
python -u get_object_instances.py "$sn"
# Step 2: Sky subtract, align images to be in DIA.
# WAIT FOR COMPLETION.
# Step 3: Get, align, save PSFs; cross-convolve.
srun python -u preprocess.py [arguments]
# WAIT FOR COMPLETION.
# Step 4: Differencing (GPU).
srun python -u sfftdiff.py [arguments]
# WAIT FOR COMPLETION.
# Step 5: Generate decorrelation kernel, apply to diff. image and science image, make stamps.
srun python -u postprocess.py [arguments]
The loop can easily be divided into however many chunks you want so it runs faster. There's no reason for this code to take 6 hours.
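For example (a sketch under the assumption that the work is driven by a list of science/template pairs; the job-array chunking below is not in the current code), a SLURM job array could hand each task its own slice of the loop:

# Sketch only: split the full list of pairs across SLURM array tasks, e.g.
# submit with `sbatch --array=0-7 job.sh` and each task processes ~1/8 of the
# pairs instead of the whole loop.
import os

def my_chunk(pairs):
    """Return the slice of `pairs` belonging to this SLURM array task."""
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", 0))
    n_tasks = int(os.environ.get("SLURM_ARRAY_TASK_COUNT", 1))
    return pairs[task_id::n_tasks]  # simple round-robin split

if __name__ == "__main__":
    all_pairs = [(sci, tmpl) for sci in range(6) for tmpl in range(3)]  # dummy data
    for sci, tmpl in my_chunk(all_pairs):
        print(f"would process science image {sci} against template {tmpl}")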