astropy / photutils

Astropy package for source detection and photometry. Maintainer: @larrybradley
https://photutils.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Excessive memory usage when using source grouping #1808

Open. keflavich opened this issue 2 months ago

keflavich commented 2 months ago

I have an MWE that fails reliably now. It's not all that minimal, but minimal enough, I hope.

import numpy as np
from photutils.psf import PSFPhotometry, IterativePSFPhotometry, SourceGrouper
from photutils.detection import DAOStarFinder
from photutils.background import LocalBackground

import scipy.ndimage
from astropy.io import fits
from astropy.modeling.fitting import LevMarLSQFitter

from webbpsf.utils import to_griddedpsfmodel

basepath = '/blue/adamginsburg/adamginsburg/jwst/brick/'

im1 = fits.open(f'{basepath}/analysis/MWE_example.fits')
obsdate = im1[0].header['DATE-OBS']
module = 'merged'
data = im1['SCI'].data[:400, :400]
err = im1['ERR'].data[:400, :400]

weight = err**-1
maxweight = np.percentile(weight[np.isfinite(weight)], 95)
minweight = np.percentile(weight[np.isfinite(weight)], 5)
badweight = np.percentile(weight[np.isfinite(weight)], 1)
weight[err < 1e-5] = 0
weight[np.isnan(weight)] = 0
# flag bad pixels based on the data, error, and weight arrays
bad = np.isnan(weight) | (data == 0) | np.isnan(data) | (weight == 0) | (err == 0) | (data < 1e-5)

mask = bad

# morphological opening then closing to remove isolated masked pixels and fill small holes
mask = scipy.ndimage.binary_dilation(scipy.ndimage.binary_erosion(mask, iterations=1), iterations=1)
mask = scipy.ndimage.binary_erosion(scipy.ndimage.binary_dilation(mask, iterations=1), iterations=1)

filtered_errest = np.nanmedian(err)

fwhm_pix = 2.302

daofind_tuned = DAOStarFinder(threshold=5 * filtered_errest,
                              fwhm=fwhm_pix, roundhi=1.0, roundlo=-1.0,
                              sharplo=0.30, sharphi=1.40,
                              exclude_border=True
                              )
grouper = SourceGrouper(2 * fwhm_pix)

oversample = 1
proposal_id = 1182
field = '004'
filtername = 'f444w'

blur_ = '_blur'
dao_psf_model = to_griddedpsfmodel(f'{basepath}/psfs/{filtername.upper()}_{proposal_id}_{field}_merged_PSFgrid_oversample{oversample}{blur_}.fits')
dao_psf_model.flux.min = 0

print("Initializing IterativePSFPhotometry")
phot_g_iter = IterativePSFPhotometry(finder=daofind_tuned,
                                     localbkg_estimator=LocalBackground(5, 15),
                                     grouper=grouper,
                                     psf_model=dao_psf_model,
                                     fitter=LevMarLSQFitter(),
                                     maxiters=5,
                                     fit_shape=(5, 5),
                                     aperture_radius=2 * fwhm_pix,
                                     progress_bar=True,
                                     xy_bounds=2,
                                     mode='all',
                                     )

print("Beginning iterations")
result_g_iter = phot_g_iter(data, mask=mask)

This is running on a 400x400 image. It runs out of memory on an 8 GB node. I'm rerunning it under memory_profiler on 16 GB, 32 GB, and 64 GB nodes to see what the peak usage is and whether it can complete at all.

I think the peak memory usage happens somewhere in the n'th iteration.
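
For reference, a lightweight way to record the peak resident set size without the full line-by-line profiler is the stdlib resource module. This is just a sketch wrapped around the call from the MWE; note that ru_maxrss is reported in KiB on Linux (bytes on macOS).

import resource

def log_peak_rss(label=''):
    # peak resident set size of this process so far (KiB on Linux)
    peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f'{label}: peak RSS = {peak_kib / 1024:.1f} MiB')

log_peak_rss('before photometry')
result_g_iter = phot_g_iter(data, mask=mask)
log_peak_rss('after photometry')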

keflavich commented 2 months ago

Here are the reports for the 8, 16, and 32 GB runs:

Initializing IterativePSFPhotometry
Beginning iterations
Fit source/group:  20%|██        | 402/2007 [00:08<00:37, 42.77it/s]/tmp/slurmd/job36932843/slurm_script: line 4: 152380 Killed                  /blue/adamginsburg/adamginsburg/miniconda3/envs/python310/bin/python -m memory_profiler /blue/adamginsburg/adamginsburg/jwst/brick/analysis/dao_iterative_memory_mwe.py
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=36932843.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

Initializing IterativePSFPhotometry
Beginning iterations
Fit source/group:  42%|████▏     | 847/2007 [00:56<06:04,  3.18it/s]/tmp/slurmd/job36934208/slurm_script: line 4: 3603512 Killed                  /blue/adamginsburg/adamginsburg/miniconda3/envs/python310/bin/python -m memory_profiler /blue/adamginsburg/adamginsburg/jwst/brick/analysis/dao_iterative_memory_mwe.py
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=36934208.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

Initializing IterativePSFPhotometry
Beginning iterations
Fit source/group:  90%|████████▉ | 1803/2007 [00:55<00:06, 33.04it/s]/tmp/slurmd/job36934209/slurm_script: line 4: 25232 Killed                  /blue/adamginsburg/adamginsburg/miniconda3/envs/python310/bin/python -m memory_profiler /blue/adamginsburg/adamginsburg/jwst/brick/analysis/dao_iterative_memory_mwe.py
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=36934209.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

and the 64 GB run is still going:

Initializing IterativePSFPhotometry
Beginning iterations
Fit source/group: 100%|██████████| 2007/2007 [01:02<00:00, 32.20it/s]
Add model sources: 100%|██████████| 3205/3205 [00:04<00:00, 761.94it/s]

keflavich commented 2 months ago

The 64 GB run eventually failed:

Initializing IterativePSFPhotometry
Beginning iterations
Fit source/group: 100%|██████████| 2007/2007 [01:02<00:00, 32.20it/s]
Add model sources: 100%|██████████| 3205/3205 [00:04<00:00, 761.94it/s]
Fit source/group: 100%|██████████| 1262/1262 [04:09<00:00,  5.07it/s]
Add model sources: 100%|██████████| 6761/6761 [00:09<00:00, 743.57it/s]
Fit source/group:   2%|▏         | 15/783 [00:59<50:28,  3.94s/it]
Traceback (most recent call last):
  File "/blue/adamginsburg/adamginsburg/miniconda3/envs/python310/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/blue/adamginsburg/adamginsburg/miniconda3/envs/python310/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/blue/adamginsburg/adamginsburg/miniconda3/envs/python310/lib/python3.10/site-packages/memory_profiler.py", line 1351, in <module>
    exec_with_profiler(script_filename, prof, args.backend, script_args)
  File "/blue/adamginsburg/adamginsburg/miniconda3/envs/python310/lib/python3.10/site-packages/memory_profiler.py", line 1252, in exec_with_profiler
    exec(compile(f.read(), filename, 'exec'), ns, ns)
  File "/blue/adamginsburg/adamginsburg/jwst/brick/analysis/dao_iterative_memory_mwe.py", line 69, in <module>
    result_g_iter = phot_g_iter(data, mask=mask)
  File "/blue/adamginsburg/adamginsburg/repos/photutils/photutils/psf/photometry.py", line 1944, in __call__
    new_tbl = self._psfphot(residual_data, mask=mask, error=error,
  File "/blue/adamginsburg/adamginsburg/repos/photutils/photutils/psf/photometry.py", line 1389, in __call__
    fit_params = self._fit_sources(data, init_params, error=error,
  File "/blue/adamginsburg/adamginsburg/repos/photutils/photutils/psf/photometry.py", line 1048, in _fit_sources
    yi, xi, cutout = self._define_fit_data(sources_, data, mask)
  File "/blue/adamginsburg/adamginsburg/repos/photutils/photutils/psf/photometry.py", line 893, in _define_fit_data
    raise ValueError(msg)
ValueError: Source at (55.492494305009046, 10.849163838754908) is completely masked. Remove the source from init_params or correct the input mask.

keflavich commented 1 month ago

That last failure may indicate a problem with the source parameter validation in IterativePSFPhotometry.
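
One possible workaround while that gets sorted out: pre-filter the detection table so that sources whose entire fit_shape region is masked never reach the fitter. A minimal sketch, assuming DAOStarFinder's xcentroid/ycentroid columns and the (5, 5) fit_shape from the MWE; drop_fully_masked_sources is a hypothetical helper, not a photutils function.

import numpy as np

def drop_fully_masked_sources(tbl, mask, fit_shape=(5, 5)):
    # Remove sources whose entire fit_shape cutout is masked, which is what
    # triggers the "completely masked" ValueError above.
    hy, hx = fit_shape[0] // 2, fit_shape[1] // 2
    keep = []
    for row in tbl:
        x = int(round(float(row['xcentroid'])))
        y = int(round(float(row['ycentroid'])))
        cutout = mask[max(y - hy, 0):y + hy + 1, max(x - hx, 0):x + hx + 1]
        keep.append(cutout.size > 0 and not cutout.all())
    return tbl[np.array(keep)]

This only helps when init_params is supplied directly; with IterativePSFPhotometry the finder runs internally on each residual image, so the equivalent check would need to live inside the iteration loop.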

larrybradley commented 1 month ago

The question is why a source landed on a completely masked region.
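
For what it's worth, the reported position can be checked against the input mask directly. A quick sketch, using the coordinates from the traceback and the (5, 5) fit_shape (assuming the reported order is (x, y)):

x, y = 55, 11  # rounded from (55.49, 10.85) in the ValueError message
print(mask[y - 2:y + 3, x - 2:x + 3])  # the 5x5 fit region; all True means fully masked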

keflavich commented 1 month ago

One explanation for the OOM is the compound models:

# Stress test: build an ever-growing compound model, similar to what source
# grouping does internally when fitting a group of sources simultaneously.
models = []
x = 1000000
from tqdm.auto import tqdm
for i in tqdm(range(x)):
    models.append(dao_psf_model.copy())
    if i == 0:
        psf_model = dao_psf_model
    else:
        psf_model += dao_psf_model

(example suggested by Larry on Slack)

gives me

  0%|          | 2303/1000000 [01:02<8:40:40, 31.94it/s]/tmp/slurmd/job36937239/slurm_script: line 4: 155531 Killed                  /blue/adamginsburg/adamginsburg/miniconda3/envs/python310/bin/python -m memory_profiler /blue/adamginsburg/adamginsburg/jwst/brick/analysis/dao_iterative_memory_mwe.py
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=36937239.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

keflavich commented 1 month ago

Confirmed that grouping causes the problem; without the grouper, 8 GB works:

Initializing PSFPhotometry (no groups)
Beginning iterations
Fit source/group: 100%|██████████| 3205/3205 [00:20<00:00, 155.58it/s]
WARNING: One or more fit(s) may not have converged. Please check the "flags" column in the output table. [photutils.psf.photometry]
Initializing PSFPhotometry
Beginning iterations
Fit source/group:  20%|█▉        | 400/2007 [00:08<00:38, 41.62it/s]/tmp/slurmd/job36937400/slurm_script: line 4: 2472349 Killed                  /blue/adamginsburg/adamginsburg/miniconda3/envs/python310/bin/python -m memory_profiler /blue/adamginsburg/adamginsburg/jwst/brick/analysis/dao_iterative_memory_mwe.py
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=36937400.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
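
For reference, the "no groups" run in the log above presumably corresponds to plain PSFPhotometry with the grouper dropped. A sketch with the other arguments kept as in the MWE:

phot_nogroup = PSFPhotometry(psf_model=dao_psf_model,
                             fit_shape=(5, 5),
                             finder=daofind_tuned,
                             grouper=None,  # the only relevant change: no source grouping
                             localbkg_estimator=LocalBackground(5, 15),
                             fitter=LevMarLSQFitter(),
                             aperture_radius=2 * fwhm_pix,
                             progress_bar=True,
                             xy_bounds=2,
                             )
result_nogroup = phot_nogroup(data, mask=mask)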

larrybradley commented 1 month ago

The issue is due to excessive memory used by compound Astropy models: https://github.com/astropy/astropy/issues/16701

SterlingYM commented 1 month ago

I was looking into this issue (it has been a problem for me as well: running PSF photometry on a 250x250 cutout with 100-400 sources often requires more than 16 GB, sometimes 32 GB, on our cluster) and noticed that the memory allocation (heap size) keeps increasing during a loop in which I repeatedly initialize and call IterativePSFPhotometry (as a wrapped function call).

Maybe I'm not understanding how Python memory management and garbage collection work, but I was expecting to see memory usage rise and then drop on each iteration.

Is this a system-specific behavior, the upstream Astropy issue, or because of my object structure? I am trying to find a temporary workaround for the memory issue while we wait for the upstream fix.

Not an MWE, but the structure is something like this (a small diagnostic sketch follows the code below):

class MyClass():
    def __init__(self, data, error):
        self.data = data
        self.error = error

    def do_photometry(self, **kwargs):
        # . . .
        # initialize SourceGrouper, LocalBackground, and DAOStarFinder here...
        self.do_some_stuff()
        self.do_more_stuff()
        # . . .

        # newly initialize an IterativePSFPhotometry object
        psf_iter = IterativePSFPhotometry(**some_kwargs)
        phot_result = psf_iter(self.data, error=self.error)

        # don't models and other large memory-eating objects get deleted upon return?
        return phot_result

if __name__ == "__main__":
    my_class = MyClass(data, error)

    # iterating over some kwargs to test photometry with different settings
    for kwargs in list_of_kwargs[:10]:
        my_class.do_photometry(**kwargs)
        # memory heap size keeps increasing during the whole loop
        # resident size drops once or twice, but generally keeps increasing too
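
A minimal diagnostic sketch for the question above, reusing my_class and list_of_kwargs from the structure shown: force a collection after each call and let tracemalloc report whether Python-level allocations from earlier iterations are actually retained. (As noted below, the peak itself comes from the compound models built during grouping, so this mainly helps tell retained memory apart from transient peaks.)

import gc
import tracemalloc

tracemalloc.start()
for kwargs in list_of_kwargs[:10]:
    my_class.do_photometry(**kwargs)
    gc.collect()  # collect any unreachable reference cycles
    current, peak = tracemalloc.get_traced_memory()
    print(f'traced memory: current = {current / 1e6:.1f} MB, peak = {peak / 1e6:.1f} MB')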

larrybradley commented 1 month ago

@SterlingYM The excessive memory issue is triggered when using source grouping. Source grouping creates a compound Astropy model. Every source in the group contributes part of the compound model so that the group can be fit simultaneously. If the number of sources in the group gets large, the compound Astropy model requires a huge amount of memory. When Astropy issue https://github.com/astropy/astropy/issues/16701 is fixed, this will no longer be a problem.

In the meantime, you can try limiting your group sizes by using a smaller SourceGrouper separation if you are running into this issue.
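
A quick way to gauge how a given separation translates into group sizes before running the full fit. This is a sketch reusing variables from the MWE above (daofind_tuned, data, mask, fwhm_pix); SourceGrouper returns an integer group id per source.

import numpy as np
from photutils.psf import SourceGrouper

sources = daofind_tuned(data, mask=mask)   # detection table from the MWE
xpos = np.array(sources['xcentroid'])
ypos = np.array(sources['ycentroid'])

for sep in (1.0 * fwhm_pix, 1.5 * fwhm_pix, 2.0 * fwhm_pix):
    group_ids = SourceGrouper(min_separation=sep)(xpos, ypos)
    sizes = np.bincount(group_ids)[1:]     # group ids start at 1
    print(f'sep = {sep:.2f} pix: {sizes.size} groups, largest = {sizes.max()}')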

larrybradley commented 1 month ago

I've changed the title of the issue to indicate that this is specifically triggered by source grouping.

SterlingYM commented 1 month ago

Adjusting group size helped with memory usage. Thank you for the suggestion!