astropy / photutils

Astropy package for source detection and photometry. Maintainer: @larrybradley
https://photutils.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Memory errors with refactored code #1580

Closed: keflavich closed this issue 3 months ago

keflavich commented 1 year ago

As noted in another thread, I'm consistently getting out-of-memory errors when running the new PSFPhotometry fitter.

My fitting runs have died at the following stages, as indicated by the progress bar:

Fit source/group:   6%|▋         | 11347/177215 [05:24<1:34:59, 29.10it/s]
Fit source/group:   5%|▍         | 11405/228909 [05:57<30:55:50,  1.95it/s]
Fit source/group:   4%|▍         | 11486/262107 [07:02<26:22:39,  2.64it/s]
Fit source/group:  11%|█         | 11379/102664 [06:45<2:06:09, 12.06it/s]
Fit source/group:   2%|▏         | 11396/591444 [06:51<8:59:34, 17.92it/s]

These are pretty consistent endpoints.

I suspect the problem is that fit_info is being stored in memory. IIRC, fit_info includes at least one, and maybe several, copies of the data. Can we minimize fit_info before storing it? I think only param_cov is used downstream?

Note that I have 256GB of memory allocated for these runs, which imo is a very large amount to dedicate to photometry of a single JWST field-of-view.

larrybradley commented 1 year ago

I also suspect that the fit_info dictionary is the cause. It doesn't store a copy of the input data, but it does store the output from the fitters, which includes things like the fit residual, Jacobian, etc. In general these should be small arrays (usually 5x5 is all that is needed for fitting since that is where most of the flux lies; the size is determined by the fit_shape keyword), but I can see how that can add up when you have ~200k stars!

I'll want to keep at least the fit residuals and the return status message. I'll remove the rest (perhaps as an option, since I think your use case is probably on the extreme end). Some people may want all of the fit_info details.
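
For reference, a quick way to see what a fitter actually keeps is to inspect fit_info after a small fit. The sketch below is an editorial illustration, not from this thread: it fits a plain 1D Gaussian with astropy's LevMarLSQFitter and prints the size of each fit_info entry; the array entries (fvec, fjac, param_cov, ...) are what accumulate when stored for every source.

    import numpy as np
    from astropy.modeling import models, fitting

    rng = np.random.default_rng(0)
    x = np.linspace(-5, 5, 200)
    y = 10 * np.exp(-0.5 * (x / 1.2)**2) + rng.normal(0, 0.1, x.size)

    fitter = fitting.LevMarLSQFitter()
    fit = fitter(models.Gaussian1D(amplitude=5, mean=0, stddev=1), x, y)

    # fit_info is the raw scipy.optimize.leastsq output stored by the fitter
    for key, val in fitter.fit_info.items():
        nbytes = val.nbytes if isinstance(val, np.ndarray) else 0
        print(f'{key}: {type(val).__name__}, {nbytes} bytes')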

larrybradley commented 1 year ago

Just curious -- what fit_shape are you using?

keflavich commented 1 year ago

11x11. If I switch to 5x5, I'd roughly expect to get to 4x more sources...

keflavich commented 1 year ago

...assuming only one footprint, of course, which is probably an underestimate

keflavich commented 1 year ago

Reducing fit_shape to 5,5 had no effect, which surprises me.

keflavich commented 1 year ago

I'm trying a hack, changing:

                fit_info = self.fitter.fit_info.copy()

to

                fit_info = {key: self.fitter.fit_info.get(key)
                            for key in
                            ('param_cov', 'fvec', 'fun', 'ierr', 'status')}

larrybradley commented 1 year ago

I did some testing, and I don't think the fit_info dict is the cause. I fit 15,000 stars (your failures were at <12,000 stars) with fit_shape = (11, 11), and the fit_results size is only 194 MB. The PSF photometry object total is 199 MB. The peak memory during the fitting was 7.7 GB. This was using an IntegratedGaussianPRF model, and I did not use grouping.
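
A rough way to reproduce this kind of measurement is sketched below (an editorial illustration): it builds a noise-only image with made-up star positions, runs PSFPhotometry with an IntegratedGaussianPRF, and reports the peak traced memory via tracemalloc. It assumes a photutils version from around this issue (~1.8/1.9), where init_params accepts x_init/y_init/flux_init columns.

    import tracemalloc
    import numpy as np
    from astropy.table import Table
    from photutils.psf import IntegratedGaussianPRF, PSFPhotometry

    rng = np.random.default_rng(42)
    data = rng.normal(0.0, 0.1, size=(512, 512))   # noise-only test image
    nstars = 500
    init = Table({'x_init': rng.uniform(20, 490, nstars),
                  'y_init': rng.uniform(20, 490, nstars),
                  'flux_init': np.full(nstars, 100.0)})

    psf_model = IntegratedGaussianPRF(sigma=1.5)
    phot = PSFPhotometry(psf_model, fit_shape=(11, 11))

    tracemalloc.start()
    result = phot(data, init_params=init)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f'peak traced memory during fitting: {peak / 1e6:.1f} MB')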

My next suspect is the PSF model. Are you using a GriddedPSFModel with very large (internal) PSF arrays and/or a large number of them?

larrybradley commented 1 year ago

Could you please send me your input PSF model?

keflavich commented 1 year ago

Yes, I'm using a webbpsf model. It can be reproduced with:

    import webbpsf

    obsdate = '2022-08-28'
    nrc = webbpsf.NIRCam()
    nrc.load_wss_opd_by_date(f'{obsdate}T00:00:00')
    nrc.filter = 'F405N'
    nrc.detector = 'NRCA5'
    grid = nrc.psf_grid(num_psfs=16, all_detectors=False, verbose=True, save=True)
    psf_model = grid

I think... I haven't tested this; in production, the obsdate and some other variables come from FITS headers

EDIT: tested, this works now.

larrybradley commented 1 year ago

Thanks. Your PSF model is ~20 MB, so 12,000 of them is ~233 GB (just for the PSF models, not the data, results, etc.). That seems to be the culprit. The code returns a copy of the fitted models, but it's copying the entire model. For the GriddedPSFModel that is unnecessary because the PSF grid is identical for each model. I can fix this.
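
As a back-of-the-envelope check of that arithmetic (an editorial sketch: grid here is the GriddedPSFModel built in the webbpsf snippet above, and the 12,000 figure simply mirrors where the runs died):

    grid_nbytes = grid.data.nbytes   # the stack of oversampled PSFs in the grid
    print(f'one model grid: {grid_nbytes / 1e6:.1f} MB')
    print(f'~12,000 independent copies: {12_000 * grid_nbytes / 1e9:.1f} GB')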

larrybradley commented 1 year ago

@keflavich #1581 should fix your memory issues with GriddedPSFModel. Let me know if you still have issues. I can trim the fit_results dict if that's the case.

keflavich commented 1 year ago

Thanks. Past 15k already, so it looks like an improvement.

keflavich commented 1 year ago

Hm, still died, but got a lot further:

Fit source/group:  32%|███▏      | 52828/162563 [25:39<26:37:05,  1.15it/s]

Any ideas for further workarounds? Splitting up the image sounds like a possible, but very annoying, way to get around this. Increasing memory isn't really practical.
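
(For reference, a tiling workaround could look roughly like the sketch below. It is only an illustration: phot stands in for a configured PSFPhotometry object, the tile size and overlap are placeholders, the x_fit/y_fit column names assume a recent photutils result table, and duplicate detections in the overlap regions would still need to be culled.)

    from astropy.table import vstack

    def photometry_in_tiles(data, phot, tile=2048, overlap=64):
        """Run `phot` on overlapping sub-images and stack the result tables."""
        results = []
        ny, nx = data.shape
        for y0 in range(0, ny, tile):
            for x0 in range(0, nx, tile):
                y1 = min(ny, y0 + tile + overlap)
                x1 = min(nx, x0 + tile + overlap)
                tbl = phot(data[y0:y1, x0:x1])
                if tbl is None or len(tbl) == 0:
                    continue
                # shift fitted positions back to full-image coordinates
                tbl['x_fit'] += x0
                tbl['y_fit'] += y0
                results.append(tbl)
        return vstack(results)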

keflavich commented 1 year ago

@larrybradley I'd recommend reopening this one; it's not fully solved.

larrybradley commented 1 year ago

Yes, I'm working on some improvements now.

keflavich commented 1 year ago

Thanks. I'll test 'em right away!

larrybradley commented 1 year ago

#1586 is another big reduction in memory for GriddedPSFModel. I have more ideas after that to further reduce memory, but I'll need to refactor a few things.

keflavich commented 1 year ago

OK, #1586 looks like it ran to completion, but then my code failed before I could check for sure because I was using get_residual_image instead of make_residual_image. #1558 has required significant revision to my production code.

keflavich commented 1 year ago

I had 2 of 3 filters work, and pretty fast! One still failed:

Model image:  77%|███████▋  | 201979/262107 [1:16:28<48:35, 20.63it/s]

Notably, this is at a later stage, so maybe this is solvable by other means

keflavich commented 1 year ago

ok, I thought they had completed, but it looks like all runs failed somewhere in the Model image stage, even when I gave more memory.

keflavich commented 1 year ago

Looking at the source code for make_model_image, I don't see any reason for it to run out of memory in that step. It looks like it's only allocating small amounts of memory temporarily; there are no plausible locations for a memory leak in that code.

Here's a record of my failures:

$ tail -n 5 *301997[123]*
==> web-cat-F182M-mrgrep-dao3019972.log <==
Fit source/group: 100%|██████████| 591444/591444 [5:09:48<00:00, 31.82it/s]
2023-07-15T02:59:07.502784: Done with BASIC photometry.  len(result)=591444 dt=18699.997509002686
2023-07-15T02:59:07.703571: len(result) = 591444, len(coords) = 591444, type(result)=<class 'astropy.table.table.QTable'>
Model image:  68%|██████▊   | 403940/591444 [1:14:47<49:36, 63.00it/s]/tmp/slurmd/job3019972/slurm_script: line 4: 76591 Killed                  /blue/adamginsburg/adamginsburg/miniconda3/envs/python39/bin/python /blue/adamginsburg/adamginsburg/jwst/brick/analysis/crowdsource_catalogs_long.py --filternames=F182M --modules=merged-reproject --daophot --skip-crowdsource
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=3019972.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

==> web-cat-F187N-mrgrep-dao3019971.log <==
2023-07-15T02:15:42.773270: Done with ITERATIVE photometry. len(result2)=208262  dt=8322.197668790817
2023-07-15T02:15:43.011038: len(result2) = 208262, len(coords) = 177215
Model image: 100%|██████████| 208262/208262 [06:07<00:00, 566.57it/s]
/blue/adamginsburg/adamginsburg/jwst/brick/analysis/crowdsource_catalogs_long.py:117: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`). Consider using `matplotlib.pyplot.close()`.
  pl.figure(figsize=(12,12))

==> web-cat-F212N-mrgrep-dao3019973.log <==
2023-07-15T00:03:02.336868: Done with diagnostics for BASIC photometry.  dt=8288.43006491661
2023-07-15T00:03:02.338916: About to do ITERATIVE photometry....
Fit source/group: 100%|██████████| 227112/227112 [1:39:17<00:00, 38.12it/s]
Model image:  75%|███████▍  | 170227/227112 [27:14<08:46, 108.10it/s]/tmp/slurmd/job3019973/slurm_script: line 4: 84399 Killed                  /blue/adamginsburg/adamginsburg/miniconda3/envs/python39/bin/python /blue/adamginsburg/adamginsburg/jwst/brick/analysis/crowdsource_catalogs_long.py --filternames=F212N --modules=merged-reproject --daophot --skip-crowdsource
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=3019973.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
login4.ufhpc /orange/adamginsburg/jwst/brick main$ tail -n 5 *30180[34][089]*
==> web-cat-F405N-mrgrep-dao3018039.log <==
2023-07-14T22:10:48.740898: Done with diagnostics for BASIC photometry.  dt=4295.998880624771
2023-07-14T22:10:48.743855: About to do ITERATIVE photometry....
Fit source/group: 100%|██████████| 161319/161319 [1:02:50<00:00, 42.78it/s]
Model image:  22%|██▏       | 35267/161319 [05:57<17:55, 117.16it/s]/tmp/slurmd/job3018039/slurm_script: line 4: 40615 Killed                  /blue/adamginsburg/adamginsburg/miniconda3/envs/python39/bin/python /blue/adamginsburg/adamginsburg/jwst/brick/analysis/crowdsource_catalogs_long.py --filternames=F405N --modules=merged-reproject --daophot --skip-crowdsource
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=3018039.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

==> web-cat-F410M-mrgrep-dao3018040.log <==
Fit source/group: 100%|██████████| 262107/262107 [1:25:50<00:00, 50.89it/s]
2023-07-14T22:26:15.041140: Done with BASIC photometry.  len(result)=262107 dt=5186.740335702896
2023-07-14T22:26:15.127437: len(result) = 262107, len(coords) = 262107, type(result)=<class 'astropy.table.table.QTable'>
Model image:  77%|███████▋  | 201988/262107 [29:20<10:40, 93.89it/s]/tmp/slurmd/job3018040/slurm_script: line 4: 64717 Killed                  /blue/adamginsburg/adamginsburg/miniconda3/envs/python39/bin/python /blue/adamginsburg/adamginsburg/jwst/brick/analysis/crowdsource_catalogs_long.py --filternames=F410M --modules=merged-reproject --daophot --skip-crowdsource
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=3018040.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

==> web-cat-F466N-mrgrep-dao3018038.log <==
2023-07-14T21:41:35.963349: Done with diagnostics for BASIC photometry.  dt=2544.2132999897003
2023-07-14T21:41:35.964734: About to do ITERATIVE photometry....
Fit source/group: 100%|██████████| 101505/101505 [31:38<00:00, 53.46it/s]
Model image:  96%|█████████▋| 97827/101505 [13:04<00:26, 138.45it/s]/tmp/slurmd/job3018038/slurm_script: line 4: 10632 Killed                  /blue/adamginsburg/adamginsburg/miniconda3/envs/python39/bin/python /blue/adamginsburg/adamginsburg/jwst/brick/analysis/crowdsource_catalogs_long.py --filternames=F466N --modules=merged-reproject --daophot --skip-crowdsource
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=3018038.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

For context, I'm first running a basic PSFPhotometry run and then an IterativePSFPhotometry run. So all of the runs except F182M had a successful PSFPhotometry run but then failed during IterativePSFPhotometry.

I was running IterativePSFPhotometry with:

    phot_ = IterativePSFPhotometry(finder=daofind_tuned,
                                   localbkg_estimator=LocalBackground(5, 25),
                                   psf_model=dao_psf_model,
                                   fitter=LevMarLSQFitter(),
                                   maxiters=2,
                                   fit_shape=(5, 5),
                                   aperture_radius=2*fwhm_pix,
                                   progress_bar=True)

so maybe I can shrink the background area a bit and see if it completes.

keflavich commented 1 year ago

ah, another data point: I was making the model image (residual image) with 11x11 patches, not 5x5.

larrybradley commented 1 year ago

I don't know how make_model_image could be causing memory issues either. The only additional memory it requires is essentially for the output image (plus small temporary cutouts for an index array). I think make_residual_image does require an additional temporary array, which I removed in #1604.
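
(For illustration, the memory behavior being described is roughly the schematic below: only the output image plus one small cutout and its index arrays are alive at a time. This is an editorial sketch, not photutils' actual make_model_image code; psf_models is assumed to be a list of fitted 2D models with x_0/y_0 parameters.)

    import numpy as np

    def assemble_model_image(shape, psf_models, model_shape=(11, 11)):
        """Schematic: add each source's evaluated cutout into one output array."""
        image = np.zeros(shape)
        hy, hx = model_shape[0] // 2, model_shape[1] // 2
        for m in psf_models:
            yc = int(round(m.y_0.value))
            xc = int(round(m.x_0.value))
            y0, y1 = max(0, yc - hy), min(shape[0], yc + hy + 1)
            x0, x1 = max(0, xc - hx), min(shape[1], xc + hx + 1)
            yy, xx = np.mgrid[y0:y1, x0:x1]    # small temporary index arrays
            image[y0:y1, x0:x1] += m(xx, yy)   # evaluate the model in place
            # the cutout and index arrays go out of scope on each iteration
        return image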

I also further reduced the memory footprint of PSFPhotometry with #1603, but that change should be minor. The models shouldn't be an issue after #1586 (200,000 models ~ 2.3 GB).

Are you using source grouping with very large groups? I'm wondering if that could be an issue. Large groups should be avoided because they require fitting a very large multi-dimensional parameter space (which can be slow, error prone, and probably memory intensive).
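
(A quick way to check for oversized groups, sketched here as an editorial aside: it assumes photutils' SourceGrouper and arrays x, y of detected source positions, and min_separation is a placeholder value.)

    import numpy as np
    from photutils.psf import SourceGrouper

    grouper = SourceGrouper(min_separation=10)   # pixels; placeholder value
    group_ids = grouper(x, y)                    # x, y: detected source positions
    _, counts = np.unique(group_ids, return_counts=True)
    nmax = counts.max()
    print(f'largest group: {nmax} sources '
          f'(~{3 * nmax} free parameters for a 3-parameter PSF model)')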

keflavich commented 1 year ago

No, I disabled the grouper, so it's not source grouping.

I'll see if this works better now, post #1604.

larrybradley commented 3 months ago

@keflavich I think this particular issue was fixed a while ago. As you've reported in #1808, the new memory issue is due to the creation of Astropy compound models that are needed for the source grouping.