I also suspect that the fit_info dictionary is the cause. It doesn't store a copy of the input data, but it does store the output from the fitters, which includes things like the fit residual, Jacobian, etc. In general these should be small arrays (usually 5x5 is all that is needed for fitting, since that is where most of the flux lies; the size is determined by the fit_shape keyword), but I can see how that can add up when you have ~200k stars!
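(A rough way to gauge the per-source cost, as a toy single-source fit with a stand-alone LevMarLSQFitter rather than the photutils internals; the array sizes scale with fit_shape:)

```python
import numpy as np
from astropy.modeling import models
from astropy.modeling.fitting import LevMarLSQFitter

# Toy single-source fit on an 11x11 cutout to see how big the fitter's
# fit_info arrays are; scale by the number of stars for a rough total.
yy, xx = np.mgrid[:11, :11]
truth = models.Gaussian2D(amplitude=100, x_mean=5, y_mean=5,
                          x_stddev=1.5, y_stddev=1.5)
data = truth(xx, yy) + np.random.normal(0, 1, xx.shape)

fitter = LevMarLSQFitter()
init = models.Gaussian2D(amplitude=90, x_mean=5.2, y_mean=4.8,
                         x_stddev=1.0, y_stddev=1.0)
fitted = fitter(init, xx, yy, data)

nbytes = sum(v.nbytes for v in fitter.fit_info.values()
             if isinstance(v, np.ndarray))
print(f"fit_info arrays: {nbytes} bytes per source")
```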
I'll want to keep at least the fit residuals and the return status message. I'll remove the rest (perhaps as an option, since I think your use case is probably on the extreme end); some people may want all the fit info details.

Just curious -- what fit_shape are you using?
11x11. If I switch to 5x5, I'd roughly expect to get through ~4x more sources...
...assuming only one footprint per source, of course, which is probably an underestimate.
Reducing fit_shape to (5, 5) had no effect, which surprises me.
I'm trying with a hack, changing:
```python
fit_info = self.fitter.fit_info.copy()
```
to
```python
fit_info = {key: self.fitter.fit_info.get(key)
            for key in ('param_cov', 'fvec', 'fun', 'ierr', 'status')}
```
I did some testing, and I don't think the fit_info dict is the cause. I fit 15,000 stars (your failures were at <12,000 stars) with fit_shape = (11, 11), and the fit_results size is only 194 MB. The PSF phot object total is 199 MB. The peak memory during the fitting was 7.7 GB. This was using an IntegratedGaussianPRF model, and I did not use grouping.
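(For anyone who wants to reproduce this kind of measurement, here is a scaled-down sketch using a toy IntegratedGaussianPRF field; pympler is a third-party package, and the numbers are only meaningful once you scale the source count up:)

```python
import numpy as np
import tracemalloc
from astropy.table import Table
from photutils.psf import PSFPhotometry, IntegratedGaussianPRF
from pympler.asizeof import asizeof  # third-party; recursively sizes objects

# Tiny synthetic field (a handful of sources instead of 15,000) just to show
# how the memory numbers can be measured.
rng = np.random.default_rng(1)
yy, xx = np.mgrid[:101, :101]
positions = rng.uniform(10, 90, size=(20, 2))
data = sum(IntegratedGaussianPRF(flux=100, x_0=x, y_0=y, sigma=1.5)(xx, yy)
           for x, y in positions) + rng.normal(0, 0.1, xx.shape)
init = Table({'x': positions[:, 0], 'y': positions[:, 1]})

psfphot = PSFPhotometry(IntegratedGaussianPRF(sigma=1.5),
                        fit_shape=(11, 11), aperture_radius=5)
tracemalloc.start()
tbl = psfphot(data, init_params=init)
_, peak = tracemalloc.get_traced_memory()   # approximate Python-side peak
tracemalloc.stop()

print(f"peak during fitting: {peak / 1e6:.1f} MB")
print(f"fit_results:         {asizeof(psfphot.fit_results) / 1e6:.2f} MB")
print(f"whole phot object:   {asizeof(psfphot) / 1e6:.2f} MB")
```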
My next suspect is the PSF model. Are you using a GriddedPSFModel with very large (internal) PSF arrays and/or a large number of them? Could you please send me your input PSF model?
yes, I'm using a webbpsf model. Can be reproduced with:
```python
import webbpsf

obsdate = '2022-08-28'
nrc = webbpsf.NIRCam()
nrc.load_wss_opd_by_date(f'{obsdate}T00:00:00')
nrc.filter = 'F405N'
nrc.detector = 'NRCA5'
grid = nrc.psf_grid(num_psfs=16, all_detectors=False, verbose=True, save=True)
psf_model = grid
```
I think... I haven't tested this; in production, the obsdate and some other variables come from FITS headers.

EDIT: tested, this works now.
Thanks. Your PSF model is ~20 MB, so 12,000 of them is ~233 GB (just for the PSF models, not the data, results, etc.). That seems to be the culprit. The code returns a copy of the fit models, but it's copying the entire model. For the GriddedPSFModel that is unnecessary, because the PSF grid is identical for each model. I can fix this.
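(The per-model cost can be checked with something like grid.data.nbytes on the webbpsf grid above. The snippet below is only a toy illustration of why sharing the grid across per-source copies removes that cost, not the actual photutils change:)

```python
import numpy as np

# Toy illustration (not photutils code): per-source "copies" share one large
# PSF grid by reference and carry only their own small fit parameters.
class TinySharedPSF:
    def __init__(self, psf_grid, x_0=0.0, y_0=0.0, flux=1.0):
        self.psf_grid = psf_grid                  # big array, never duplicated
        self.x_0, self.y_0, self.flux = x_0, y_0, flux

    def copy(self):
        # copy only the cheap per-source parameters
        return TinySharedPSF(self.psf_grid, self.x_0, self.y_0, self.flux)

big_grid = np.zeros((16, 512, 512))               # stand-in for a ~20 MB grid
fitted = [TinySharedPSF(big_grid, x_0=i, y_0=i).copy() for i in range(12_000)]
# all 12,000 "fit models" reference the same big_grid, so memory stays flat
```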
@keflavich #1581 should fix your memory issues with GriddedPSFModel. Let me know if you still have issues. I can trim the fit_results dict if that's the case.
Thanks. Past 15k already, so it looks like an improvement.
Hm, still died, but got a lot further:
```
Fit source/group: 32%|███▏ | 52828/162563 [25:39<26:37:05, 1.15it/s]
```
Any idea for further workarounds? Splitting up the image sounds like a possible, but very annoying, way to get around this. Increasing memory isn't really practical.
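Roughly, the tiling route would look like the sketch below (untested; psfphot and data are placeholders for my configured photometry object and the mosaic, and edge sources would still need overlap handling):

```python
import numpy as np

# Untested sketch of the "split up the image" workaround: run the same
# configured photometry object on horizontal strips of the mosaic.
# Sources near strip edges would need overlapping strips plus deduplication,
# which is the annoying part.
def run_in_strips(psfphot, data, nstrips=8):
    results = []
    for rows in np.array_split(np.arange(data.shape[0]), nstrips):
        strip = data[rows[0]:rows[-1] + 1, :]
        tbl = psfphot(strip)
        tbl['y_fit'] += rows[0]          # shift back to mosaic coordinates
        results.append(tbl)
    return results
```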
@larrybradley I'd recommend reopening this one; it's not fully solved.
Yes, I'm working on some improvements now.
Thanks. I'll test 'em right away!
#1586 reduces memory further for GriddedPSFModel. I have more ideas after that to further reduce memory, but I'll need to refactor a few things.

OK, #1586 looks like it ran to completion, but then my code failed before I could check for sure, because I was using get_residual_image instead of make_residual_image. #1558 has required significant revision to my production code.
I had 2 of 3 filters work, and pretty fast! One still failed:
```
Model image: 77%|███████▋ | 201979/262107 [1:16:28<48:35, 20.63it/s]
```
Notably, this is at a later stage, so maybe this is solvable by other means.

ok, I thought they had completed, but it looks like all runs failed somewhere in the Model image stage, even when I gave more memory.
Looking at the source code for make_model_image, I don't see any reason for it to run out of memory in that step. It looks like it's only allocating small amounts of memory temporarily; there are no plausible locations for a memory leak in that code.
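As I read it, the loop boils down to this pattern (a toy paraphrase with a Gaussian stand-in, not the actual photutils source):

```python
import numpy as np
from astropy.modeling.models import Gaussian2D

# Toy paraphrase of the model-image loop: one full-size output array, then a
# small fixed-size cutout evaluated and added in place for each source.
shape = (200, 200)
model_image = np.zeros(shape)
rng = np.random.default_rng(0)
xy = rng.uniform(10, 190, size=(500, 2))          # fake fitted (x, y) positions

for x0, y0 in xy:
    iy, ix = int(y0), int(x0)
    slc = (slice(max(iy - 5, 0), iy + 6),          # 11x11 patch, clipped at edges
           slice(max(ix - 5, 0), ix + 6))
    yy, xx = np.mgrid[slc]
    model_image[slc] += Gaussian2D(amplitude=1.0, x_mean=x0, y_mean=y0,
                                   x_stddev=1.5, y_stddev=1.5)(xx, yy)
# peak extra memory is just model_image plus one 11x11 cutout at a time
```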
Here's a record of my failures:
```
$ tail -n 5 *301997[123]*
==> web-cat-F182M-mrgrep-dao3019972.log <==
Fit source/group: 100%|██████████| 591444/591444 [5:09:48<00:00, 31.82it/s]
2023-07-15T02:59:07.502784: Done with BASIC photometry. len(result)=591444 dt=18699.997509002686
2023-07-15T02:59:07.703571: len(result) = 591444, len(coords) = 591444, type(result)=<class 'astropy.table.table.QTable'>
Model image: 68%|██████▊ | 403940/591444 [1:14:47<49:36, 63.00it/s]/tmp/slurmd/job3019972/slurm_script: line 4: 76591 Killed /blue/adamginsburg/adamginsburg/miniconda3/envs/python39/bin/python /blue/adamginsburg/adamginsburg/jwst/brick/analysis/crowdsource_catalogs_long.py --filternames=F182M --modules=merged-reproject --daophot --skip-crowdsource
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=3019972.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
==> web-cat-F187N-mrgrep-dao3019971.log <==
2023-07-15T02:15:42.773270: Done with ITERATIVE photometry. len(result2)=208262 dt=8322.197668790817
2023-07-15T02:15:43.011038: len(result2) = 208262, len(coords) = 177215
Model image: 100%|██████████| 208262/208262 [06:07<00:00, 566.57it/s]
/blue/adamginsburg/adamginsburg/jwst/brick/analysis/crowdsource_catalogs_long.py:117: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`). Consider using `matplotlib.pyplot.close()`.
pl.figure(figsize=(12,12))
==> web-cat-F212N-mrgrep-dao3019973.log <==
2023-07-15T00:03:02.336868: Done with diagnostics for BASIC photometry. dt=8288.43006491661
2023-07-15T00:03:02.338916: About to do ITERATIVE photometry....
Fit source/group: 100%|██████████| 227112/227112 [1:39:17<00:00, 38.12it/s]
Model image: 75%|███████▍ | 170227/227112 [27:14<08:46, 108.10it/s]/tmp/slurmd/job3019973/slurm_script: line 4: 84399 Killed /blue/adamginsburg/adamginsburg/miniconda3/envs/python39/bin/python /blue/adamginsburg/adamginsburg/jwst/brick/analysis/crowdsource_catalogs_long.py --filternames=F212N --modules=merged-reproject --daophot --skip-crowdsource
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=3019973.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
login4.ufhpc /orange/adamginsburg/jwst/brick main$ tail -n 5 *30180[34][089]*
==> web-cat-F405N-mrgrep-dao3018039.log <==
2023-07-14T22:10:48.740898: Done with diagnostics for BASIC photometry. dt=4295.998880624771
2023-07-14T22:10:48.743855: About to do ITERATIVE photometry....
Fit source/group: 100%|██████████| 161319/161319 [1:02:50<00:00, 42.78it/s]
Model image: 22%|██▏ | 35267/161319 [05:57<17:55, 117.16it/s]/tmp/slurmd/job3018039/slurm_script: line 4: 40615 Killed /blue/adamginsburg/adamginsburg/miniconda3/envs/python39/bin/python /blue/adamginsburg/adamginsburg/jwst/brick/analysis/crowdsource_catalogs_long.py --filternames=F405N --modules=merged-reproject --daophot --skip-crowdsource
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=3018039.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
==> web-cat-F410M-mrgrep-dao3018040.log <==
Fit source/group: 100%|██████████| 262107/262107 [1:25:50<00:00, 50.89it/s]
2023-07-14T22:26:15.041140: Done with BASIC photometry. len(result)=262107 dt=5186.740335702896
2023-07-14T22:26:15.127437: len(result) = 262107, len(coords) = 262107, type(result)=<class 'astropy.table.table.QTable'>
Model image: 77%|███████▋ | 201988/262107 [29:20<10:40, 93.89it/s]/tmp/slurmd/job3018040/slurm_script: line 4: 64717 Killed /blue/adamginsburg/adamginsburg/miniconda3/envs/python39/bin/python /blue/adamginsburg/adamginsburg/jwst/brick/analysis/crowdsource_catalogs_long.py --filternames=F410M --modules=merged-reproject --daophot --skip-crowdsource
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=3018040.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
==> web-cat-F466N-mrgrep-dao3018038.log <==
2023-07-14T21:41:35.963349: Done with diagnostics for BASIC photometry. dt=2544.2132999897003
2023-07-14T21:41:35.964734: About to do ITERATIVE photometry....
Fit source/group: 100%|██████████| 101505/101505 [31:38<00:00, 53.46it/s]
Model image: 96%|█████████▋| 97827/101505 [13:04<00:26, 138.45it/s]/tmp/slurmd/job3018038/slurm_script: line 4: 10632 Killed /blue/adamginsburg/adamginsburg/miniconda3/envs/python39/bin/python /blue/adamginsburg/adamginsburg/jwst/brick/analysis/crowdsource_catalogs_long.py --filternames=F466N --modules=merged-reproject --daophot --skip-crowdsource
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=3018038.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
```
For context, I'm first running a basic PSFPhotometry run, then subsequently an IterativePSFPhotometry run. So, all of the runs except F182M had a successful PSFPhotometry run, but then failed during IterativePSFPhotometry.
I was running IterativePSFPhotometry with:
```python
phot_ = IterativePSFPhotometry(finder=daofind_tuned,
                               localbkg_estimator=LocalBackground(5, 25),
                               psf_model=dao_psf_model,
                               fitter=LevMarLSQFitter(),
                               maxiters=2,
                               fit_shape=(5, 5),
                               aperture_radius=2*fwhm_pix,
                               progress_bar=True)
```
so maybe I can shrink the background area a bit and see if it completes.
ah, another data point: I was making the model image (residual image) with 11x11 patches, not 5x5.
I don't know how make_model_image is causing memory issues either. The only additional memory it requires is essentially for the output image (plus small temporary cutouts for an index array). I think make_residual_image does require an additional temporary array, which I removed in #1604.

I also further reduced the memory footprint of PSFPhotometry with #1603, but that should be minor. The models shouldn't be an issue after #1586 (200,000 models ~ 2.3 GB).
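(A toy illustration of the temporary-array point, not the photutils source:)

```python
import numpy as np

# Building a separate full-size model image and then subtracting keeps two
# large arrays alive at once, whereas subtracting each source's small cutout
# from a copy of the data needs only one extra full-size array.
data = np.random.default_rng(0).normal(size=(2048, 2048))

# approach A: two extra full-size arrays live at once
model_image = np.zeros_like(data)
# ... (fill model_image source by source) ...
residual_a = data - model_image

# approach B: only one extra full-size array
residual_b = data.copy()
# ... (subtract each source's small cutout from residual_b in place) ...
```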
Are you using source grouping, and if so, do you have very large groups? I'm wondering if that could be an issue. Large groups should be avoided because they require fitting a very large multi-dimensional parameter space, which can be slow, error-prone, and probably memory intensive.
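(For reference, a rough sketch of the grouping knob using a simple Gaussian PRF; parameter names as in the photutils docs:)

```python
from photutils.psf import IntegratedGaussianPRF, PSFPhotometry, SourceGrouper

# Sources closer than min_separation are fit simultaneously as one group, so
# a crowded field can produce very large groups; grouper=None (the default)
# disables grouping entirely.
psf = IntegratedGaussianPRF(sigma=1.5)
phot_grouped = PSFPhotometry(psf, fit_shape=(5, 5), aperture_radius=5,
                             grouper=SourceGrouper(min_separation=10))
phot_ungrouped = PSFPhotometry(psf, fit_shape=(5, 5), aperture_radius=5,
                               grouper=None)
```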
No, I disabled the grouper, so it's not source grouping.
I'll see if this works better now, post #1604.
@keflavich I think this particular issue was fixed a while ago. As you've reported in #1808, the new memory issue is due to the creation of the Astropy compound models that are needed for the source grouping.
As noted in another thread, I'm consistently getting out-of-memory errors when running the new PSFPhotometry fitter. My fitting runs have died at the following stages, as ID'd by the progress bar:

These are pretty consistent endpoints.

I suspect the problem is that fit_info is being stored in memory. IIRC, fit_info includes at least one, and maybe several, copies of the data. Can we minimize the fit_info before storing it? I think only param_cov is used downstream?

Note that I have 256 GB of memory allocated for these runs, which IMO is a very large amount to dedicate to photometry of a single JWST field of view.