astropy / photutils

Astropy package for source detection and photometry. Maintainer: @larrybradley
https://photutils.readthedocs.io
BSD 3-Clause "New" or "Revised" License

PSFPhotometry classes are really slow #631

Closed Gabriel-p closed 9 months ago

Gabriel-p commented 6 years ago

I'm testing the DAOPhotPSFPhotometry class with a single iteration, which, as far as I understand, should be equivalent to a single run of the IRAF allstar task.

Even with fixed coordinates (obtained from a previous run of the star finder) in the IntegratedGaussianPRF psf_model I use, DAOPhotPSFPhotometry takes around 10 min to complete for ~900 stars on a modest laptop. The allstar task, on the other hand, processing the same number of stars, takes just a few seconds.
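
A minimal sketch of that kind of setup (parameter values are illustrative; data, xcen and ycen stand in for the image and the positions from the earlier star-finder run):

    from astropy.table import Table
    from photutils.psf import DAOPhotPSFPhotometry, IntegratedGaussianPRF

    psf_model = IntegratedGaussianPRF(sigma=2.0)
    psf_model.x_0.fixed = True   # hold the previously found positions fixed
    psf_model.y_0.fixed = True

    phot = DAOPhotPSFPhotometry(crit_separation=8.0, threshold=50.0, fwhm=4.7,
                                psf_model=psf_model, fitshape=(11, 11),
                                niters=1, aperture_radius=5.0)

    init = Table({'x_0': xcen, 'y_0': ycen})   # positions from the star finder
    result = phot(image=data, init_guesses=init)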

Is this expected or is there something wrong with my parameter settings?

eteq commented 6 years ago

I expect it to be a bit slower since we haven't really done an optimizing run (and the DAOPHOT fitter is optimized for that particular problem in a way the scipy fitters are not), but that does seem particularly slow to me. Could you have a very large cutout area so that it's doing a lot of simultaneous fits instead of separating out into many individual groups? That's what I've generally seen as the biggest source of slowdown...

Gabriel-p commented 6 years ago

Yes, I would expect Python to be a bit slower too, but this is a lot slower. I tried again for ~7000 sources: DAOPhotPSFPhotometry took almost 35 min to complete, while allstar again finished in just a few seconds (definitely less than a minute).

The number of groups (group_id column) is ~6600, so I don't think this is the reason?
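
A quick way to check that, assuming result is the returned photometry table with its group_id column:

    import numpy as np

    group_ids, counts = np.unique(result['group_id'], return_counts=True)
    print(len(group_ids), counts.max())   # number of groups and size of the largest group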

larrybradley commented 5 years ago

@Gabriel-p Can you please try this again with astropy 3.1.2? There was a bug with astropy multidimensional compound models that caused massive slowness. It was fixed in https://github.com/astropy/astropy/pull/8349.

Gabriel-p commented 5 years ago

@larrybradley honestly, I won't have time to check for a while. I have to go back to some old files I was processing almost 2 years ago, and re-check the code. I promise I will check it as soon as I can.

pllim commented 5 years ago

I am seeing performance issues as well when I run the BasicPSFPhotometry step in https://github.com/spacetelescope/da5-notebooks/blob/master/notebooks/psf_grid_simulation_photometry/GriddedPSFModel_simulation_and_photometry.ipynb . I was using a 512x512 image with 25 stars and a fitshape of (101, 101). Decreasing fitshape did not help much as far as performance is concerned. That step took about 10 seconds to run.

Some version info:

Here is the output from %prun tbl = phot(data, init_guesses=init_tbl) from the notebook (using the smaller array I mentioned, not the 2k-by-2k one shown in that notebook):

prun_phot.txt
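
For reference, an equivalent way to capture the same profile outside the notebook would be something like this with cProfile (phot, data, and init_tbl as defined in that notebook step):

    import cProfile
    import pstats

    cProfile.run('tbl = phot(data, init_guesses=init_tbl)', 'prun_phot.prof')
    pstats.Stats('prun_phot.prof').sort_stats('cumulative').print_stats(20)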

p.s. Perhaps as per https://github.com/astropy/astropy/pull/7945#issuecomment-470536377 , having that PR in would speed things up, as I see deepcopy mentioned pretty high up the list in the prun output.

larrybradley commented 5 years ago

Thanks, @pllim!

bsipocz commented 5 years ago

FYI: the tbl = phot(data, init_guesses=init_tbl) step from that notebook hasn't gotten any quicker after the modeling refactor (though the PR linked above hasn't been merged yet).

versions used:

In [25]: print(numpy.__version__, astropy.__version__, photutils.__version__)
1.18.0.dev0+ea965e4 4.0.dev25533 0.7.dev3836

pllim commented 5 years ago

@bsipocz , did you try to profile it? Is the bottleneck at least different now, after the refactoring by @perrygreenfield?

bsipocz commented 5 years ago

Nope, I didn't dig deep enough (nor did I save the reports), but at a quick glance the output looked very similar to what I got with 3.2.1. Master ran for ~17 s while 3.2.1 took ~16 s.

larrybradley commented 5 years ago

CC: @Onoddil

fivestars95 commented 3 years ago

Hello, has this photutils PSF photometry slowness issue been solved or improved? Or have there been any other updates?

larrybradley commented 3 years ago

@fivestars95 Not yet, but it's something I hope to have time to look at in the coming months.

Onoddil commented 3 years ago

It appears I completely forgot to actually reply to this thread when I was assigned this issue to investigate way back, sorry!

I actually did run some profiling -- at least for zeroth-order information on where the problem might lie -- so I've dug out my old scripts, updated a few things, and thought I'd just (finally) write up the results.

I've put together a fairly basic test: generate a few hundred IntegratedGaussianPRF sources on a ~1000x1000 pixel array via make_gaussian_prf_sources_image, add some background counts, add Poisson noise with make_noise_image, and run a pipeline very similar to the one in the documentation: find_peaks to generate a "real world" input data table, MMMBackground to get the background counts to set threshold, then call DAOPhotPSFPhotometry (as per the original comment on this issue, although that's really just IterativelySubtractedPSFPhotometry; I also kept niters=3 by default, instead of limiting it to a single loop, but I'm not sure if it needs all of them).
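
The rough shape of that test (a condensed sketch, not the attached script; source counts, sigma, and threshold values are illustrative, and the find_peaks input-table step is omitted):

    import numpy as np
    from astropy.table import Table
    from photutils.background import MMMBackground
    from photutils.datasets import make_gaussian_prf_sources_image, make_noise_image
    from photutils.psf import DAOPhotPSFPhotometry, IntegratedGaussianPRF

    shape = (1000, 1000)
    nsrc = 300
    rng = np.random.default_rng(42)
    sources = Table({'amplitude': rng.uniform(500, 5000, nsrc),
                     'x_0': rng.uniform(10, shape[1] - 10, nsrc),
                     'y_0': rng.uniform(10, shape[0] - 10, nsrc),
                     'sigma': np.full(nsrc, 2.0)})

    # simulated frame: PRF sources plus a Poisson background of ~100 counts
    image = (make_gaussian_prf_sources_image(shape, sources)
             + make_noise_image(shape, distribution='poisson', mean=100.0))

    bkg = MMMBackground()(image)
    phot = DAOPhotPSFPhotometry(crit_separation=10.0, threshold=bkg + 10.0, fwhm=4.7,
                                psf_model=IntegratedGaussianPRF(sigma=2.0),
                                fitshape=(11, 11), aperture_radius=5.0)  # niters defaults to 3
    result = phot(image=image)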

For no particular reason I've run 5 loops of two different versions: one in which both __init__ and __call__ are run inside the loop, and one where the photometry is initialised outside the loop and just called Nloop times. This doesn't seem to make any difference, but I just haven't bothered changing it from when I tested it before; the runtime is basically entirely within the call.

I've profiled the calls with line_profiler, and attach the results at the bottom of the post.
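
The profiling setup was roughly along these lines (a condensed sketch, not the attached script; phot and image are the objects from the sketch above, and the registered functions varied per run):

    from line_profiler import LineProfiler
    from photutils.psf import BasicPSFPhotometry, IterativelySubtractedPSFPhotometry

    def run_photometry():
        return phot(image)

    lp = LineProfiler()
    lp.add_function(IterativelySubtractedPSFPhotometry._do_photometry)
    lp.add_function(BasicPSFPhotometry.nstar)
    result = lp(run_photometry)()   # wrap the call and execute it under the profiler
    lp.print_stats()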

The end result is that DAOPhotPSFPhotometry via IterativelySubtractedPSFPhotometry, with no init_guesses supplied, subtracts a background and then calls self._do_photometry. The background subtraction takes ~1% of the (profiled) runtime, so we go a step deeper into _do_photometry. Here we see two calls to self.finder (in this instance DAOStarFinder), taking 3.5% of the runtime within _do_photometry between them, about 1% of the runtime for self.group_maker (DAOGroup here), and 95% of the runtime for super().nstar.

Once again, following the overwhelming majority of the runtime, we can profile nstar (this time from BasicPSFPhotometry) and find 2% of THIS runtime (still 94% of the total __call__ time) taken by get_grouped_psf_model, 3% by self._model_params2table, 2.5% by result_tab = vstack..., 5% by unc_tab = vstack..., and, the two major contributors, 32% by self.fitter and 53.5% by subtract_psf.

I can't speak to the fitter, which is the default LevMarLSQFitter, but more than half the runtime is from subtract_psf. Here, unfortunately, we hit a slightly less obvious culprit; subtract_psf's runtime is split across several lines within its call: 2.5% from subbeddata = add_array..., 4% from psf.copy(), 28% from indices = np.indices(data.shape), 21.5% from subbeddata = data.copy(), and 21.5% each from y = extract_array... and x = extract_array...

EDIT: with scipy.__version__ of 1.5.2, the astropy LevMarLSQFitter calls spend 81% of their time in scipy.optimize.leastsq, with 7% in _validate_model, 2% in _model_to_fit_params, 3.5% in _fitter_to_model_params, and 5.5% in sum_sqrs = np.sum.... The trail for that ~third of the runtime goes dead for me there, because I don't particularly understand the scipy MINPACK wrappers and figure the line profiling will break down at this point. I also assume that, being wrappers for low-level language functions, these calls are fairly well optimised already.

So it seems that half of the runtime is from the subtraction of PSFs to form the next iteration's residual image, but split four ways across four calls.

It's printed in the script outputs, but for completeness, the versions of a few important packages are: photutils 1.1.0.dev127+g2bba1caa, astropy 4.0.2, numpy 1.19.2; photutils is just the master branch up to #1140 being merged.

Attached (as .txt because it apparently won't support .py) are the script I wrote to generate/fit test data, the output of the line profiling, and the print outputs from the script itself to verify the timings and whatnot: psf_photometry_speed.txt speed_results.txt script_outputs.txt

pllim commented 3 years ago

add_array and extract_array are from astropy.nddata.utils. 😬

Onoddil commented 3 years ago

Ah thanks, I forgot those were astropy functions. Running a quick profile on those two as well:

extract_array (21.5% x2 of subtract_psf): 12% the isscalar(shape) check, 2.5% the isscalar(position) check, 2% extracted_array = array_large[large_slices], 1.5% if (extracted_array.shape != shape) and (mode == 'partial'):, and 78.5% overlap_slices (perhaps unsurprisingly).

add_array (2.5% runtime): 7% the various checks on large_shape and small_shape within the if statement, 75% overlap_slices, and 15% array_large[large_slices] += array_small[small_slices].

So those are mostly dominated by the overlap_slices calculation.
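
For context, a small sketch of what those two helpers do (illustrative only, not the astropy source): overlap_slices maps a cutout shape and position onto matching slices into both the large array and the cutout.

    import numpy as np
    from astropy.nddata.utils import overlap_slices

    large = np.zeros((1000, 1000))
    small_shape = (11, 11)
    position = (450.3, 512.7)                      # (y, x) centre of the cutout

    slc_lg, slc_sm = overlap_slices(large.shape, small_shape, position)
    cutout = large[slc_lg]                         # roughly what extract_array returns
    large[slc_lg] += np.ones(small_shape)[slc_sm]  # roughly what add_array does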

All this said, given the way the profiler reports the timings, there is some loss in the system. For example, in the above speed_results.txt, nstar says its 8870 hits of line 422 take 242 seconds, but the total time we're actually in subtract_psf is reported as 172 seconds. That could be overhead that's actually in the system (i.e., time to enter/exit the function, return the parameters, pass by value, etc.), or overhead that the profiler is introducing (although it claims it subtracts its own overhead from the reported numbers, iirc).

This didn't really matter in the above comment, as those runtimes were sufficiently large that the time spent in the function was the largest term (although 172 vs 242 is only 71% of the total time to call image = subtract_psf(image, self.psf_model, param_table, subshape=self.fitshape)...).

However, for the extra profiling I ran just now (for 200 sources, not 900), the profiler reports that subtract_psf spends 16 seconds calling the two extract_array functions and 1.02 seconds calling add_array, but then reports that it is actually inside extract_array for 0.63 s, and inside add_array for 0.24 s! So actually it seems that overlap_slices (which has a total of 0.7 s runtime) is NOT the important part, but something within the function calls on those two lines. (This profiling can be seen in speed_results2.txt)

In full, those lines are:

    subbeddata = add_array(subbeddata, -psf(x, y), (y_0, x_0))
    y = extract_array(indices[0].astype(float), subshape, (y_0, x_0))
    x = extract_array(indices[1].astype(float), subshape, (y_0, x_0))

Figuring that the only actual code being run in those calls is the astype(float) conversion, I moved it out to its own line, adding inds_float = indices.astype(float) and then calling extract_array with inds_float[1]. As can be seen in speed_results3.txt, this moves 39% of the runtime to the float conversion, which, combined with indices = np.indices(data.shape), accounts for 70% of subtract_psf's runtime -- with 22% left for the data.copy() a line below it.

Thus if something can be done to remove any, or all, of

    indices = np.indices(data.shape)
    inds_float = indices.astype(float)
    subbeddata = data.copy()

in subtract_psf, the PSF fitting will get a big speed-up. Unfortunately it can't get much more than 3x faster without improving the speed of the fitter we're using by default, but on my 2019 MacBook Pro 13" I was fitting 900 sources in ~50 seconds, so hopefully we're already well optimised compared to the 10 minutes reported previously.
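
As a rough illustration (a hypothetical sketch, not a patch to photutils; it assumes IntegratedGaussianPRF-style parameter names and the x_fit/y_fit/flux_fit output-table columns), merging the grid creation and the float conversion into one allocation would look something like:

    import numpy as np
    from astropy.nddata.utils import add_array, extract_array

    def subtract_psf_sketch(data, psf, posflux, subshape):
        # build the pixel grid once, already as float, merging the
        # np.indices(...) and astype(float) steps flagged above
        indices = np.indices(data.shape, dtype=float)
        subbeddata = data.copy()   # still one unavoidable copy of the input image

        for row in posflux:
            x_0, y_0 = row['x_fit'], row['y_fit']
            y = extract_array(indices[0], subshape, (y_0, x_0))
            x = extract_array(indices[1], subshape, (y_0, x_0))
            psf.x_0, psf.y_0, psf.flux = x_0, y_0, row['flux_fit']
            subbeddata = add_array(subbeddata, -psf(x, y), (y_0, x_0))
        return subbeddata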

pllim commented 3 years ago

Thanks for the detailed report, @Onoddil !

If the indices and copy are unavoidable, we can also think about using Dask or multiprocessing. Cython crossed my mind too, though I am not sure how to do it in a sane way if the bottleneck is in a different library (e.g., scipy or numpy).

Then again, attempting optimization for a case of, say, 900 stars might have side effects for other use cases as well. So, when work is done to optimize, it needs to consider several vastly different use cases.

I might just be spewing things that everyone already knows, so this is more like a note for my future self.

pllim commented 3 years ago

While we're on this topic, I see that there is a https://github.com/astropy/photutils-benchmarks but it hasn't been updated for 6 years. Should I open a new issue?

larrybradley commented 3 years ago

Many thanks, @Onoddil! This is great.

@pllim That's an old repo that had only some aperture photometry benchmarks. IIRC, it was running on @astrofrog 's personal mac mini. I'll add it to the backlog to consider a more comprehensive set of benchmarks (using asv).

pllim commented 3 years ago

@larrybradley , do you plan to re-use that same repo? If not, might wanna archive it. Thanks!

mjyb16 commented 3 years ago

@Onoddil @pllim Super helpful seeing the code profiling. I was running IterativelySubtractedPSFPhotometry on a globular star cluster a couple months ago and was disappointed in its slow performance (several minutes for ~1800 sources, and this was after setting all parameters to the minimum for optimal speed).

I think it is somewhat surprising that the indices() and copy() functions take up such a big part of the processing time; the array sizes must be fairly large, even considering that those functions are called repeatedly.

In addition to Dask/multiprocessing and Cython, Numba could be a good option for speeding things up since we are dealing with numpy arrays and functions, which is what Numba is designed for. Hopefully this all gets sorted out!

pllim commented 3 years ago

Numba might be too hardware-oriented, as not everyone running photutils has a heavy-duty graphics card. While it can be one of the options, I don't think it should be the only option.

mjyb16 commented 3 years ago

The default compilation for Numba is actually for CPUs, so having a GPU is not a requirement. I agree that CUDA/GPUs should definitely be optional since that is a pretty specialized domain for now.
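
For example, a plain @njit function (a minimal, photutils-agnostic sketch) compiles for the CPU by default; the GPU only comes into play if you explicitly use numba.cuda:

    import numpy as np
    from numba import njit

    @njit(cache=True)
    def subtract_cutout_inplace(data, model_cutout, y0, x0):
        # subtract a small model stamp from a larger image, in place
        ny, nx = model_cutout.shape
        for i in range(ny):
            for j in range(nx):
                data[y0 + i, x0 + j] -= model_cutout[i, j]

    img = np.zeros((100, 100))
    subtract_cutout_inplace(img, np.ones((11, 11)), 40, 50)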

janerigby commented 2 years ago

Has this issue been fixed yet? I'm running PSF-fitting photometry on some JWST NIRCam commissioning data, and there must be something wrong, because it's taking 30 min to run, even for a small cutout (about 1 NIRCam detector). Below is how I'm calling photutils:

from astropy.modeling.fitting import LevMarLSQFitter
from photutils.background import MADStdBackgroundRMS, MMMBackground
from photutils.detection import IRAFStarFinder
from photutils.psf import DAOGroup, IterativelySubtractedPSFPhotometry

bkgrms = MADStdBackgroundRMS()
std = bkgrms(image)
iraffind = IRAFStarFinder(threshold=3.5*std, fwhm=fwhm_psf)
daogroup = DAOGroup(2.0 * fwhm_psf)
mmm_bkg = MMMBackground()
fitter = LevMarLSQFitter()
photometry = IterativelySubtractedPSFPhotometry(finder=iraffind, group_maker=daogroup,
    bkg_estimator=mmm_bkg, psf_model=epsf, fitter=fitter, niters=1,
    aperture_radius=30, fitshape=(61, 61))

Happy to send the notebook and input files to someone with data access.

larrybradley commented 2 years ago

@janerigby I don't know the PSF-fitting photometry code well, but I can take a look if you send me the notebook and data. One quick suggestion I have is to try replacing DAOGroup with DBSCANGroup (https://photutils.readthedocs.io/en/latest/api/photutils.psf.DBSCANGroup.html#photutils.psf.DBSCANGroup). Note that DBSCANGroup requires having the scikit-learn optional dependency installed.
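
A sketch of that swap, reusing the names from the snippet above (the crit_separation value is just carried over from the DAOGroup call):

    from photutils.psf import DBSCANGroup

    daogroup = DBSCANGroup(crit_separation=2.0 * fwhm_psf)   # drop-in replacement for DAOGroup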

I'm also curious whether BasicPSFPhotometry (i.e., instead of IterativelySubtractedPSFPhotometry) also has a long run time. The latter essentially runs the former in a loop until no more sources are detected. I'm wondering if the subtraction step is a bottleneck (as suggested by @Onoddil 's profiling above) or if the loop is running too many iterations.
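
Something along these lines, reusing the components from the snippet above, would test that (a sketch only; fitshape and aperture_radius are carried over unchanged):

    from photutils.psf import BasicPSFPhotometry

    basic_phot = BasicPSFPhotometry(group_maker=daogroup, bkg_estimator=mmm_bkg,
                                    psf_model=epsf, fitshape=(61, 61),
                                    finder=iraffind, fitter=fitter,
                                    aperture_radius=30)
    result_tab = basic_phot(image=image)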

larrybradley commented 2 years ago

@janerigby Hold off on sending me JWST data. I'm not on an instrument team, so I don't know if I have "data access".

larrybradley commented 9 months ago

I wrote a new PSFPhotometry (and IterativePSFPhotometry) class from the ground up. It was released in 1.9.0 and is significantly faster (by factors of up to 200x). I plan to improve this further with multiprocessing in a future release. The old BasicPSFPhotometry and IterativelySubtractedPSFPhotometry classes are deprecated and should no longer be used (they will be removed in a couple of releases).
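
For anyone landing here from a search, a minimal sketch of the new interface (photutils >= 1.9; parameter values are illustrative, data is assumed to be background-subtracted, and std stands in for a background-noise estimate):

    from photutils.detection import DAOStarFinder
    from photutils.psf import IntegratedGaussianPRF, PSFPhotometry, SourceGrouper

    psf_model = IntegratedGaussianPRF(sigma=1.8)
    psfphot = PSFPhotometry(psf_model, fit_shape=(7, 7),
                            finder=DAOStarFinder(fwhm=4.0, threshold=5.0 * std),
                            grouper=SourceGrouper(min_separation=8.0),
                            aperture_radius=5.0)
    phot_tbl = psfphot(data)   # IterativePSFPhotometry follows the same call pattern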