astropy / photutils

Astropy package for source detection and photometry. Maintainer: @larrybradley
https://photutils.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Map out parameter choices where EPSFBuilder fails, document it clearly, and give heuristic warnings #960

eteq opened this issue 4 years ago

eteq commented 4 years ago

This is the follow-on issue I promised in https://github.com/astropy/photutils/issues/951#issuecomment-538033961

One of the results to come out of that was that, regardless of whether there's an actual bug in the algorithm, more guidance needs to be available to users about how to set the parameters, how many PSF stars there need to be, etc. So to rectify this (and provide concrete answers to some of the speculation mentioned in #951), I propose the following experiment:

Build a notebook that constructs a dataset that's a cross between the existing example and @larrybradley's example from #951 (https://nbviewer.jupyter.org/gist/larrybradley/57371f860741e7d0189a48e085b32e63). That is, it would do something like this:

Use a relatively small image/set of stars for reasonable speed (i.e., more like @larrybradley's example than what's in the docs), using a Gaussian PSF (to make "truth" easier), but also with a noisy background and with stars not placed right on the pixel grid, to make it slightly more realistic. Then try building the PSF while varying all of the relevant parameters (oversampling, number of PSF stars, etc.).

For each of the various cases, build the PSF and then compare the result to the true PSF. Once we have that, we will have actual data-driven answers to the assertions in #951. With that in hand, the notebook can be translated into a Sphinx document (nbconvert is helpful here), which we will add to the photutils docs (and implicitly, then, also to the tests).
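The compare-to-truth step could be a simple RMS of the (re-normalised) built ePSF against the analytic Gaussian evaluated on the same oversampled grid. A minimal numpy sketch, with a noisy stand-in for the EPSFBuilder output (grid size, sigma, and oversampling values here are illustrative, not prescriptive):

```python
import numpy as np

def true_gaussian(shape, sigma, oversampling):
    # Analytic Gaussian "truth", evaluated on the oversampled ePSF grid.
    y, x = np.mgrid[0:shape[0], 0:shape[1]]
    cy, cx = (shape[0] - 1) / 2, (shape[1] - 1) / 2
    s = sigma * oversampling  # detector-pixel sigma in oversampled pixels
    g = np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * s ** 2))
    return g / g.sum()

def rms_vs_truth(epsf_data, sigma, oversampling):
    # RMS residual between a built (re-normalised) ePSF and the truth.
    truth = true_gaussian(epsf_data.shape, sigma, oversampling)
    built = epsf_data / epsf_data.sum()
    return np.sqrt(np.mean((built - truth) ** 2))

# Stand-in for an EPSFBuilder result: the truth plus a little noise.
rng = np.random.default_rng(0)
built = true_gaussian((41, 41), 2.0, 4) + rng.normal(0.0, 1e-6, (41, 41))
print(rms_vs_truth(built, 2.0, 4))
```

Looping this metric over the parameter grid is what turns the notebook's plots into data-driven answers.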

With that documentation available, we can add some heuristic checks to the top of the psf builder that raise warnings when some of the "danger zones" are met. E.g.:

import warnings

if len(psf_stars) == 1 and oversampling > 10:
    warnings.warn('OK friend: thanks for your interest in building ePSFs. But '
                  'you may have had a few too many 🍻 when doing this, because '
                  'you really are not going to get a well-sampled PSF with '
                  'just one PSF star. See <link to docs> for more on how to '
                  'choose parameters for the EPSFBuilder.')

That then leads the user to something helpful while still warning them that it probably isn't going to work out for their particular dataset.

Onoddil commented 4 years ago

Thanks for the summary, @eteq. I have started looking into this in more quantitative detail, using @larrybradley's notebook as the foundation. So far I have extended the parameter space to several Gaussian sigma values (0.5, 1, and 3 pixels, which hopefully represent roughly under-, critically-, and over-sampled PSFs) and to different inputs for norm_radius and recentering_maxiters, to explore at least the impact of a bad normalisation and the "speed of failure" re: residual resamplings.

So far norm_radius looks like an independent parameter (which is good); all it does is set the overall scaling. I have some ideas on how to test whether the user has input a poor choice for it, but overall it doesn't appear to have anything to do with fitting success/failure; its default value simply carries a space-based bias.

Re: my previous comment about why the residual changes "magically fix" things... It turns out there are two straightforward changes that can be made.

The first is the assignment of star samplings to residual grid points. Currently, the assignment of a star sample to a residual (star normalised flux minus ePSF evaluated at its x, y, to then be sigma-clipped and the mean added to the ePSF grid data points) is done on a nearest-neighbour scheme, via _py2intround. However, as shown in Figure 5 of Anderson & King (2000), all samples within a grid-point spacing are used by a given residual for its sigma-clipping (rather than half a grid point, which nearest-neighbour mapping would give), so a simple fix is to add a given star-sample residual to the four grid points surrounding it (its nearest one, and then, depending on the sign of the relative offset from star sample to grid point, three more points to the left/right and top/bottom). This smooths out the residual data points, increasing the number of samplings available to each residual grid point by a factor of 4, and, specifically for @larrybradley's example notebook, helps with the poor dither pattern: the given "dither pattern" of np.arange(4) * 0.25 - 0.375 places source centroids at [-0.375, -0.125, 0.125, 0.375], exactly between the grid points at [-0.5, -0.25, 0, 0.25, 0.5]. Thus the "spreading out" of the residuals helps alleviate the worst possible dither pattern in this instance. Note: even with the NN scheme for assigning residuals, I can create excellent ePSFs using an "on" dither pattern instead of the "halfway between" one.

The second, smaller change I have made, for which I am less certain of any explanation, is to further follow the scheme laid out in Figure 8 of Anderson & King (2000) and not recenter the ePSF on the final iteration (the flow diagram asks if this is the 5th iteration, recentering if it is not, or normalising before quitting if it is). This small change, for the "fails to fit" example given in the above notebook (NN residual scheme, maxiters=10, recentering_maxiters=20, norm_radius=5.5), stops the failure and allows a good-looking (albeit incorrectly normalised) Gaussian ePSF to be recovered. I am currently unsure why the final recentering should have such a drastic effect, but can easily change the workflow to match AK00.
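The four-point residual assignment described above could be sketched roughly as follows (the function name and the nearest-point-plus-offset-sign convention are my own illustration; the real code operates on star samplings inside EPSFBuilder):

```python
import numpy as np
from collections import defaultdict

def assign_residuals_four_point(samples):
    """samples: iterable of (u, v, residual) in oversampled-grid coords.
    Returns a dict mapping grid point (i, j) -> list of residuals, with
    each sample contributing to the four surrounding grid points (its
    nearest one plus three neighbours toward the sample), instead of
    only the single nearest-neighbour point."""
    buckets = defaultdict(list)
    for u, v, r in samples:
        i0, j0 = int(np.round(u)), int(np.round(v))  # nearest grid point
        # Step toward the sample: +1 if the sample sits above the
        # nearest point in that axis, else -1.
        di = 1 if u >= i0 else -1
        dj = 1 if v >= j0 else -1
        for i in (i0, i0 + di):
            for j in (j0, j0 + dj):
                buckets[(i, j)].append(r)
    return buckets
```

Each residual grid point then sigma-clips its (now roughly 4x larger) list of samples as before.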

I will also follow this up with more rigorous testing, but it appears that it is the total number of iterations that matters (i.e., recentering_maxiters times maxiters), rather than recentering_maxiters alone. More residual samplings kill the poor-fit setups more quickly for a given maxiters, but setting recentering_maxiters=1 and maxiters=50 should produce similar results to recentering_maxiters=5 and maxiters=10 (these preliminary tests were @eteq's, so I'll also run some once I have finished some other things). That said, the current default for recentering_maxiters is a factor of four larger than the default value quoted in AK00, so I will also put in a super-minor PR to fix that slightly odd, slipped-through-the-cracks potential error.

However, in terms of preliminary guesses as to what causes failures at "random":

1) Oversampling & number of PSF stars: the number of stars per square oversampling will be important in "real" data, to better combat noise/crowding etc. via sigma-clipping, but if sources are all sufficiently bright then having every residual point receive at least a few samples is the minimum requirement, which the non-NN residual-assignment scheme will help with. We can also presumably come up with some metric whereby, if X% of residual points have fewer than Y samples assigned to them, a warning is raised telling the user to either lower their oversampling factor or input more stars.

2) Number of dithered stars & relative size of PSF: these two are also linked, as the importance of dithering depends on the initial wrongness of your input positions. Fluxes are easy, as aperture photometry is precise enough as an input, but if the pixel phase (the systematic incorrectness of a guessed star centroid) is sufficiently large, then star centers can move enough to land in the next grid point over, and if enough of them suffer this then the iterative fitting process will potentially not recover. I'm not sure how to quantify this yet, because to know whether dithers are needed you need to know the relative PSF size or the maximum pixel phase, but the former requires successfully fitting the PSF and the latter requires the dithers to get x - x_avg... How do you know, a priori, if an image is in the nice, safe, doesn't-need-dithering ground-based regime, or the worrisome, kinda-needs-dithering space-based regime? Tutorials/documentation would help, but what happens if a user doesn't read the how-to in depth?
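The X%-below-Y-samples metric floated in point (1) might look something like this (the function name, the thresholds, and the nearest-neighbour histogramming are all placeholder choices, not an existing photutils API):

```python
import warnings
import numpy as np

def check_residual_coverage(u, v, grid_shape, min_samples=4, max_bad_frac=0.2):
    """Hypothetical heuristic: histogram star samples (u, v, in
    oversampled-grid coordinates) onto the residual grid and warn if
    more than max_bad_frac of grid points get fewer than min_samples."""
    counts = np.zeros(grid_shape, dtype=int)
    i = np.clip(np.round(u).astype(int), 0, grid_shape[0] - 1)
    j = np.clip(np.round(v).astype(int), 0, grid_shape[1] - 1)
    np.add.at(counts, (i, j), 1)  # unbuffered accumulation per grid point
    bad_frac = (counts < min_samples).mean()
    if bad_frac > max_bad_frac:
        warnings.warn(f'{bad_frac:.0%} of ePSF grid points have fewer than '
                      f'{min_samples} samples; lower the oversampling factor '
                      f'or supply more PSF stars.')
    return bad_frac
```

A check like this would run at the top of the build, before any expensive iteration.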

Onoddil commented 4 years ago

(See the bottom of this long post for questions I would like clarifications to; everything before that is mostly context, details, and record keeping)

My test suite is now a bit more complex, but also set up to probe more or less any of these issues in any combination. I hadn't bothered reporting back on intermediate steps until I was reasonably convinced of where the issue could lie, and now I've kinda hit a dead end.

For context, I am running N~5 random draws of M stars per square oversampling, each with D dithers, randomly placed within a [-0.5, 0.5] offset from a given pixel (each star spaced by >size pixels, so there's no overlap, crowding, etc.), with peak amplitude drawn from [100, 1000] (so total flux a factor of 2 pi sigma^2 larger). These are then padded with a background flux of 20/s, multiplied by an exposure time of 100 seconds, re-drawn from a Poissonian distribution, divided by the exposure time again, and background-subtracted (using the known original background flux). We should therefore have a varying range of brightnesses of isolated Gaussian PSFs, with sensible Poissonian statistics applied.
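As a concrete sketch of that noise model for a single star (the flux range, background rate, and exposure time are the values quoted above; the grid size and star placement are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

# Noise model: background of 20 counts/s, a 100 s exposure, a Poisson
# redraw of the expected counts, then divide by the exposure time and
# subtract the known background.
sigma, exptime, background = 0.5, 100.0, 20.0
ny = nx = 25

amp = rng.uniform(100, 1000)          # peak amplitude in [100, 1000]
# (total flux of the 2-D Gaussian is then amp * 2*pi*sigma**2)
cy = 12 + rng.uniform(-0.5, 0.5)      # sub-pixel centre offsets
cx = 12 + rng.uniform(-0.5, 0.5)

y, x = np.mgrid[0:ny, 0:nx]
star = amp * np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))

counts = (star + background) * exptime                # expected counts
image = rng.poisson(counts) / exptime - background    # counts/s, bkg-subtracted
```

Repeating this for M stars per square oversampling and D dithers gives the full synthetic dataset.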

I have run a bunch of combinations of star density, dither pattern (needing a relatively sensible pattern to ensure that my fake WCSs make sense and are easily creatable), and PSF size, but for this update I am mostly focusing on a pretty hard case: 1 star per square oversampling (so an average of 1 sampling per ePSF grid point), no dithers (or a dither of 1x1...), and a Gaussian PSF with sigma=0.5, in an attempt to represent an HST-like, space-based PSF that should be relatively undersampled (with 46.6% of the flux of a centrally positioned source falling within the first pixel, comparable to most HST PSFs). I am therefore only varying, at the moment, algorithm-related cogs, primarily the recentering of the ePSF.
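The 46.6% figure follows from the separable Gaussian integral over the central pixel:

```python
from math import erf, sqrt

def central_pixel_fraction(sigma):
    # Flux fraction of a centred, symmetric 2-D Gaussian falling in the
    # central 1x1 pixel: the 1-D integral over [-0.5, 0.5] is
    # erf(1 / (2*sigma*sqrt(2))), and the separable 2-D case is its square.
    return erf(1.0 / (2.0 * sigma * sqrt(2.0))) ** 2

print(central_pixel_fraction(0.5))  # ~0.466, the 46.6% quoted above
```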

The things to vary, which differ between the original, non-normalised-but-non-blowing-up v0.6 code and the new, normalised-but-sometimes-falls-over v0.7 code:

1) Following AK00's Figure 8 workflow, do we recenter the ePSF after the Nth iteration or not?
2) Do we recenter via re-evaluations of the ePSF, or do we recenter the samplings such that a fresh derivation of the ePSF residuals should produce a centered ePSF, as per AK00 section 4.2.1?
3) Do we do a single recentering in each 'adjust-and-smooth' (AK00, 4.2.1) inner loop, or do we have another loop, iterating until the centroid stops moving? (Not sure of a citation for this; it's just a difference between the v0.6 and v0.7 codes, so I figured I'd test it.)

This produces 5 unique tests, rather than 8: the 'update the samplings' recenter method doesn't actually do anything on the final update-and-smooth inner loop, as you go immediately into star fitting, which updates the positions anyway, removing one combination; and option (3) doesn't, as far as I have understood it, really make any sense (at least without serious code overhaul) for the 'update the samplings' recentering, removing the combinations where recentering would be done by sampling shift. We thus have:

1+2) iterate centroiding, shift by evaluation of ePSF, do/don't recenter on the final inner-loop iteration;
3+4) single centroid shift in inner loop, shift by evaluation, do/don't recenter on the final iteration;
5) single centroid shift per loop, shift by sampling move, don't recenter on the final iteration.

For each dataset/algorithm combination, we can also test how much poor initial positions change things by initially updating the star cutout_centers with, say, centroid_com fits, or not. Then, finally, the number of (at present, in v0.7) recentering_maxiters and maxiters can be varied and compared with a v0.6 maxiters=20 'original result', running (1, 1), (1, 20), (5, 20), (20, 5) as combinations of recentering and overall loop counts, respectively. (1, 20) is useful because it essentially reduces back to the original v0.6 code, which did not contain this second inner loop (except insofar as it had one for iteratively fitting the centroid itself).
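For bookkeeping, the full test grid works out to 40 configurations; a sketch of the enumeration (the variant labels are my own shorthand for the cases described above):

```python
from itertools import product

# Recentering variant x initial-position handling x
# (recentering_maxiters, maxiters) budgets.
recenter_variants = [
    'eval, iterate centroid, final recenter',       # test 1
    'eval, iterate centroid, no final recenter',    # test 2
    'eval, single centroid, final recenter',        # test 3
    'eval, single centroid, no final recenter',     # test 4
    'shift samplings, single centroid, no final',   # test 5
]
init_positions = ['raw cutout_center', 'centroid_com refit']
iter_budgets = [(1, 1), (1, 20), (5, 20), (20, 5)]

cases = list(product(recenter_variants, init_positions, iter_budgets))
print(len(cases))  # 5 * 2 * 4 = 40, each compared to a v0.6 maxiters=20 run
```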

For tests 1+2, with the iterative centroiding, the final recentering doesn't matter too much, although not including it gives slightly worse residuals (except when recentering_maxiters=1, where it makes sense that you kinda need at least one recentering). These residuals agree, for the most part, with the v0.6 code's results: when one set of random star positions fits well, so does the other, and vice versa. None of these go off the rails.

Tests 3+4: for the same tests, except without the 'random walk' to a fully centered ePSF during the residual-subtract-smooth loop, the results are worse than the v0.6 case (a small fraction of the tests produce horrible ePSFs, occasionally with the banding pattern seen previously). Whether the final recentering helps is unclear, but recentering_maxiters=1 does give non-terrible fits in all 5 cases, although maybe a factor of 2 worse than when the centroiding is itself iterated.

Test 5: changing the recentering so that the samples are shifted avoids any major 'blow up' issues, but the ePSFs sometimes seem to be slightly offset (i.e., the residuals are positive on one side and negative on the other). This could be a simple coding error on my part, since this was a new addition to the test after combing AK00 for details I may have missed previously. Other than that, this gives residual values of the order of those of the v0.6 code.

Given what a terrible setup this test is (a low number of randomly placed stars, no dither, a small PSF relative to the pixel size, plus a, hopefully, "healthy" amount of image noise), it would seem that the issue most likely lies in the recentering. Unfortunately, the current version appears to be a 'halfway house' between two algorithms that each work: the v0.6 option, which iterates (either a 2nd or 3rd loop, depending on whether you still include recentering_maxiters in its v0.7 form, with the residual subtraction+smoothing in its remit) over the ePSF evaluation to reach a stable center solution, and the full AK00 option of moving the samplings (currently coded as a slight adjustment to star.cutout_center, as the samplings are i - x_star).

I guess the question now (for those still with me) is how closely this code should actually follow AK00. The 'inner loop' of residual fitting, smoothing, and centering is from the original paper, but in the WFC3 ISR it appears that this loop is dropped: for each outer loop, star average positions are computed (to remove pixel phase from the undersampled images, etc.), one iteration of residuals -> smoothing -> centering is performed, and then the stars are fit with the new ePSF (and repeat). This more closely agrees with the v0.6 version (plus the, then second, inner loop of 'random walks' to fit the centroid, which is different from the ISR's small shift steps and computed asymmetry criterion, but could be spiritually similar). Thus the old v0.6 code had split from its cited paper, and perhaps I, without realising this, naively put the algorithm back as close as possible to its quoted reference without making too large a code upheaval, undoing changes that were made for good reasons. My questions now are:

START HERE FOR CLARIFICATION QUESTIONS

1) Was the removal of the inner loop of residual fit, smooth, and recenter done deliberately between AK00 and the 2016 WFC3 ISR, or is this a singular case of WFC3 vs WFPC2; which is the better 'overall' choice? Given the time gap from one to the other it would make sense to take the latest updated algorithm, but I thought I'd check.
2) In AK00, it is unclear whether the residuals are cumulatively subtracted from the 'current PSF model', or whether in each inner-loop iteration (after the samplings were shifted in the (N-1)th loop) a fresh residual is computed from the ePSF grid as it was before the inner loop began (i.e., in AK00 Figure 8 language, does $\Psi_E$ change within box (1)? Does the current $\Psi_E$ change outer-loop times, or outer-loop times inner-loop times?). Is the inner loop basically there to get the centering down, or is it itself an incremental residual update as well?
3) For question (2), I currently assume that the PSF grid changes only outer-loop times (i.e., it is not a cumulative residual update within the inner loop), as that gave better results in my testing. However, if that is the case, then should the evaluation case also not be a cumulative residual, but rather a set of N individual residual computations? This swap from sampling shift to ePSF evaluation makes less sense for non-cumulative residual fitting, making me doubt my initial answer to (2)...
4) For question (1), is the original v0.6 recentering_maxiters loop equivalent to the 2016 WFC3 ISR's dx, dy shift centroiding? I.e., don't just trust the centroid and accept the position, but keep going until you find the best one?

In summary: I now have too many fitting combinations that seem to produce decent ePSFs (well, good under the circumstances) for a rubbish dataset (no dithers, 16 stars, random placings, noise in the image), but am now slightly confused about the details of the different versions of the ePSF fitting algorithm in the literature, and would appreciate some clarification from the people (I'm guessing @larrybradley and @eteq, possibly Jay) who were involved in the v0.6 code writing, to correct what could just be a half-and-half mix of two apparently incompatible algorithm flows.

If the WFC3 ISR represents an update to the algorithm overall (and certainly, with hindsight, accepting a single centroiding shift seems a little silly given that we don't blindly trust the centroids of the star positions in the first place), then I'll just put the recentering part of the code back to its v0.6 state (single loop, iterating over the centroid position); but if it's worth looking at the sampling shift within an inner residual loop, I can continue to work on that too. If the _recenter_epsf code block does get reverted, we may need to look into whether we still want centroid_epsf, as I had to hold its hand quite a lot to get it to work with box_size within the old recentering function, but that's a discussion for another day...

Onoddil commented 4 years ago

After a quick chat to @eteq it appears this is a case of missing history of the code project resulting in some unforeseen bugs that got past review.

I reduced the number of recentering combinations to 3 by assuming that PSF evaluation recenterings should always have a final recenter (because it doesn't make much sense that they don't, unlike the sampling shift recenter), which reduced the test to evaluation+iterative centroid, evaluation+single centroid, and sampling shift loop.

Assuming that the sampling-shift loop (loop (1) in AK00 Figure 8) is itself a form of recentering refinement, the similarities between the two algorithms become fairly obvious. Following @eteq's information that the changes to the loop structure came from Jay, I can only assume that the ISR is (obviously) an update to the algorithm; I will therefore put the _recenter_epsf structure back to match that part of the v0.6 codebase as closely as possible, rather than pursue the sampling-shift recentering any further, now that I have context for (and a citation to) the alterations to the algorithm.

The current metrics are still fairly useful for 'goodness of fit vs input parameter' studies so I'll keep going with the tests and condense the information into some more easily digestible forms (currently just a load of residual plots near one another...).

There just remains the issue of the centroiding algorithm itself. The v0.6 code just used centroid_com, but part of the changes I made were to add centroid_epsf. Is this "move the center to get the two pixels either side of the central pixel to be equal" also superseded by the ISR's 'calculate asymmetry' loop? Should we just always use the CoM algorithm? The problem arises with the box_size used in centroid_func to only calculate the centroid on the inner core of the ePSF, which could be awkward to handle with centroid_epsf not wanting a small cutout, but instead caring about individual pixels and their derivatives wrt x/y. Should we instead (/also) offer centroid_asymmetry, the centroiding algorithm from the WFC3 ISR? That would accept a box_size cutout.
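For reference, a rough numpy stand-in for the box_size-restricted CoM centroiding that the old recentering function effectively performed (this is an illustrative sketch, not photutils' centroid_com implementation):

```python
import numpy as np

def com_centroid_in_box(data, box_size):
    # Centre-of-mass centroid computed only on a box_size cutout around
    # the peak pixel, so the wings of the ePSF don't pull the estimate.
    py, px = np.unravel_index(np.argmax(data), data.shape)
    half = box_size // 2
    y0, x0 = max(py - half, 0), max(px - half, 0)
    cut = data[y0:y0 + box_size, x0:x0 + box_size]
    yy, xx = np.mgrid[0:cut.shape[0], 0:cut.shape[1]]
    total = cut.sum()
    return (y0 + (yy * cut).sum() / total,
            x0 + (xx * cut).sum() / total)
```

A cutout-based asymmetry centroider would slot into the same interface, which is part of why it may be easier to support than the per-pixel-derivative approach of centroid_epsf.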

To summarise again: with hindsight and context, it appears I did indeed halfway code between two algorithms, producing unexpected results; I will therefore concede to the newer algorithm and change the recentering to closer match that flow. I will continue to test the precision/accuracy of the fits for test Gaussians as well.