Investigate subtracting AGNs from existing Run2.1i images

LSSTDESC / DC2-production

Configuration, production, validation specifications and tools for the DC2 Data Set.

BSD 3-Clause "New" or "Revised" License


Investigate subtracting AGNs from existing Run2.1i images #362

Closed jchiang87 closed 4 years ago

jchiang87 commented 4 years ago

As a possible mitigation for handling the overly bright AGNs that were added to the centers of galaxies in the Run2.1i data, it's been proposed to simulate the AGNs only, using the realized fluxes in the centroid files, so that the point-like contributions from these objects can be subtracted from the existing Run2.1i images. Discussion on this proposal began in the #desc-dc2-fluxes channel and continued in the #desc-dc2-agn channel.

I'll use this issue to post results from this investigation.

jchiang87 commented 4 years ago

To lay the groundwork for the problem this issue attempts to address, I'll re-post a couple of images from the #desc-dc2-agn channel. The first is a gif that blinks between part of a Run2.1i sensor visit (v398414-r R22_S11) and the same image with the AGNs subtracted: [image: v398414-r_R22_S11_blink] This shows that the added AGNs are fairly bright and ubiquitous, especially for the fainter galaxies.

Next I posted a somewhat misleading image that blinks between two difference images: the first is the Run2.1i image minus the Run2.1.1i image, where the latter was simulated without the AGNs, and the second is my version of the Run2.1.1i image minus the AGN-subtracted image (= the Run2.1i image minus an AGN-only simulation): [image: v398414-r_R22_S11_resids_blink] This is misleading because I neglected to set the random seed in my version of the Run2.1.1i image, so the sky backgrounds are not pixel-wise identical and the residuals from the AGN subtraction are hidden in the noise.

Here is a gif that blinks between the mosaicked Run2.1i raw image and the difference between the calexp for the production Run2.1.1i image (i.e., with the correct seed) and the calexp run on the difference image of Run2.1i minus the AGN-only simulation: [image: v398414-r_R22_S11_Run2 1i_raw_calexp_diff] Since the same seed was used in the Run2.1i and Run2.1.1i images, the sky backgrounds match and most of the pixels in the residual image are near zero (the calexps have slightly different fitted background levels).

jchiang87 commented 4 years ago

I generated AGN-only raw images for all CCDs in r-band visit 398414, subtracted them from the Run2.1i images, and then ran processCcd.py on these AGN-subtracted sensor-visits (hereafter agn_sub). I also ran processCcd.py on the corresponding Run2.1.1i data, and found all positional matches within 10 arcsec between the two source catalogs over the entire focal plane.
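For readers who want to see what a positional match with a fixed radius involves, here is a minimal sketch. It is not the code used for the results in this thread: the function names, the brute-force nearest-neighbor search, and the toy coordinates are all assumptions made for illustration, and RA wrap-around at 0/360 deg is ignored.

```python
import math

ARCSEC_PER_DEG = 3600.0


def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Small-angle separation in arcsec; inputs are in degrees.

    Uses a flat-sky approximation, which is adequate for separations
    of a few arcsec, and ignores RA wrap-around at 0/360 deg.
    """
    dra = (ra1 - ra2) * math.cos(math.radians(0.5 * (dec1 + dec2)))
    ddec = dec1 - dec2
    return math.hypot(dra, ddec) * ARCSEC_PER_DEG


def match_catalogs(cat_a, cat_b, radius_arcsec):
    """For each source in cat_a, keep its nearest neighbor in cat_b
    if that neighbor lies within radius_arcsec.

    Returns a list of (index_a, index_b, separation_arcsec) tuples.
    """
    matches = []
    for i, (ra_a, dec_a) in enumerate(cat_a):
        best_j, best_sep = None, float("inf")
        for j, (ra_b, dec_b) in enumerate(cat_b):
            sep = angular_sep_arcsec(ra_a, dec_a, ra_b, dec_b)
            if sep < best_sep:
                best_j, best_sep = j, sep
        if best_j is not None and best_sep <= radius_arcsec:
            matches.append((i, best_j, best_sep))
    return matches


# Toy (ra, dec) positions in degrees, made up for illustration.
agn_sub = [(55.0, -30.0), (55.001, -30.0)]
run211 = [(55.0, -30.0), (55.003, -30.0)]

loose = match_catalogs(agn_sub, run211, radius_arcsec=10.0)
tight = match_catalogs(agn_sub, run211, radius_arcsec=0.010)
```

In this toy case the second `agn_sub` source has no true counterpart, yet the 10 arcsec radius still pairs it with a neighbor about 3 arcsec away, while the 10 mas radius rejects it. The match radius therefore directly controls how many chance pairings enter the comparison.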

Unfortunately, the numbers of sources identified as point or extended sources in the agn_sub data versus the Run2.1.1i data differ substantially:

| dataset | point source | extended | total |
| --- | --- | --- | --- |
| `agn_sub` | 33289 | 295388 | 328677 |
| `Run2.1.1i` | 38810 | 289867 | 328677 |

Looking at the psfFlux pulls vs psfFlux shows some puzzling behavior. [image: v398414-r_pull_vs_psfFlux] For the blue points, I've selected all matched sources that are identified as point sources in both datasets, and for the red, I've similarly selected all sources identified as extended in both. [image: v398414-r_psfFlux_pull_hist] The pull distributions show that most sources actually do lie along the zero-pull line, but the widths of Gaussian functions fitted to the central parts of the distributions, while less than 1, are probably still larger than we would hope for.
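The thread does not spell out how the pull is computed. A common convention, assumed here, is the flux difference between the two measurements divided by their uncertainties added in quadrature. The function name and the example numbers below are hypothetical.

```python
import math


def pull(flux_a, err_a, flux_b, err_b):
    """Flux difference in units of the combined 1-sigma uncertainty.

    Assumes the two uncertainties are independent and add in
    quadrature; this is an assumed convention, not one confirmed
    by the thread.
    """
    return (flux_a - flux_b) / math.sqrt(err_a**2 + err_b**2)


# Made-up fluxes and errors in counts for one matched source.
example = pull(1050.0, 30.0, 1000.0, 40.0)  # (1050 - 1000) / 50
```

Under this definition, independent Gaussian errors would give a pull distribution of unit width. Because the two datasets here share the same noise realization, the errors are not independent, which may be one reason the fitted widths come out below 1.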

Here are corresponding plots for the gaussianFlux values: [image: v398414-r_pull_vs_gaussianFlux] [image: v398414-r_gaussianFlux_pull_hist]

Plotting the correlated distributions of the shapeHSM_e1 and shapeHSM_e2 parameters for the matched extended sources does show a strong correlation, but there are still significant tails off the main diagonal: [image: v398414-r_shapeHSM_e1_agn_sub_vs_Run2 1 1i] [image: v398414-r_shapeHSM_e2_agn_sub_vs_Run2 1 1i]

johannct commented 4 years ago

Is there a reason not to blink the difference between raw images of 2.1.1 and 2.1-agn? On the second gif you seem to have opted for blinking the difference in calexp, but first what is the situation of the raws when the same seed is used? Sorry if I missed this .... I presume that it is impossible to have exactly the same simulation as 2.1 when separating the sequence of random numbers in 2.1.1 and agn only, right?

jchiang87 commented 4 years ago

> I presume that it is impossible to have exactly the same simulation as 2.1 when separating the sequence of random numbers in 2.1.1 and agn only, right?

That's right. There are two issues: 1) we don't have individual seeds per object, so the random sequences diverge immediately; 2) the AGN-only sim does not have the same brighter-fatter (B/F) effects, since there are far fewer e-/pixel without the other objects and the sky background being rendered.

rmandelb commented 4 years ago

Hi Jim - thanks for sharing these plots of the impact of the subtraction on source counts/fluxes/shapes. I have some questions:

- I was wondering what selection criteria you imposed to identify sources before doing the positional matching? In particular, what I'm wondering is if differences could arise due to the many objects that are near the detection limit, versus arising for objects that are robustly detected?
- Have you compared the PSF model shapes/sizes? My concern is that oddities in the PSF model photometry could indicate that the PSF models themselves are somehow getting messed up and are confounding the comparison. I'm not sure why that would occur, but the plots of pull versus PSF flux look quite strange, so I thought this might be worth checking.
- Can you please remind me of the flux zero point? I'm looking at the plots as a function of flux and trying to mentally re-interpret them in terms of magnitudes.
- You know the fluxes of the injected AGN component, so I was wondering about correlating the pull against the injected AGN flux for a given galaxy, and/or against the ratio of injected AGN flux vs. true galaxy flux? That could be useful in understanding failure modes for the subtraction. I'm very curious what is causing the power-law tails in the pull distribution, for example.
- Do you have a sense for the level of injected AGN flux at which the injected AGN flux + sky would result in significant B/F, whereas the injected AGN flux alone would not? I was wondering about doing some kind of trick like simulating a noise-free sky (i.e., literally lay down the sky level into the image without shooting photons, so as to avoid adding sky noise), simulating the AGN, and then subtracting the originally-added sky to get a better version of an "AGN only" image compared to just simulating the AGN. Is that easily doable? Would it give a higher fidelity subtraction, or are there too many objects where all of sky+galaxy+AGN are needed?

If any of this is non-trivial, don't worry about it, but I thought understanding the above would help us draw some conclusions about the viability of subtraction (not looking promising based on these plots, I admit).

jchiang87 commented 4 years ago

(oops, I somehow ended up deleting my original post. Here it is again:)

Hi Rachel, I can provide a couple of answers right now:

> I was wondering what selection criteria you imposed to identify sources before doing the positional matching?

I just made some basic cuts based on the various flags in the source catalog:

                   'deblend_skipped == False',
                   'base_PixelFlags_flag_edge == False',
                   'base_PixelFlags_flag_interpolatedCenter == False',
                   'base_PixelFlags_flag_saturatedCenter == False',
                   'base_PixelFlags_flag_crCenter == False',
                   'base_PixelFlags_flag_bad == False',
                   'base_PixelFlags_flag_suspectCenter == False',
                   'ext_shapeHSM_HsmShapeRegauss_flag == False'
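As a rough sketch of how such a list of cuts could be applied (not the code actually used here), the snippet below assumes that the strings are combined with a logical AND, as they would be if joined with `' and '` and passed to something like pandas `DataFrame.query`. The helper `passes_cuts` and the two example rows are hypothetical, and only the `<column> == False` form is handled.

```python
# Flag cuts copied from the list above.
cuts = [
    'deblend_skipped == False',
    'base_PixelFlags_flag_edge == False',
    'base_PixelFlags_flag_interpolatedCenter == False',
    'base_PixelFlags_flag_saturatedCenter == False',
    'base_PixelFlags_flag_crCenter == False',
    'base_PixelFlags_flag_bad == False',
    'base_PixelFlags_flag_suspectCenter == False',
    'ext_shapeHSM_HsmShapeRegauss_flag == False',
]


def passes_cuts(row, cuts):
    """Return True if every flag named in `cuts` is False for this row."""
    for cut in cuts:
        column, op, value = cut.split()
        if (op, value) != ('==', 'False'):
            raise ValueError(f'unsupported cut: {cut}')
        if row[column]:
            return False
    return True


# Two made-up catalog rows: one clean, one with a saturated center.
clean = {cut.split()[0]: False for cut in cuts}
saturated = dict(clean, base_PixelFlags_flag_saturatedCenter=True)

selected = [row for row in (clean, saturated) if passes_cuts(row, cuts)]
```

Note that none of these cuts involves signal-to-noise, so faint detections near the noise floor pass the selection unchanged.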

> Can you please remind me of the flux zero point? I'm looking at the plots as a function of flux and trying to mentally re-interpret them in terms of magnitudes.

For these r-band observations, the zero-point is ~32.17. I can remake the plots in magnitude.
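To make the conversion concrete, here is a small sketch that assumes the usual convention `mag = zero_point - 2.5 * log10(counts)`. The function name is hypothetical, and the zero-point is the approximate value quoted above.

```python
import math

ZERO_POINT = 32.17  # approximate r-band zero-point quoted above


def counts_to_mag(counts, zero_point=ZERO_POINT):
    """Convert a flux in counts to a magnitude, assuming
    mag = zero_point - 2.5 * log10(counts)."""
    return zero_point - 2.5 * math.log10(counts)


mag_1e4 = counts_to_mag(1.0e4)  # 32.17 - 10 = 22.17
```

With this convention, 10^4 counts corresponds to about magnitude 22.2, and each factor of 10 in flux shifts the magnitude by 2.5.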

I'll try to address your other questions over the next couple of days.

rmandelb commented 4 years ago

Hmmm, based on these cuts, you could be digging into the noise floor. And with a 10 arcsec match, I could imagine some genuine mismatches that are driving the oddities in the point sources on the left-hand side of the PSF flux plot. Is it possible to re-check the number of detections and the plots with a criterion like "Gaussian flux S/N>10" and a tighter positional match?

Based on the zero point, it seems that the wonky branches in the PSF flux plot are for objects brighter than 10^4 counts or around 22nd magnitude (when simulated without an AGN), and the sign of the effect is that when subtracting the AGN-only image, the branches correspond to extended sources that are too bright and point sources that are too faint by similar amounts? (hence the approximate mirror images)

Do you know what fraction of the objects change classification between point source and extended source?

jchiang87 commented 4 years ago

Thanks a lot for this last comment! I somehow got it into my head that I was making a 10 mas match and not the 10 arcsec match I actually did (even though I typed 10 arcsec above). Making the 10 mas match as I originally intended, things look more reasonable (though still a bit disappointing): [image: v398414-r_pull_vs_psfFlux] [image: v398414-r_psfFlux_pull_hist] [image: v398414-r_pull_vs_gaussianFlux] [image: v398414-r_gaussianFlux_pull_hist] [image: v398414-r_shapeHSM_e1_agn_sub_vs_Run2 1 1i] [image: v398414-r_shapeHSM_e2_agn_sub_vs_Run2 1 1i] And here are the point source vs extended numbers:

| dataset | point source | extended | total |
| --- | --- | --- | --- |
| `agn_sub` | 14235 | 46050 | 60285 |
| `Run2.1.1i` | 14333 | 45952 | 60285 |

I'll follow up with the S/N cut and some of your other suggestions tomorrow.

cwwalter commented 4 years ago

Thinking about the long tails:

Since the pull plots are made from processCcd.py output, and thus from the calexps, I wonder if anything is approaching the 100,000 count level for the brighter objects? In that case DM is going to cut off and interpolate the pixels in the CCD, and the subtraction process definitely won't work for several reasons.

cwwalter commented 4 years ago

Is the pixel interpolation mask flag propagated into the catalog information?

cwwalter commented 4 years ago

Ah... probably

base_PixelFlags_flag_interpolatedCenter == False

would get rid of all of these...

rmandelb commented 4 years ago

Thanks for the updated plots, @jchiang87 ! That looks more promising. And with these results, some of my previous suggestions (e.g., PSF model tests) are no longer relevant. I think the ones that I'm still most curious about include:

- The selection criteria question: if you put in a S/N>10 cut, do you still get a (significant) difference in the numbers of detections in agn_sub versus Run 2.1.1i?
- You know the fluxes of the injected AGN component, so I was wondering about correlating the pull against the injected AGN flux for a given galaxy, and/or against the ratio of injected AGN flux vs. true galaxy flux? That could be useful in understanding failure modes for the subtraction. I'm very curious what is causing the power-law tails in the pull distribution, for example.

jchiang87 commented 4 years ago

> The selection criteria question: if you put in a S/N>10 cut, do you still get a (significant) difference in the numbers of detections in agn_sub versus Run 2.1.1i?

Adding the S/N > 10 cut does reduce the disparity in numbers of point sources vs extended somewhat. Here is the corresponding table of detections:

| dataset | point source | extended | total |
| --- | --- | --- | --- |
| `agn_sub` | 13040 | 38299 | 51339 |
| `Run2.1.1i` | 13079 | 38260 | 51339 |

For the point sources, there is now a 0.3% (=2*39/(13040+13079)) disparity versus 0.7% without any S/N cut.
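The disparity figure is the count difference divided by the mean of the two counts. Here is a short sketch that reproduces both percentages from the point-source counts in the tables; the function name is hypothetical.

```python
def fractional_disparity(n_a, n_b):
    """Absolute count difference divided by the mean of the two counts,
    i.e. 2 * |n_a - n_b| / (n_a + n_b)."""
    return 2.0 * abs(n_a - n_b) / (n_a + n_b)


# Point-source counts for (agn_sub, Run2.1.1i) from the tables above.
with_sn_cut = fractional_disparity(13040, 13079)     # S/N > 10 applied
without_sn_cut = fractional_disparity(14235, 14333)  # 10 mas match, no S/N cut

percent_with = round(100 * with_sn_cut, 1)
percent_without = round(100 * without_sn_cut, 1)
```

The two values round to 0.3% and 0.7%, matching the numbers quoted above.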

> You know the fluxes of the injected AGN component, so I was wondering about correlating the pull against the injected AGN flux for a given galaxy, and/or against the ratio of injected AGN flux vs. true galaxy flux?

This will take a little more work since I need to match the source catalog entries from processCcd.py against the centroid file and instance catalog entries that have the AGN fluxes and coordinates. This will be forthcoming.

rmandelb commented 4 years ago

I might just be jumping to conclusions, but are those the numbers after matching? (which I'm inferring from the fact that the total is the same in both datasets... seems an unlikely coincidence that they would match so precisely?) If so, then I think it would be useful to see the numbers after selection criteria but before matching. Rationale:

- If some insufficiency in the subtraction makes it easier/harder to detect some population in agn_sub than Run 2.1.1i, the matching after imposing S/N>10 might hide that problem.
- Or if some insufficiency in the subtraction messes with the centroid estimates (not sure why that would happen but ...) then the number of detections versus matches could differ significantly.

Either of those problems would be good to know about. So I'd like to see the numbers pre-matching and post-matching to try to test for (or rule out!) a more complete set of problems.
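One way to organize that comparison is sketched below. The function and its field names are hypothetical, and the pre-matching counts in the example are placeholders rather than measured values; only the matched total (51339) comes from the table above.

```python
def match_bookkeeping(n_selected_a, n_selected_b, n_matched):
    """Summarize how many selected sources in each catalog survive matching.

    n_selected_a and n_selected_b are the source counts after the
    selection cuts but before matching; n_matched is the number of
    matched pairs.
    """
    return {
        "unmatched_a": n_selected_a - n_matched,
        "unmatched_b": n_selected_b - n_matched,
        "matched_frac_a": n_matched / n_selected_a,
        "matched_frac_b": n_matched / n_selected_b,
    }


# PLACEHOLDER pre-matching counts (not real measurements); the matched
# total of 51339 is taken from the S/N > 10 table above.
summary = match_bookkeeping(n_selected_a=60000, n_selected_b=58000,
                            n_matched=51339)
```

A large asymmetry between the two unmatched counts would point to a detection-efficiency difference, while a low matched fraction in both catalogs would be more consistent with shifted centroids.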