LSSTDESC / DC2-production

Configuration, production, validation specifications and tools for the DC2 Data Set.
BSD 3-Clause "New" or "Revised" License

Low density on tract 4852/patch 1,5 on Run2.2i DR6 WFD #408

Closed plaszczy closed 3 years ago

plaszczy commented 3 years ago

tract_chelou

plaszczy commented 3 years ago

That was confirmed by @fjaviersanchez: the region was eliminated by the "good" selection, because of objects that had 'base_PixelFlags_flag_clipped' set. The hole is not present in DR2.

johannct commented 3 years ago

See https://lsstc.slack.com/archives/CM6MF33UG/p1607440268069500 for more on this. I tried to track down a pathological calexp but failed. In the meantime it appears that the coaddDriver outputs have inconsistent timestamps, so I am going to rerun it, keeping the warps but removing the final products so that they get updated.

wmwv commented 3 years ago

Thanks, @johannct . I am very interested in the results of a re-run. The special behavior of this patch makes me think that there was a temporary configuration problem, or that this patch was run with a different configuration than all the others.

johannct commented 3 years ago

Before setting the coaddDriver output I just checked the supreme exposure map for 4852, and it looks OK, so whatever happened to this patch is of a different nature than the previously spotted failures.
dc2_testing_04852_i_exptime_sum

wmwv commented 3 years ago

@fjaviersanchez found: """ the hole there in DR6 corresponds to objects with 'base_PixelFlags_flag_clipped' that have been eliminated when you apply the good cut (btw, I also checked DR2 and there's no hole there when you apply the good cut). Below is the plot of DR6 galaxies in the tract that @stefplaz mentioned with base_PixelFlags_flag_clipped==True : """

image (1)

wmwv commented 3 years ago

@fjaviersanchez continues: """ (my map is flipped with respect to Stéphane's)

and a higher resolution zoom-in: """

image (2)

wmwv commented 3 years ago

@johannct notes that: """ update: looking at the occurrence of the value True for this flag in 'deepCoadd_meas' I get u: 32, g: 0, r: 26, i: 29134, z: 32, y: 27. So of course it is the reference band which is pathological..... """

""" Looking at deepCoadd images it seems indeed that the mask for the coadd i image includes an abnormal amount of pixels where CLIPPED (10) and APPROXIMATE PSF (12) are set.... """

wmwv commented 3 years ago

@johannct What numbers do you get for the occurrence of base_PixelFlags_flag_clipped in the next patch over? E.g., tract 4852, patch 1,4.

heather999 commented 3 years ago

Recall this issue: https://github.com/LSSTDESC/DC2-production/issues/400. When we tried to reuse the existing warps from CC for the processing at NERSC, a number of warp files were found to be corrupt. Given the datestamps on the files, it seemed connected with the disk filling up at CC back in May. While I haven't completely finished, I do have a list of bad warps. ~130 of them are in the i band for tract 4852 patch 1,5, and 139 in patch 1,4. There are many other bad warps in 4852 for some other patches. I do wonder if these warps should be regenerated. I started to collect a list here at CC: /sps/lsst/users/hkelly/dr6-warps-checks There's a specific list for 4852 1,5 here: /sps/lsst/users/hkelly/dr6-warps-checks/bad-i-tract-4852-1,5.out~ Sorry, those 130 were the good ones :) 4852 in i band actually looked fine.

johannct commented 3 years ago

Hmm as far as I can tell there are 130 warp files in your list, and 130 warps in the rerun directory.... so that would mean that they are all bad somehow? It would be good to understand exactly what is wrong with them then, because fitsinfo does not complain for any of them.

jchiang87 commented 3 years ago

For many of the corrupted files, one needs to explicitly read in and access the data section of one of the image extensions to see an error. I'd be surprised if the files were corrupted at the time of the coadd generation since things would have crashed, but it's worth following up on. Regardless, corrupted files should be identified and moved out of the way.

johannct commented 3 years ago

yes I was hoping that supreme was doing that as well...... I do not understand. Ok so if you confirm that they should all be deleted I will remove them and relaunch coaddDriver a third time.

heather999 commented 3 years ago

See updated comment :) There are some bad warps, but not in 4852 i band. I'm going through the logs again though to see if the other bands show anything for that particular tract.

johannct commented 3 years ago

In [149]: d=[]
     ...: for f in ['u','g','r','i','z','y']:
     ...:     id={'tract':4852,'patch':'1,4','filter':f}
     ...:     dd=butler.get('deepCoadd_meas',dataId=id)
     ...:     print('{} {}'.format(f,len(np.where(dd['base_PixelFlags_flag_clipped']==True)[0])))
     ...:     d.append(dd)
     ...:
u 28
g 13
r 18
i 29
z 26
y 19

In [150]: d=[]
     ...: for f in ['u','g','r','i','z','y']:
     ...:     id={'tract':4852,'patch':'1,5','filter':f}
     ...:     dd=butler.get('deepCoadd_meas',dataId=id)
     ...:     print('{} {}'.format(f,len(np.where(dd['base_PixelFlags_flag_clipped']==True)[0])))
     ...:     d.append(dd)
     ...:
u 32
g 0
r 26
i 29134
z 32
y 27
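
The two sets of counts above can be compared band by band to make the anomaly explicit. This is only an illustrative sketch: the function name `anomalous_bands` and the factor of 100 are assumptions made for this example, not something used in the thread. The counts themselves are copied from the two outputs above.

```python
# Counts of base_PixelFlags_flag_clipped == True per band, copied from the outputs above.
counts_patch_14 = {"u": 28, "g": 13, "r": 18, "i": 29, "z": 26, "y": 19}
counts_patch_15 = {"u": 32, "g": 0, "r": 26, "i": 29134, "z": 32, "y": 27}


def anomalous_bands(reference, candidate, factor=100.0):
    """Return the bands where `candidate` exceeds `reference` by more than `factor`.

    One is added to both counts so that a zero count cannot divide by zero.
    The threshold `factor` is an arbitrary, illustrative choice.
    """
    flagged = []
    for band, ref_count in reference.items():
        ratio = (candidate[band] + 1) / (ref_count + 1)
        if ratio > factor:
            flagged.append(band)
    return flagged


print(anomalous_bands(counts_patch_14, counts_patch_15))
```

With these numbers only the i band is flagged, which matches the observation that the reference band is the pathological one.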

erykoff commented 3 years ago

Has anybody opened all the warps? If an extension was all zeros or something that would be an issue and still be a valid fits file. Meanwhile, suprême does not open the individual warps in the mode that I ran on DC2, because this was orders of magnitude too slow due to where the warps live on the file system.

jchiang87 commented 3 years ago

The test code snippet I proposed opens each image extension and accesses the shape of the data arrays. That's sufficient to trigger the same error that we saw in the coaddition task. It didn't look at any array values.
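
The snippet itself is not reproduced in this thread, so the following is a minimal sketch of the kind of check described, not the original code. The function names `find_unreadable_hdus` and `check_warp` are hypothetical. It relies on `astropy.io.fits` loading HDU data lazily, so that touching `.data` and its shape forces a read of the data section.

```python
def find_unreadable_hdus(hdus):
    """Return (index, error) pairs for HDUs whose data section cannot be read.

    Only the shape is accessed; no pixel values are inspected.
    """
    problems = []
    for index, hdu in enumerate(hdus):
        try:
            data = hdu.data  # the data section is only read at this point
            if data is not None:
                data.shape
        except Exception as exc:  # any read failure counts as a problem
            problems.append((index, repr(exc)))
    return problems


def check_warp(path):
    """Open one FITS file and report unreadable extensions (hypothetical helper)."""
    # Imported here so that find_unreadable_hdus has no external dependency.
    from astropy.io import fits

    with fits.open(path) as hdus:
        return find_unreadable_hdus(hdus)
```

An empty list from `check_warp` only means that every extension could be read; it says nothing about whether the values are sensible.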

erykoff commented 3 years ago

I think we need to look at the values themselves, because this is setting the CLIPPED flag so presumably there are numbers here but they are crazy outliers.

jchiang87 commented 3 years ago

I'll have a look.

johannct commented 3 years ago

Hmmm I took the last one randomly, to check how the code would go:
In [1]: from astropy.io import fits
In [2]: hdu=fits.open('rerun/run2.2i-coadd-wfd-dr6-v1-grizy/deepCoadd/i/4852/1,5/warp-i-4852-1,5-995008.fits')
In [11]: hdu[1].shape,np.min(hdu[1].data),np.max(hdu[1].data)
Out[11]: ((4200, 4200), nan, nan)

erykoff commented 3 years ago

There can be nans if they're masked, I think you need np.nanmin
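
A small illustration of the difference, using a made-up 2x2 array rather than a real warp: `np.min` and `np.max` return NaN as soon as one pixel is NaN, while `np.nanmin` and `np.nanmax` ignore those pixels.

```python
import numpy as np

# Toy 2x2 "image" in which one pixel is NaN; the values are invented for illustration.
image = np.array([[1.0, np.nan], [-3.5, 7.25]])

# NaN propagates through the plain reductions.
print(np.min(image), np.max(image))

# The nan-aware reductions skip NaN entries.
print(np.nanmin(image), np.nanmax(image))
```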

johannct commented 3 years ago

In [15]: hdu[1].shape,np.nanmin(hdu[1].data),np.nanmax(hdu[1].data)
Out[15]: ((4200, 4200), -459.5638, 131863.9)

wmwv commented 3 years ago

There are two questions:

  1. Is there one bad image?
  2. Why does one bad image propagate to all of the pixels in the coadd? The thresholds to map from individual image mask values to coadd mask values should be at least (?) 20%. Badness in one image shouldn't propagate like this.

johannct commented 3 years ago

ok looping now.....

johannct commented 3 years ago

@wmwv I think we are in paranoid check mode here, we turn all the stones

johannct commented 3 years ago

In [18]: for file in files :
    ...:     hdu=fits.open(file)
    ...:     print(hdu[1].shape,np.nanmin(hdu[1].data),np.nanmax(hdu[1].data))
    ...:
(4200, 4200) -221.08604 133905.1
(4200, 4200) -214.76836 139060.38
(4200, 4200) -366.0075 132893.77
(4200, 4200) -228.26071 127752.06
(4200, 4200) -359.60248 123011.17
(4200, 4200) -410.39493 123648.72
(4200, 4200) -287.60092 134028.53
(4200, 4200) -297.4834 142707.14
(4200, 4200) -343.23798 134039.9
(4200, 4200) -333.2711 94684.02
(4200, 4200) -429.60797 134553.14
(4200, 4200) -240.8877 139433.05
(4200, 4200) -253.50287 132108.03
(4200, 4200) -313.1606 86863.28
(4200, 4200) -227.18558 137615.8
(4200, 4200) -313.3414 138924.62
(4200, 4200) -228.28221 138908.22
(4200, 4200) -233.07065 136456.36
(4200, 4200) -238.40208 140100.1
(4200, 4200) -359.60675 131295.97
(4200, 4200) -297.1659 130965.8
(4200, 4200) -296.82626 122973.234
(4200, 4200) -389.96426 126481.37
(4200, 4200) -209.39848 134179.42
(4200, 4200) -220.55402 135111.73
(4200, 4200) -225.01671 129270.61
(4200, 4200) -241.86803 136417.08
(4200, 4200) -263.54977 122629.26
(4200, 4200) -232.34193 123612.34
(4200, 4200) -242.83937 138155.8
(4200, 4200) -359.29358 139312.73
(4200, 4200) -384.5748 138924.84
(4200, 4200) -198.96458 997.3174
(4200, 4200) -327.1076 141224.86
(4200, 4200) -459.5638 131863.9
(4200, 4200) -218.9993 143118.56
(4200, 4200) -355.44714 131582.31
(4200, 4200) -335.10596 121496.414
(4200, 4200) -220.7141 141601.34
(4200, 4200) -224.05415 145167.67
(4200, 4200) -235.33553 139298.52
(4200, 4200) -348.3461 130177.41
(4200, 4200) -244.94116 113448.625
(4200, 4200) -407.3234 141324.77
(4200, 4200) -344.74213 99353.83
(4200, 4200) -279.4023 126072.96
(4200, 4200) -375.81897 136397.64
(4200, 4200) -384.42545 130244.15
(4200, 4200) -330.38043 141549.83
(4200, 4200) -285.0421 133313.84
(4200, 4200) -224.46022 147216.33
(4200, 4200) -342.903 132111.72
(4200, 4200) -221.27449 135182.61
(4200, 4200) -223.57861 129118.28
(4200, 4200) -245.40088 136870.05
(4200, 4200) -226.03519 124667.016
(4200, 4200) -365.07425 134581.2
(4200, 4200) -358.88992 131584.75
(4200, 4200) -361.09113 126800.15
(4200, 4200) -408.5616 137013.6
(4200, 4200) -276.30737 131298.9
(4200, 4200) -234.86754 121050.64
(4200, 4200) -230.76772 143862.1
(4200, 4200) -405.2909 135016.33
(4200, 4200) -383.11908 136644.48
(4200, 4200) -362.63193 130828.805
(4200, 4200) -361.85605 104724.445
(4200, 4200) -298.3697 139430.66
(4200, 4200) -349.76886 135952.77
(4200, 4200) -234.52693 131596.16
(4200, 4200) -450.40366 127424.664
(4200, 4200) -266.98944 148013.55
(4200, 4200) -239.36635 134346.94
(4200, 4200) -405.66418 139318.78
(4200, 4200) -407.34253 128650.664
(4200, 4200) -270.9116 132728.92
(4200, 4200) -232.91992 137628.5
(4200, 4200) -380.22607 139010.3
(4200, 4200) -390.6479 135483.84
(4200, 4200) -430.38373 103387.22
(4200, 4200) -369.58444 136016.9
(4200, 4200) -236.92389 133908.39
(4200, 4200) -343.90466 114249.07
(4200, 4200) -234.69402 135889.78
(4200, 4200) -275.37216 133662.61
(4200, 4200) -231.78294 139231.02
(4200, 4200) -306.9835 143125.11
(4200, 4200) -430.68283 135528.64
(4200, 4200) -256.94614 132847.14
(4200, 4200) -283.16025 142584.52
(4200, 4200) -253.36618 138652.23
(4200, 4200) -241.82642 122275.39
(4200, 4200) -241.08351 127364.86
(4200, 4200) -335.7297 128853.19
(4200, 4200) -269.1661 135810.47
(4200, 4200) -361.54834 126647.58
(4200, 4200) -243.64369 132435.92
(4200, 4200) -310.02405 145569.95
(4200, 4200) -303.22217 133875.17
(4200, 4200) -325.29352 134571.8
(4200, 4200) -358.8536 131591.53
(4200, 4200) -361.1473 133469.12
(4200, 4200) -298.18802 7121.0894
(4200, 4200) -413.1947 129937.836
(4200, 4200) -219.54723 131380.53
(4200, 4200) -329.9853 124610.19
(4200, 4200) -215.26097 127490.734
(4200, 4200) -215.66096 128880.81
(4200, 4200) -344.53 131975.08
(4200, 4200) -190.59291 49608.266
(4200, 4200) -263.8326 133168.27
(4200, 4200) -221.83551 136502.84
(4200, 4200) -333.5346 131394.84
(4200, 4200) -212.17023 127306.88
(4200, 4200) -230.33888 131760.94
(4200, 4200) -242.73006 135753.66
(4200, 4200) -368.48984 130084.7
(4200, 4200) -210.56142 136671.72
(4200, 4200) -219.70772 134464.67
(4200, 4200) -234.63768 136564.45
(4200, 4200) -207.26266 99546.4
(4200, 4200) -225.53851 143131.39
(4200, 4200) -308.30362 143474.28
(4200, 4200) -323.02667 124720.05
(4200, 4200) -275.3604 129618.72
(4200, 4200) -217.21942 139432.89
(4200, 4200) -301.8557 133558.06
(4200, 4200) -258.93634 146225.88
(4200, 4200) -223.41513 134532.22
(4200, 4200) -351.75363 136545.86

wmwv commented 3 years ago

@johannct wrote: """ @wmwv I think we are in paranoid check mode here, we turn all the stones """

Agreed. The discussion so far had been focused on finding the bad image (item 1). My point was to also encourage consideration of Item 2, what went on in the coadd config.
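
To make the expectation behind item 2 concrete, here is a toy model of a fractional threshold for propagating a mask bit from input images to a combined image. It is not the LSST stack's coaddition code; the function name `propagate_bit`, the flattened-list data layout, and the 20% default (taken from the tentative figure quoted above) are all assumptions made for illustration.

```python
def propagate_bit(input_masks, bit, threshold=0.2):
    """Toy rule: set `bit` in an output pixel only if at least `threshold`
    of the input images have it set at that pixel.

    input_masks: list of equal-length lists of integer mask values,
    one list per input image, with the pixels flattened.
    """
    n_images = len(input_masks)
    output = []
    # zip(*input_masks) yields, for each pixel, its value in every input image.
    for pixel_values in zip(*input_masks):
        n_set = sum(1 for value in pixel_values if value & bit)
        output.append(bit if n_set / n_images >= threshold else 0)
    return output


# Ten inputs of four pixels each; only the first image has the bit set everywhere.
BAD = 0x1
masks = [[BAD] * 4] + [[0] * 4 for _ in range(9)]

# One image in ten is 10%, which is below the 20% threshold.
print(propagate_bit(masks, BAD))
```

Under such a rule a single flagged input among many would leave the output mask clear, which is why the pattern seen in this patch is surprising.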

johannct commented 3 years ago

so a priori there is no blatant issue with any of the warps

jchiang87 commented 3 years ago

@johannct Can you also check extensions 2 and 3, the variance and mask extensions?

johannct commented 3 years ago

extension 3 has nanmax systematically set to 'inf'

erykoff commented 3 years ago

I generally agree that there isn't a blatant issue, and I think that @wmwv makes a very good point that whatever is going wrong is propagating to all the pixels in the coadd, which is hard for one image to do! I wonder if it's also possible to rerun the coadd with the existing warps and see if the problem is still there? Presumably we wouldn't have to even run source detection or multiband, just look at the coadd mask plane.

erykoff commented 3 years ago

Extension 3 is ...?

johannct commented 3 years ago

mask? according to @jchiang87

jchiang87 commented 3 years ago

or variance, I can't remember which offhand, but the variance values should be obvious.

johannct commented 3 years ago

@erykoff , this is already running

johannct commented 3 years ago

@jchiang87 wrote: """ or variance, I can't remember which offhand, but the variance values should be obvious. """

Typically (4200, 4200) 4162.3 inf for ext 3.
Typically (4200, 4200) 0 3104 for ext 2.

erykoff commented 3 years ago

Then 2 must be mask, 3 is inverse variance?

johannct commented 3 years ago

I do not know how a config could change for a single patch out of the blue..... @wmwv which datasetType would you look at? There is no deepCoadd_meas_config but there is a deepCoadd_forced_config

jchiang87 commented 3 years ago

There is one image that has a very different range of pixel values:

(4200, 4200) -198.96458 997.3174

Most max pixel values are ~100k. Maybe this is an outlier frame worth looking at?
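
Picking such frames out by eye works for one tract, but the same comparison can be sketched in a few lines. The values below are a small subset of the maxima printed earlier in the thread. The labels are hypothetical placeholders, since that output did not list file names, and the 10% cut is an arbitrary assumption.

```python
from statistics import median

# (label, nanmax) pairs; labels are placeholders, values come from the output above.
frame_max = [
    ("frame_a", 133905.1),
    ("frame_b", 139060.38),
    ("frame_c", 997.3174),
    ("frame_d", 141224.86),
    ("frame_e", 7121.0894),
    ("frame_f", 49608.266),
    ("frame_g", 131863.9),
]


def low_max_frames(pairs, fraction=0.1):
    """Return labels whose maximum is below `fraction` of the median maximum."""
    typical = median(value for _, value in pairs)
    return [label for label, value in pairs if value < fraction * typical]


print(low_max_frames(frame_max))
```

A low maximum on its own does not prove a frame is bad; it only marks it as worth opening.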

erykoff commented 3 years ago

This is what the mask plane looks like on 1,5 (the bad one) and 1,6 (neighboring okay) from the repo at nersc. The bit 2**14=16384 is the "clipped" bit. And it is set almost everywhere on 1,5 and not at all on 1,6, and not following any of the input images. So something went 🤪 here.
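
A minimal sketch of how the fraction of pixels with that bit set could be computed, using plain Python integers. The value 2**14 is taken from the comment above; the helper name `fraction_with_bit` and the toy mask values are assumptions for illustration only.

```python
CLIPPED_VALUE = 2 ** 14  # 16384, the value quoted in the comment above


def fraction_with_bit(mask_values, bit_value):
    """Return the fraction of mask values in which `bit_value` is set."""
    values = list(mask_values)
    if not values:
        return 0.0
    n_set = sum(1 for value in values if value & bit_value)
    return n_set / len(values)


# Invented mask values: three of the five have the 16384 bit set.
toy_mask = [0, 16384, 16384 | 32, 5, 16384]
print(fraction_with_bit(toy_mask, CLIPPED_VALUE))
```

Comparing this fraction between patch 1,5 and patch 1,6 would give a single number for the difference visible in the image.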
image

erykoff commented 3 years ago

This is from, e.g, /global/cfs/cdirs/lsst/production/DC2_ImSim/Run2.2i/desc_dm_drp/v19.0.0-v1/rerun/run2.2i-coadd-wfd-dr6-v1-grizy/deepCoadd/i/4852/1,5.fits. Don't need to look at the sources/run multiband to see the problem.

jchiang87 commented 3 years ago

The pattern on the left indicates that it is a single image that is causing that bit to be set.

erykoff commented 3 years ago

The bit is set on both sides of a chip gap ... maybe a single visit, but not a single PVI/calexp/warp, no?

johannct commented 3 years ago

Good catch @jchiang87, the case you spotted is rerun/run2.2i-coadd-wfd-dr6-v1-grizy/deepCoadd/i/4852/1,5/warp-i-4852-1,5-893769.fits
ds9

God switched on a light bulb....

jchiang87 commented 3 years ago

Yes, a single visit, but that would still correspond to a single warp image that combines the different PVI that overlap with it.

erykoff commented 3 years ago

@jchiang87 Ah yes, duh. @johannct seems problematic.

jchiang87 commented 3 years ago

@johannct I think that image is actually not the culprit. That looks like a visit where only a small corner of the warp was covered by a CCD. The one we want would look like the pattern on the coadd with that clipped bit set.

johannct commented 3 years ago

this is a bit tougher.... I have no better way than to open them all

erykoff commented 3 years ago

I think that the suprême input map can help here. Give me a moment...

jchiang87 commented 3 years ago

I think the pixels in chip gaps should all be nan-valued in the image itself, so the number of nans would match or be close to the number of non-clipped pixels in the coadd...could try comparing those numbers to see....
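
The comparison suggested here could look roughly like the sketch below. It uses flattened lists of toy values rather than real 4200 x 4200 arrays. The helper names, the 5% tolerance, and the use of 16384 as the clipped bit value (quoted earlier in the thread) are assumptions.

```python
import math


def count_nan(pixels):
    """Number of NaN pixels in a flattened warp image."""
    return sum(1 for value in pixels if math.isnan(value))


def count_without_bit(mask_values, bit_value):
    """Number of coadd mask pixels in which `bit_value` is not set."""
    return sum(1 for value in mask_values if not (value & bit_value))


def counts_agree(n_nan, n_unclipped, tolerance=0.05):
    """True if the two counts differ by at most `tolerance` of the larger one."""
    largest = max(n_nan, n_unclipped)
    if largest == 0:
        return True
    return abs(n_nan - n_unclipped) / largest <= tolerance


# Toy data: three NaN pixels in the warp, three unclipped pixels in the coadd mask.
warp_pixels = [1.0, float("nan"), float("nan"), 2.0, 3.0, float("nan")]
coadd_mask = [16384, 0, 0, 16384, 16384, 4]

n_nan = count_nan(warp_pixels)
n_unclipped = count_without_bit(coadd_mask, 16384)
print(n_nan, n_unclipped, counts_agree(n_nan, n_unclipped))
```

On real arrays the same counts would more practically be computed with NumPy, for example `np.isnan(data).sum()`.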

jchiang87 commented 3 years ago

Another test is that there are at least 5 CCDs contributing to that warp. 4 or fewer is more typical. I think the number of contributing CCDs is in the warp headers somewhere.