Masking differences between specter and gpu_specter

marcelo-alvarez commented 1 year ago

The value of chi2pix is different between specter and gpu_specter causing the BAD2DFIT mask to differ when chi2pix > 100 (see desispec/py/scripts/extract.py) in specter but chi2pix < 100 in gpu_specter.

Example given below. @dmargala @sbailey any idea what's going on here?

bit7-20220613-00139620-z4-fiber-181

import fitsio
import numpy as np
import matplotlib.pyplot as plt

prepath1="/global/cfs/cdirs/desi/users/malvarez/spectro/redux/perlval-cori-haswell/exposures/"
prepath2="/global/cfs/cdirs/desi/users/malvarez/spectro/redux/perlval-perlmutter-cpu/exposures/"
prepath3="/global/cfs/cdirs/desi/users/malvarez/spectro/redux/perlval-perlmutter-gpu/exposures/"

def showframes(night,exp,cam):
    offset=0.03

    path1=prepath1+night+"/"+exp+"/"
    frame="frame-"+cam+"-"+exp+".fits.gz"

    mask1=fitsio.read(prepath1+night+"/"+exp+"/"+frame,"MASK")[181]
    mask2=fitsio.read(prepath2+night+"/"+exp+"/"+frame,"MASK")[181]
    mask3=fitsio.read(prepath3+night+"/"+exp+"/"+frame,"MASK")[181]

    chi2pix1=fitsio.read(prepath1+night+"/"+exp+"/"+frame,"CHI2PIX")[181]
    chi2pix2=fitsio.read(prepath2+night+"/"+exp+"/"+frame,"CHI2PIX")[181]
    chi2pix3=fitsio.read(prepath3+night+"/"+exp+"/"+frame,"CHI2PIX")[181]

    print(prepath3+night+"/"+exp+"/"+frame)
    b71=(mask1 >> 7) & 1
    b72=(mask2 >> 7) & 1
    b73=(mask3 >> 7) & 1
    low=np.where(b71==1)[0].min()
    high=np.where(b71==1)[0].max()
    b71 = b71 * chi2pix1.max()
    b72 = b72 * chi2pix2.max()
    b73 = b73 * chi2pix3.max()
    chi2pixmax=chi2pix1[low:high+1].max()
    w1=np.arange(len(mask1)); w2=w1+offset; w3=w1-offset
    plt.clf()
    plt.plot(w1,b71,'o',c='k',mfc='none',label='cori')
    plt.plot(w2,b72,'o',c='r',mfc='none',label='perlmutter-cpu')
    plt.plot(w3,b73,'o',c='b',mfc='none',label='perlmutter-gpu')     
    plt.plot(w1,b71,drawstyle='steps-mid',c='k',label='BADFIT')
    plt.plot(w2,b72,drawstyle='steps-mid',c='r')
    plt.plot(w3,b73,drawstyle='steps-mid',c='b')
    plt.plot(w1,chi2pix1,c='k',ls=':',label='chi2pix')
    plt.plot(w2,chi2pix2,c='r',ls=':')
    plt.plot(w3,chi2pix3,c='b',ls=':')
    plt.plot(w1,chi2pix1,'o',mfc='none',c='k')
    plt.plot(w2,chi2pix2,'o',mfc='none',c='r')
    plt.plot(w3,chi2pix3,'o',mfc='none',c='b')
    plt.plot(w3,chi2pix3/chi2pix1*1000,c='grey',label='chi2pix ratio x 1000')
    plt.title("night: "+night+"   exp: "+exp+"   cam: "+cam+"   fiber: 181")
    plt.gca().set_xlabel('wavelength bin')
    plt.gca().set_ylabel(r'chi2pix')
    plt.gca().set_xlim([low-(high-low+1)*0.5,high+(high-low+1)*0.9])
    plt.gca().set_ylim([-30,chi2pixmax*1.1])

    plt.legend()
    plt.savefig("bit7-"+night+"-"+exp+"-"+cam+"-fiber-181.png",bbox_inches="tight",dpi=500)
showframes("20220613","00139620","z4")

sbailey commented 1 year ago

No immediate answer, but a few bits of information:

chi2pix is tracking the goodness of fit of the PSF model to the CCD pixels covered by each wavelength bin. The reason is that unmasked cosmics or other CCD defects won't be well fit by the PSF, leading to bad extractions. If we know that the PSF is a poor model for whatever those pixels are, then we can flag them at the spectrum level even if we failed to flag them at the CCD level.

In gpu_specter, the chi2pix calculation comes from extract/gpu.py lines 578-580 used by gpu.py line 632. The equivalent code in original specter is extract/ex2d.py lines 248-260.

I suspect that the discrepancy is coming from the second portion of that original code:

chi2x = (A.T.dot(chi.ravel()**2) * ~bad) / (psfweight + bad)
chi2x = chi2x.reshape(subnspec, subnwave)
...
chi2pix[iispec[keep], iwave:iwave+wavesize] = chi2x[keep, nlo:-nhi]

I haven't re-parsed all of the indexing, but it looks like the gpu_specter code might be doing the equivalent of chi2x, but not sub-selecting the region that applied to the "core" good region of the patch extraction saved into chi2pix.

I think a symptom of that would be bands in wavelength where chi2pix agrees more or less between specter and gpu_specter. i.e. near the subpatch boundaries (50 wavelength bins = 40 Angstrom) there would be less agreement, and in the middle of sub-patches there would be better agreement.

marcelo-alvarez commented 1 year ago

No immediate answer, but a few bits of information:

chi2pix is tracking the goodness of fit of the PSF model to the CCD pixels covered by each wavelength bin. The reason is that unmasked cosmics or other CCD defects won't be well fit by the PSF, leading to bad extractions. If we know that the PSF is a poor model for whatever those pixels are, then we can flag them at the spectrum level even if we failed to flag them at the CCD level.

In gpu_specter, the chi2pix calculation comes from extract/gpu.py lines 578-580 used by gpu.py line 632. The equivalent code in original specter is extract/ex2d.py lines 248-260.

I suspect that the discrepancy is coming from the second portion of that original code:
chi2x = (A.T.dot(chi.ravel()**2) * ~bad) / (psfweight + bad)
chi2x = chi2x.reshape(subnspec, subnwave)
...
chi2pix[iispec[keep], iwave:iwave+wavesize] = chi2x[keep, nlo:-nhi]
I haven't re-parsed all of the indexing, but it looks like the gpu_specter code might be doing the equivalent of chi2x, but not sub-selecting the region that applied to the "core" good region of the patch extraction saved into chi2pix.

I think a symptom of that would be bands in wavelength where chi2pix agrees more or less between specter and gpu_specter. i.e. near the subpatch boundaries (50 wavelength bins = 40 Angstrom) there would be less agreement, and in the middle of sub-patches there would be better agreement.

Thanks @sbailey for this. I plotted the chi2pix for cori (black; overlaps perlmutter-cpu, red) and perlmutter-gpu (blue) as well as the ratio of the perlmutter-gpu chi2pix to the cori chi2pix (grey). All three chi2pix curves are shifted upward by 1e3 for clarity, so the threshold for BAD2DFIT here is at a value of 1e5 (100 x 1e3).

As you can see, there are no bands where the chi2pix values agree (ratio not equal to 1 for the grey curve). Typical values of chi2pix for gpu are about 0.5 of those for cpu, while there are spikes where chi2pix is ~10 times higher for gpu than cpu. Finally, at the problematic masked region at bin ~1570, the ratio of gpu chi2pix to cpu chi2pix is even lower, at ~0.3.

bit7-20220613-00139620-z4-fiber-181

There seems to be some periodicity in the gpu chi2pix spikes that is absent in the cpu version, but I don't really know what to make of it.

marcelo-alvarez commented 1 year ago

I haven't re-parsed all of the indexing, but it looks like the gpu_specter code might be doing the equivalent of chi2x, but not sub-selecting the region that applied to the "core" good region of the patch extraction saved into chi2pix.

@sbailey if the padding is really 2d (as implied by the keep in the first dimension of chi2pix; I didn't look at the intermediate code in the ...) then wouldn't this imply that nearly every wavelength bin on fibers on either one or both sides of subbundle egdes would be assigned (erroneously) a chi2pix value from a padded region?

If so, then your 1d logic in the wavelength direction for regions that agree well far from the padding would apply not just to bins, but also to fibers themselves. But maybe fibers are so far apart on the ccd that there is no padding of the subbundles...

sbailey commented 1 year ago

The padding is 2D, i.e. each patch includes boundary fibers and boundary wavelengths not included in the final solution. However, I think fiber 181 from your example is an "interior" fiber in a subbundle of size 5 so I don't see how it would be impacted by that. It looks like we'll need a deep dive comparing both codes side by side for the values of chi, bad, psfweight for a single patch to identify where the ~2x discrepancy is coming from.

dmargala commented 1 year ago

There are a few of differences in the code paths for the calculation ofchi2pix between gpu_specter and specter. Some of them are subtle and I don't expect them to matter but still need to double check that. So far, the most significant difference in the code paths seems to be:

using reconvolved extracted flux (gpu_specter) vs raw extracted flux (specter) for the image pixel model

The image pixel model is used to in the calculation of chi2pix and is computed with:

model = A.dot(flux)

where A is the full projection matrix for the patch with shape (ny*nx, nspectot*nwavetot) and flux is the extracted flux with shape (nspectot*nwavetot,).

sbailey commented 1 year ago

Good catch. chi2pix should be calculated with raw extracted flux (like specter), not the reconvolved flux = R * xflux. Projecting to 2D pixels with A is convolving with the 2D PSF, while the reconvolved flux using R is doing the equivalent in 1D, so using that for chi2pix is effectively double convolving.

However, I would have expected using the reconvolved flux to make the chi2 worse, not better. So this may not be the entire explanation, but at minimum it is something that needs to be fixed. Let's get that fixed and then re-evaluate the chi2pix ratio and continue debugging from there.

@marcelo-alvarez @dmargala is the un-reconvolved flux available in memory at the point that chi2pix is calculated, or will this require some dataflow surgery?

dmargala commented 1 year ago

Okay, I can submit a PR with a fix for that in gpu_specter. It will require a little data surgery in both the cpu and gpu code paths.

sbailey commented 1 year ago

PR #76 was a step in the right direction (calculating chi2pix with raw extracted xflux instead of 1D reconvolved flux), but discrepancies remain. To avoid confusion from rounding errors etc. on different systems, let's focus our chi2pix comparisons on:

classic specter run on Perlmutter CPU, vs
gpu_specter cpu-mode on Perlmutter CPU

i.e. not Cori, and not GPUs, which gives us the closest hardware comparison for testing classic specter with gpu_specter algorihtms. When we get those to agree closely, then we can expand to comparing with gpu_specter gpu-mode. I think Marcelo already demonstrated close agreement between classic specter on Cori CPU vs. classic specter Perlmutter CPU, but a lot of comparisons have been flying by so I'd also like us to double check that when the dust settles.

marcelo-alvarez commented 1 year ago

This makes sense @sbailey. Comparison of these quantities for gpu_specter CPU vs gpu_specter GPU, on the main branch, has not yet been done in the main DESI environment on Perlmutter (either before or after PR #76) to my knowledge. I would assume they give the same answer up to hardware differences, but will be good to check consistency in the code paths, particularly for chi2pix.

dmargala commented 1 year ago

I have discovered that the desi_extract_spectra does not propogate the --psferr argument to specter's ex2d. It is passed through to gpu_specter. specter is looking up psferr from the psf object and seems to be using a value of 0.01 for this frame instead of the value of 0.1 supplied on the command line. psferr is used to compute modelivar which is used the chi2pix calculation.

# generate data on perlmutter cpu node
salloc -A nstaff -C cpu -N 1 -q interactive -t 30
source /global/common/software/desi/desi_environment.sh main
srun -n 32 -c 2 --cpu-bind=cores desi_extract_spectra -w 7520.0,9824.0,0.8 -i /global/cfs/cdirs/desi/users/malvarez/spectro/redux/perlval-perlmutter-cpu/preproc/20220404/00128680/preproc-z5-00128680.fits.gz -p /global/cfs/cdirs/desi/users/malvarez/spectro/redux/perlval-perlmutter-cpu/exposures/20220404/00128680/psf-z5-00128680.fits -o /global/cfs/cdirs/desi/users/dmargala/frame-z5-00128680-gpuspecter-pmcpu-psferr-0.01.fits.gz --psferr 0.01 --gpu-specter --nsubbundles 5 --mpi --barycentric-correction
srun -n 32 -c 2 --cpu-bind=cores desi_extract_spectra -w 7520.0,9824.0,0.8 -i /global/cfs/cdirs/desi/users/malvarez/spectro/redux/perlval-perlmutter-cpu/preproc/20220404/00128680/preproc-z5-00128680.fits.gz -p /global/cfs/cdirs/desi/users/malvarez/spectro/redux/perlval-perlmutter-cpu/exposures/20220404/00128680/psf-z5-00128680.fits -o /global/cfs/cdirs/desi/users/dmargala/frame-z5-00128680-gpuspecter-pmcpu-psferr-0.1.fits.gz --psferr 0.1 --gpu-specter --nsubbundles 5 --mpi --barycentric-correction
srun -n 20 -c 2 --cpu-bind=cores desi_extract_spectra -w 7520.0,9824.0,0.8 -i /global/cfs/cdirs/desi/users/malvarez/spectro/redux/perlval-perlmutter-cpu/preproc/20220404/00128680/preproc-z5-00128680.fits.gz -p /global/cfs/cdirs/desi/users/malvarez/spectro/redux/perlval-perlmutter-cpu/exposures/20220404/00128680/psf-z5-00128680.fits -o /global/cfs/cdirs/desi/users/dmargala/frame-z5-00128680-specter-pmcpu-psferr-0.1.fits.gz --psferr 0.1 --nsubbundles 6 --mpi --barycentric-correction
srun -n 20 -c 2 --cpu-bind=cores desi_extract_spectra -w 7520.0,9824.0,0.8 -i /global/cfs/cdirs/desi/users/malvarez/spectro/redux/perlval-perlmutter-cpu/preproc/20220404/00128680/preproc-z5-00128680.fits.gz -p /global/cfs/cdirs/desi/users/malvarez/spectro/redux/perlval-perlmutter-cpu/exposures/20220404/00128680/psf-z5-00128680.fits -o /global/cfs/cdirs/desi/users/dmargala/frame-z5-00128680-specter-pmcpu-psferr-0.01.fits.gz --psferr 0.01 --nsubbundles 6 --mpi --barycentric-correction

import fitsio
import matplotlib.pyplot as plt

plt.rcParams.update({'font.size': 14})

fig, ax = plt.subplots(nrows=1, figsize=(12, 4))

def plot_frame_hdu_slice(ax, path, hdu="CHI2PIX", fiber=181, waveslice=slice(1400, 1450), **kwargs):
    wave = fitsio.read(path, "WAVELENGTH")
    hdu = fitsio.read(path, hdu)[fiber]
    ax.plot(wave[waveslice], hdu[waveslice].T, **kwargs)

pmcpu_specter = "/global/cfs/cdirs/desi/users/dmargala/frame-z5-00128680-specter-pmcpu-psferr-0.1.fits.gz"
pmcpu_specter_psferr001 = "/global/cfs/cdirs/desi/users/dmargala/frame-z5-00128680-specter-pmcpu-psferr-0.01.fits.gz"
pmcpu_gpu_specter = "/global/cfs/cdirs/desi/users/dmargala/frame-z5-00128680-gpuspecter-pmcpu-psferr-0.1.fits.gz"
pmcpu_gpu_specter_psferr001 = "/global/cfs/cdirs/desi/users/dmargala/frame-z5-00128680-gpuspecter-pmcpu-psferr-0.01.fits.gz"

plot_frame_hdu_slice(ax, pmcpu_specter, label='specter (psferr=0.1)', marker='+', lw=0, color='C0', mfc='none')
plot_frame_hdu_slice(ax, pmcpu_specter_psferr001, label='specter (psferr=0.01)', marker='s', lw=0, color='C0', mfc='none')
plot_frame_hdu_slice(ax, pmcpu_gpu_specter, label='gpu_specter (cpu) (psferr=0.1)', marker='x', lw=0, color='C1', mfc='none')
plot_frame_hdu_slice(ax, pmcpu_gpu_specter_psferr001, label='gpu_specter (cpu) (psferr=0.01)', marker='o', lw=0, color='C1', mfc='none')

ax.set_ylabel('chi2pix')
ax.set_xlabel('wavelength')
ax.grid()
ax.legend()

marcelo-alvarez commented 1 year ago

Thanks @dmargala. If I understand correctly, the intended value of psferr (according to the command line) is 0.1, and with this value the chi2pix does not exceed 100 for fiber 181 on exposure 128680 around 8650, and so the flux would not have been masked in this region for specter. However, the visual inspection of fiber 181 on exposure 128680 by @julienguy indicated that not masking resulted in the wrong redshift.

@sbailey could it be that the more appropriate value that should be passed in is 0.01, as is apparently hardwired into specter, and that the threshold of 100 for chi2pix was chosen empirically with psferr=0.01? Alternatively, one might consider a lower threshold than 100 if psferr=0.1 is the more appropriate value to use. This hinges on 1) the current specter masking (with psferr=0.01) giving the correct redshift for fiber 181 on 128680 and 2) it being generally the case that using psferr=0.1 with a threshold of 100 is too conservative and leads to spurious redshifts in most of the cases where the redshifts differ between psferr=0.01 and 0.1.

julienguy commented 1 year ago

We should use the threshold that has been used for the fuji and guadalupe reductions. So psferr=0.01.

marcelo-alvarez commented 1 year ago

So we should either 1) fix specter so that it actually uses the psferr value passed on the command line and then change the value being passed in to 0.01 (or make 0.01 the default, if it isn't already, and leave it off the command line) or 2) drop the psferr argument from desi_extract_spectra and hard code psferr=0.01 to both specter and gpu_specter. It seems safer to keep psferr hardcoded to me, since doing so would have avoided this problem and it seems like the value used will change very infrequently if at all going forward. But either option is fine with me.

sbailey commented 1 year ago

Thanks for tracking this down. Summarizing my understanding:

Both specter and gpu_specter already use psferr if provided. If it isn't provided, they fall back to the psf file PSFERR keyword (which isn't set in current DESI PSFs), and lastly resort to a default of 0.01. The bug leading to the discrepancy was on the desi_extract_spectra side, which was passing the --psferr option to gpu_specter but not specter, and furthermore it was using an incorrect 0.1 value instead of 0.01 which we were accidentally correctly using. The result was that gpu_specter was using 0.1 (incorrect psferr, but that's what we told it to use) and specter was using 0.01 (correct, but only accidentally because we failed to ask it to use the incorrect value...)

desihub/desispec #1855 fixes this, and I verified by remaking Daniel's plot showing that specter and gpu_specter now agree on chi2pix and respect the --psferr option:

sbailey commented 1 year ago

@marcelo-alvarez: after PRs gpu_specter #76 and desihub/desispec#1855, the chi2pix agreement appears much better than before. Please repeat your comparison on chi2pix and what gets masked and then hopefully we can return to the big picture comparison of redshifts.

marcelo-alvarez commented 1 year ago

I've verified that chi2pix is now nearly identical after after PRs gpu_specter https://github.com/desihub/gpu_specter/pull/76 and https://github.com/desihub/desispec/pull/1855, with no difference in masking. This issue has been resolved.

desihub / gpu_specter

Masking differences between specter and gpu_specter #75