NaNs in extraction, icov not positive definite

dmargala commented 3 years ago

I processed 498 exposures (379 night-tiles) from 2021-04 with gpu_specter.

There were 772 patches with flux=NaN due to silent failure at the Cholesky solve step (icov is not positive definite). The NaNs are eventually zeroed out before writing to disk.

I hacked gpu_specter to save an image of the patches when extraction fails. It looks like nearly all the affected patches have several masked pixels. See:

https://data.desi.lbl.gov/desi/spectro/redux/dmargala/43429/nan/nan-table.html

For reference, here is a simplified version of the extraction code:

y = A.T.dot(Ninv).dot(p)
icov  = A.T.dot(Ninv).dot(A)
flux = solve(icov, y)

where p is the pixel vector, image.pix, Ninv is the inverse pixel noise matrix, diag(image.ivar*(image.mask == 0)), and A is the projection matrix. In practice, the diagonal entries of icov are modified when the corresponding pixel weights are below some threshold.

For one of the failures, I compared the icov and y between specter and gpu_specter(gpu). They are mostly similar, not identical because patch padding is a little different, but I don't see anything obvious out of place between the two. The icov from specter is not positive definite as well so the likely culprit is that the Cholesky solve in gpu_specter (cupyx.lapack.posv) is less forgiving then the solve in specter (scipy.sparse.linalg.spsolve).

Cholesky solve is used in gpu_specter because there is a batched implementation in CuPy/CUDA cuSolver which allows an entire sub-bundle of patches to be extracted in parallel on the GPU.

Note that there are NaNs from specter extraction in the everest logs as well but they are not as common. I count 10 from specter vs 772 from gpu_specter for 2021-04 data:

> find /global/cfs/cdirs/desi/spectro/redux/everest/run/scripts/night/ -name prestdstar-*log | xargs -l grep -H flux=NaN
/global/cfs/cdirs/desi/spectro/redux/everest/run/scripts/night/20210101/prestdstar-20210101-00070356-a0123456789-43973788.log:ERROR: spectra 394:400 wavelengths 8350.4:8408.8 masking 250 flux=NaN pixels (35.0% input pixels masked)
/global/cfs/cdirs/desi/spectro/redux/everest/run/scripts/night/20210519/prestdstar-20210519-00089266-a0123456789-43966404.log:ERROR: spectra 225:230 wavelengths 9672.3:9727.5 masking 200 flux=NaN pixels (0.0% input pixels masked)
/global/cfs/cdirs/desi/spectro/redux/everest/run/scripts/night/20210428/prestdstar-20210428-00086503-a0123456789-43964538.log:ERROR: spectra 119:125 wavelengths 6231.3:6287.3 masking 250 flux=NaN pixels (4.3% input pixels masked)
/global/cfs/cdirs/desi/spectro/redux/everest/run/scripts/night/20210221/prestdstar-20210221-00077390-a0123456789-43977755.log:ERROR: spectra 378:384 wavelengths 5992.6:6048.6 masking 200 flux=NaN pixels (12.1% input pixels masked)
/global/cfs/cdirs/desi/spectro/redux/everest/run/scripts/night/20210412/prestdstar-20210412-00084535-a0123456789-43962320.log:ERROR: spectra 244:250 wavelengths 9472.0:9528.0 masking 250 flux=NaN pixels (79.7% input pixels masked)
/global/cfs/cdirs/desi/spectro/redux/everest/run/scripts/night/20210412/prestdstar-20210412-00084517-a0123456789-43962192.log:ERROR: spectra 50:55 wavelengths 9390.2:9447.0 masking 200 flux=NaN pixels (68.3% input pixels masked)
/global/cfs/cdirs/desi/spectro/redux/everest/run/scripts/night/20210412/prestdstar-20210412-00084517-a0123456789-43962192.log:ERROR: spectra 32:38 wavelengths 9389.4:9447.0 masking 200 flux=NaN pixels (18.8% input pixels masked)
/global/cfs/cdirs/desi/spectro/redux/everest/run/scripts/night/20210412/prestdstar-20210412-00084517-a0123456789-43962192.log:ERROR: spectra 0:5 wavelengths 9309.4:9367.0 masking 200 flux=NaN pixels (94.0% input pixels masked)
/global/cfs/cdirs/desi/spectro/redux/everest/run/scripts/night/20210412/prestdstar-20210412-00084517-a0123456789-43962192.log:ERROR: spectra 3:9 wavelengths 9308.6:9367.8 masking 200 flux=NaN pixels (93.6% input pixels masked)
/global/cfs/cdirs/desi/spectro/redux/everest/run/scripts/night/20210412/prestdstar-20210412-00084517-a0123456789-43962192.log:ERROR: spectra 3:9 wavelengths 9348.6:9407.0 masking 200 flux=NaN pixels (39.5% input pixels masked)
/global/cfs/cdirs/desi/spectro/redux/everest/run/scripts/night/20210412/prestdstar-20210412-00084517-a0123456789-43962192.log:ERROR: spectra 11:17 wavelengths 9269.4:9327.8 masking 200 flux=NaN pixels (99.9% input pixels masked)
/global/cfs/cdirs/desi/spectro/redux/everest/run/scripts/night/20210412/prestdstar-20210412-00084517-a0123456789-43962192.log:ERROR: spectra 11:17 wavelengths 9348.6:9407.8 masking 200 flux=NaN pixels (50.9% input pixels masked)
/global/cfs/cdirs/desi/spectro/redux/everest/run/scripts/night/20210514/prestdstar-20210514-00088390-a0123456789-43966017.log:ERROR: spectra 457:463 wavelengths 8710.9:8770.9 masking 200 flux=NaN pixels (5.3% input pixels masked)
/global/cfs/cdirs/desi/spectro/redux/everest/run/scripts/night/20210417/prestdstar-20210417-00085186-a0123456789-43963732.log:ERROR: spectra 103:109 wavelengths 5830.9:5887.7 masking 200 flux=NaN pixels (0.0% input pixels masked)
/global/cfs/cdirs/desi/spectro/redux/everest/run/scripts/night/20210107/prestdstar-20210107-00071273-a0123456789-43975287.log:ERROR: spectra 400:405 wavelengths 6592.7:6647.9 masking 200 flux=NaN pixels (12.0% input pixels masked)
/global/cfs/cdirs/desi/spectro/redux/everest/run/scripts/night/20210628/prestdstar-20210628-00096456-a0123456789-43969006.log:ERROR: spectra 319:325 wavelengths 7393.1:7447.5 masking 250 flux=NaN pixels (0.8% input pixels masked)

sbailey commented 3 years ago

For consideration, does it help to insert icov = 0.5*(icov + icov.T) to ensure exact symmetry before calling flux = solve(icov, y)? specter does that in specter.ex2d.resolution_from_icov line 533 to provide some robustness to rounding errors resulting in slightly non-symmetric matrices. It doesn't apply that earlier in the code at the point scipy.sparse.linalg.spsolve is called, so it isn't clear to me if that is really needed or if it was added while chasing a red herring.

I think it's worth trying in gpu_specter to see if it helps or significantly changes the situation. Otherwise at its core it looks like we need more regularization for ill-constrained patches to have better matrix conditioning.

dmargala commented 3 years ago

I have tried experimenting with symmetrization trick but I think there is a bigger issue at play.

Here is an interesting example from:

night = '20210629'
expid = 96676
camera = 'r3'
specrange = (365, 369)
waverange = (7160, 7199.2)

it looks like there is a spectra that is completely missing in that patch. Quite a challenge for extraction!

solve(icov, y) for that missing spectrum

The overlapping zeros in this plot show that the solution is going to be poorly constrained:

In specter, spsolve doesn't return nan's on that patch but the result is less than ideal:

desihub / gpu_specter

NaNs in extraction, icov not positive definite #67