desihub / gpu_specter

Scratch work for porting spectroperfectionism extractions to GPUs
BSD 3-Clause "New" or "Revised" License

fixes for misc changes and deprecations #80

Closed dmargala closed 3 months ago

dmargala commented 6 months ago

This PR fixes a number of issues related to changes and deprecations in upstream dependencies.

851b4b4 affects attempts to run unit tests at NERSC, since those try to use a preproc input file from the "cascades" reduction. That reduction is no longer available, so I changed the path to the only other location where I could find that exposure, which is in the "daily" reductions.

Each commit fixes a distinct issue.

sbailey commented 6 months ago

Thanks for the cleanup. I hadn't realized that tests were being skipped because the old cascades production files were removed to free up disk space. Rather than switching to daily (which might also change someday), let's switch the tests to use an exposure in the public Early Data Release (EDR) Fuji production. I tried switching to

    nerscdir = '/global/cfs/cdirs/desi/public/edr/spectro/redux/fuji'
    nerscurl = 'https://data.desi.lbl.gov/public/edr/spectro/redux/fuji'
    night = 20201214
    expid = 67784

and updating the paths to use e.g. {night}/{expid:08d}/..., but that particular exposure failed TestCore.test_compare_gpu with

...
>       np.testing.assert_array_less(np.abs(pull), pull_threshold)
E       AssertionError: 
E       Arrays are not less-ordered
E       
E       Mismatched elements: 21 / 46520 (0.0451%)
E       Max absolute difference: 0.03594679
E       Max relative difference: 71.89358866
E        x: array([[1.031810e-12, 5.475983e-13, 1.303668e-13, ..., 2.794550e-12,
E               2.557298e-12, 1.272082e-12],
E              [0.000000e+00, 0.000000e+00, 0.000000e+00, ..., 4.972220e-12,...
E        y: array(0.0005)

py/gpu_specter/test/test_core.py:247: AssertionError

My uncommitted changes are in /global/common/software/desi/users/sjbailey/gpu_specter.

Please take a look at that test and see whether this is a fundamental problem or just a bad-luck choice of exposure with harmless differences, then update the tests to use something in Fuji with fully public paths and URLs.

Other changes in this PR look good.

Feature creep: could you also try to silence warnings like

py/gpu_specter/test/test_spots.py::TestPSFSpots::test_compare_gpu
  /global/common/software/desi/perlmutter/desiconda/20230111-2.1.0/conda/lib/python3.10/site-packages/numba/cuda/dispatcher.py:488: NumbaPerformanceWarning: Grid size 18 will likely result in GPU under-utilization due to low occupancy.
    warn(NumbaPerformanceWarning(msg))

dmargala commented 6 months ago

I'll take a look at that test failure with the new exposure. We have a pretty strict pull threshold in that test since it is comparing cpu / gpu code paths within gpu_specter (some tests that compare specter / gpu_specter have looser thresholds).

Re the noisy NumbaPerformanceWarning: in #71 I tried to disable those warnings for typical usage, but it seems pytest ignores that?

I'm not sure of the best way to silence those for pytest. I imagine we don't want to silence warnings too aggressively.
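
For reference, the in-code suppression looks conceptually like the sketch below (a minimal sketch, not necessarily the exact #71 change); one plausible explanation is that pytest applies its own warning filters around each test, which can override filters set at import time:

    import warnings

    from numba.core.errors import NumbaPerformanceWarning

    #- suppress the low-occupancy grid-size warnings from numba.cuda;
    #- pytest may re-apply its own filters and undo this at test time
    warnings.simplefilter('ignore', category=NumbaPerformanceWarning)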

It seems like pytest can be configured to ignore specific warnings by adding something to a pytest.ini file (or to pyproject.toml with slightly different syntax):

[pytest]
filterwarnings =
    ignore::numba.core.errors.NumbaPerformanceWarning
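
For completeness, the pyproject.toml equivalent would be something like:

    [tool.pytest.ini_options]
    filterwarnings = [
        "ignore::numba.core.errors.NumbaPerformanceWarning",
    ]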

Do you know if there is a desihub precedent for this?

sbailey commented 4 months ago

@dmargala I just re-discovered this old PR.

dmargala commented 4 months ago

I added the pytest.ini to silence numba.core.errors.NumbaPerformanceWarning. I see there's still one NumbaDeprecationWarning, but it's coming from specter, so I don't think we should do anything about it here.

...lib/python3.11/site-packages/specter/util/util.py:219: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
    @numba.jit
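
For reference, the upstream fix would presumably just be passing nopython explicitly to numba.jit; a minimal sketch with a hypothetical function:

    import numba

    #- passing nopython=True explicitly avoids the NumbaDeprecationWarning
    #- about the default changing in Numba 0.59
    @numba.jit(nopython=True)
    def scaled_sum(x, scale):
        return scale * x.sum()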

Regarding the test failure with fuji night=20201214 expid=67784: it looks like some extreme flux values are tripping things up, but I'm not sure where the discrepancy is coming from.

[figure: |pull| vs wavelength for the cpu/gpu extraction comparison, with |flux| and ivar*flux panels for the first 20 spectra of frame-r0-00067784; produced by the script below]


import os

import numpy as np
import matplotlib.pyplot as plt

from gpu_specter.io import read_img, read_psf, read_frame
from gpu_specter.core import extract_frame
from gpu_specter.test.util import find_test_file

desi_spectro_redux = '/global/cfs/cdirs/desi/public/edr/spectro/redux'
frame_path = 'fuji/exposures/20201214/00067784/frame-r0-00067784.fits'
psf_path   = 'fuji/exposures/20201214/00067784/psf-r0-00067784.fits'
image_path = 'fuji/preproc/20201214/00067784/preproc-r0-00067784.fits'

framedata = read_frame(os.path.join(desi_spectro_redux, frame_path))

psfdata = read_psf(os.path.join(desi_spectro_redux, psf_path))
imgdata = read_img(os.path.join(desi_spectro_redux, image_path))

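#- run serially (no MPI communicator)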
rank = 0
size = 1
comm = None

bundlesize = 10
wavelength = '5760.0,7620.0,0.8'

specmin = 0
nspec = 20
nwavestep = 50
nsubbundles = 2

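#- CPU reference extraction (gpu=None)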
frame_cpu = extract_frame(
    imgdata, psfdata, bundlesize,
    specmin, nspec,
    wavelength=wavelength,
    nwavestep=nwavestep, nsubbundles=nsubbundles,
    comm=comm,
    gpu=None,
    loglevel='WARN',
)

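#- GPU extraction (gpu=True) for comparison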
frame_gpu = extract_frame(
    imgdata, psfdata, bundlesize,
    specmin, nspec,
    wavelength=wavelength,
    nwavestep=nwavestep, nsubbundles=nsubbundles,
    comm=comm,
    gpu=True,
    loglevel='WARN',
    batch_subbundle=False,
)

np.testing.assert_equal(frame_cpu['specflux'].shape, frame_gpu['specflux'].shape)

#- Compute pull (dflux*sigma) ignoring masked pixels
mask_cpu = (
    (frame_cpu['specivar'] == 0) |
    (frame_cpu['chi2pix'] > 100) |
    (frame_cpu['pixmask_fraction'] > 0.5)
)
mask_gpu = (
    (frame_gpu['specivar'] == 0) |
    (frame_gpu['chi2pix'] > 100) |
    (frame_gpu['pixmask_fraction'] > 0.5)
)
mask = mask_cpu | mask_gpu
var_cpu = np.reciprocal(~mask*frame_cpu['specivar'] + mask)
var_gpu = np.reciprocal(~mask*frame_gpu['specivar'] + mask)
ivar = np.reciprocal(~mask*(var_cpu + var_gpu) + mask)
dflux = frame_cpu['specflux'] - frame_gpu['specflux']
pull = ~mask*dflux*np.sqrt(ivar)
pull_threshold = 5e-4

#- these are the tests that fail
# np.testing.assert_array_less(np.abs(pull), pull_threshold)
# eps_double = np.finfo(np.float64).eps
# np.testing.assert_allclose(~mask*frame_cpu['specflux'], ~mask*frame_gpu['specflux'], rtol=1e-3, atol=0)
# np.testing.assert_allclose(~mask*frame_cpu['specivar'], ~mask*frame_gpu['specivar'], rtol=1e-3, atol=0)
# idiag = frame_cpu['Rdiags'].shape[1] // 2
# np.testing.assert_allclose(~mask*frame_cpu['Rdiags'][:, idiag, :], ~mask*frame_gpu['Rdiags'][:, idiag, :], rtol=1e-4)

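#- plot |pull| vs wavelength alongside the input flux and ivar*flux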
fig, axes = plt.subplots(nrows=3, figsize=(8, 6), sharex=True)
ax = axes[0]

ax.plot(frame_cpu['wave'], np.abs(pull.T))
ax.axhline(pull_threshold, color='red', label='pull threshold')
ax.set_ylabel('pull')
ax.set_yscale('log')
ax.legend()
ax.set_title('pull cpu vs gpu extraction test')
ax.grid()
ax = axes[1]

ax.plot(framedata['wave'], np.abs(framedata['flux'])[specmin:nspec].T)
ax.set_ylabel('log abs flux')
ax.set_yscale('log')
ax.set_title('\n'.join([f'First {nspec} spectra', frame_path]))
ax.grid()

ax = axes[2]
ax.plot(framedata['wave'], np.abs(framedata['ivar']*framedata['flux'])[specmin:nspec].T)
ax.set_ylabel('ivar * flux')
ax.set_xlabel('wave')
ax.set_title('\n'.join([f'First {nspec} spectra', frame_path]))
ax.grid()

plt.tight_layout()
plt.show()

sbailey commented 4 months ago

Thanks for the test plots and code. Looking into it further, the out-of-threshold pull values aren't on the spectra with large flux values themselves, but rather on the neighboring fibers, which are affected by the high flux through the wings of the PSF. I tried a few other cameras and ran into similar problems on r1, r2, and r4, though camera r3 is OK.

At this point we are effectively committed to the GPU code being good/consistent enough. We could change the test rule to something like requiring the 99.9th percentile of the pull to be within threshold, instead of requiring all values to pass.
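
A minimal sketch of what that relaxed check could look like, reusing the pull and pull_threshold variables from the snippet above:

    #- require the 99.9th percentile of |pull| to be within threshold,
    #- rather than every single value
    np.testing.assert_array_less(np.percentile(np.abs(pull), 99.9), pull_threshold)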

dmargala commented 4 months ago

Okay, I think I've traced the discrepancy between gpu and cpu down to the regularization term in the extraction, which was changed in the gpu version only, here: https://github.com/desihub/gpu_specter/commit/ca1612147f5adcbbe2d7413a601e0514d75f5376 . In the gpu version the regularization term is squared (I don't think the extra 1e-15 term matters much here).

specter reference: https://github.com/desihub/specter/blob/main/py/specter/extract/ex2d.py#L363-L382
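
Roughly speaking, the difference is whether the ridge added to the diagonal of the deconvolution normal matrix is regularize or regularize**2; an illustrative toy sketch (made-up variable names, not the actual gpu_specter or specter code) of why that matters for small regularization values:

    import numpy as np

    def add_ridge(iCov, regularize, squared=False):
        """Add a regularization term to the diagonal of the normal matrix (illustrative only)."""
        ridge = regularize**2 if squared else regularize
        return iCov + ridge * np.eye(iCov.shape[0])

    #- toy example: for regularize < 1, squaring makes the ridge much weaker
    iCov = np.diag([4.0, 1e-10, 2.0])  # nearly singular toy normal matrix
    print(np.linalg.cond(add_ridge(iCov, 1e-4, squared=False)))
    print(np.linalg.cond(add_ridge(iCov, 1e-4, squared=True)))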

sbailey commented 3 months ago

@dmargala good debugging. I updated both extract.cpu.deconvolve and extract.both.xp_deconvolve to match extract.gpu.deconvolve and now all tests pass again (at NERSC) with the updated fuji-based PSF.