dmargala closed this 3 months ago.
Thanks for the cleanup. I hadn't realized that tests were being skipped after old cascades production files were removed to free disk space. Rather than switching to daily (which might also change someday), let's switch the tests to use an exposure in the public Early Data Release (EDR) Fuji prod. I tried switching to
nerscdir = '/global/cfs/cdirs/desi/public/edr/spectro/redux/fuji'
nerscurl = 'https://data.desi.lbl.gov/public/edr/spectro/redux/fuji'
night = 20201214
expid = 67784
and updating the paths to use e.g. {night}/{expid:08d}/..., but that particular exposure failed TestCore.test_compare_gpu with
...
> np.testing.assert_array_less(np.abs(pull), pull_threshold)
E AssertionError:
E Arrays are not less-ordered
E
E Mismatched elements: 21 / 46520 (0.0451%)
E Max absolute difference: 0.03594679
E Max relative difference: 71.89358866
E x: array([[1.031810e-12, 5.475983e-13, 1.303668e-13, ..., 2.794550e-12,
E 2.557298e-12, 1.272082e-12],
E [0.000000e+00, 0.000000e+00, 0.000000e+00, ..., 4.972220e-12,...
E y: array(0.0005)
py/gpu_specter/test/test_core.py:247: AssertionError
My uncommitted changes are in /global/common/software/desi/users/sjbailey/gpu_specter.
Please take a look at that test and see if this is a fundamental problem, or a bad luck choice of exposure with harmless differences, and update the tests to use something in Fuji with fully public paths and URLs.
Other changes in this PR look good.
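As a concrete sketch of the {night}/{expid:08d} path templating mentioned above (the camera and filename pattern here are just illustrative examples taken from elsewhere in this thread):

```python
# Minimal sketch of building a fully public Fuji path from night/expid.
# The psf-r0 filename is one example file; :08d zero-pads expid to 8 digits.
night = 20201214
expid = 67784
nerscdir = '/global/cfs/cdirs/desi/public/edr/spectro/redux/fuji'
path = f'{nerscdir}/exposures/{night}/{expid:08d}/psf-r0-{expid:08d}.fits'
print(path)
```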
Feature creep: could you also try to silence warnings like
py/gpu_specter/test/test_spots.py::TestPSFSpots::test_compare_gpu
/global/common/software/desi/perlmutter/desiconda/20230111-2.1.0/conda/lib/python3.10/site-packages/numba/cuda/dispatcher.py:488: NumbaPerformanceWarning: Grid size 18 will likely result in GPU under-utilization due to low occupancy.
warn(NumbaPerformanceWarning(msg))
I'll take a look at that test failure with the new exposure. We have a pretty strict pull threshold in that test since it is comparing cpu / gpu code paths within gpu_specter (some tests that compare specter / gpu_specter have looser thresholds).
Re noisy NumbaPerformanceWarning, in #71 I tried to disable those warnings for typical usage but it seems pytest ignores that?
I'm not sure of the best way to silence those for pytests. I imagine we don't want to silence warnings too aggressively.
It seems like pytest can be configured to ignore specific warnings by adding something to a pytest.ini file (or a pyproject.toml with slightly different syntax):
[pytest]
filterwarnings =
ignore::numba.core.errors.NumbaPerformanceWarning
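For reference, the equivalent pyproject.toml configuration (pytest reads the `[tool.pytest.ini_options]` table) would be roughly:

```toml
[tool.pytest.ini_options]
filterwarnings = [
    "ignore::numba.core.errors.NumbaPerformanceWarning",
]
```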
Do you know if there is desihub precedent for this?
@dmargala I just re-discovered this old PR.
I added the pytest.ini to silence numba.core.errors.NumbaPerformanceWarning. I see there's one NumbaDeprecationWarning still but it's coming from specter so I don't think we should do anything about it here.
...lib/python3.11/site-packages/specter/util/util.py:219: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
@numba.jit
Regarding the test failure with fuji night=20201214 expid=67784: it looks like some extreme flux values are tripping things up, but I'm not sure where the discrepancy is coming from.
import os
import numpy as np
import matplotlib.pyplot as plt
from gpu_specter.io import read_img, read_psf, read_frame
from gpu_specter.core import extract_frame
from gpu_specter.test.util import find_test_file
desi_spectro_redux = '/global/cfs/cdirs/desi/public/edr/spectro/redux'
frame_path = 'fuji/exposures/20201214/00067784/frame-r0-00067784.fits'
psf_path = 'fuji/exposures/20201214/00067784/psf-r0-00067784.fits'
image_path = 'fuji/preproc/20201214/00067784/preproc-r0-00067784.fits'
framedata = read_frame(os.path.join(desi_spectro_redux, frame_path))
psfdata = read_psf(os.path.join(desi_spectro_redux, psf_path))
imgdata = read_img(os.path.join(desi_spectro_redux, image_path))
rank = 0
size = 1
comm = None
bundlesize = 10
wavelength = '5760.0,7620.0,0.8'
specmin = 0
nspec = 20
nwavestep = 50
nsubbundles = 2
frame_cpu = extract_frame(
imgdata, psfdata, bundlesize,
specmin, nspec,
wavelength=wavelength,
nwavestep=nwavestep, nsubbundles=nsubbundles,
comm=comm,
gpu=None,
loglevel='WARN',
)
frame_gpu = extract_frame(
imgdata, psfdata, bundlesize,
specmin, nspec,
wavelength=wavelength,
nwavestep=nwavestep, nsubbundles=nsubbundles,
comm=comm,
gpu=True,
loglevel='WARN',
batch_subbundle=False,
)
np.testing.assert_equal(frame_cpu['specflux'].shape, frame_gpu['specflux'].shape)
#- Compute pull (dflux/sigma) ignoring masked pixels
mask_cpu = (
(frame_cpu['specivar'] == 0) |
(frame_cpu['chi2pix'] > 100) |
(frame_cpu['pixmask_fraction'] > 0.5)
)
mask_gpu = (
(frame_gpu['specivar'] == 0) |
(frame_gpu['chi2pix'] > 100) |
(frame_gpu['pixmask_fraction'] > 0.5)
)
mask = mask_cpu | mask_gpu
var_cpu = np.reciprocal(~mask*frame_cpu['specivar'] + mask)
var_gpu = np.reciprocal(~mask*frame_gpu['specivar'] + mask)
ivar = np.reciprocal(~mask*(var_cpu + var_gpu) + mask)
dflux = frame_cpu['specflux'] - frame_gpu['specflux']
pull = ~mask*dflux*np.sqrt(ivar)
pull_threshold = 5e-4
#- these are the tests that fail
# np.testing.assert_array_less(np.abs(pull), pull_threshold)
# eps_double = np.finfo(np.float64).eps
# np.testing.assert_allclose(~mask*frame_cpu['specflux'], ~mask*frame_gpu['specflux'], rtol=1e-3, atol=0)
# np.testing.assert_allclose(~mask*frame_cpu['specivar'], ~mask*frame_gpu['specivar'], rtol=1e-3, atol=0)
# idiag = frame_cpu['Rdiags'].shape[1] // 2
# np.testing.assert_allclose(~mask*frame_cpu['Rdiags'][:, idiag, :], ~mask*frame_gpu['Rdiags'][:, idiag, :], rtol=1e-4)
fig, axes = plt.subplots(nrows=3, figsize=(8, 6), sharex=True)
ax = axes[0]
ax.plot(frame_cpu['wave'], np.abs(pull.T))
ax.axhline(pull_threshold, color='red', label='pull threshold')
ax.set_ylabel('pull')
ax.set_yscale('log')
ax.legend()
ax.set_title('pull cpu vs gpu extraction test')
ax.grid()
ax = axes[1]
ax.plot(framedata['wave'], np.abs(framedata['flux'])[specmin:nspec].T)
ax.set_ylabel('log abs flux')
ax.set_yscale('log')
ax.set_title('\n'.join([f'First {nspec} spectra', frame_path]))
ax.grid()
ax = axes[2]
ax.plot(framedata['wave'], np.abs(framedata['ivar']*framedata['flux'])[specmin:nspec].T)
ax.set_ylabel('ivar * flux')
ax.set_xlabel('wave')
ax.set_title('\n'.join([f'First {nspec} spectra', frame_path]))
ax.grid()
plt.tight_layout()
plt.show()
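As a side note, the masked-pull arithmetic in the script above can be checked on tiny synthetic arrays; the `~mask*x + mask` trick substitutes 1 for masked entries so `np.reciprocal` never divides by zero (the values below are made up purely for illustration):

```python
import numpy as np

# Synthetic stand-ins for the cpu/gpu extraction outputs (illustrative only).
flux_cpu = np.array([10.0, 5.0, 2.0, 0.0])
flux_gpu = np.array([10.0, 5.1, 2.0, 9.9])
ivar_cpu = np.array([4.0, 1.0, 0.0, 1.0])  # third element masked (ivar == 0)
ivar_gpu = np.array([4.0, 1.0, 0.0, 1.0])

mask = (ivar_cpu == 0) | (ivar_gpu == 0)

# Masked entries are replaced by 1 before the reciprocal, then zeroed at the end.
var_cpu = np.reciprocal(~mask*ivar_cpu + mask)
var_gpu = np.reciprocal(~mask*ivar_gpu + mask)
ivar = np.reciprocal(~mask*(var_cpu + var_gpu) + mask)
pull = ~mask*(flux_cpu - flux_gpu)*np.sqrt(ivar)
print(pull)
```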
Thanks for the test plots and code. Looking into the details further, the pull values out of threshold aren't on the spectra with large flux values themselves, but rather on the neighboring fibers, which are impacted by the high flux in the wings of the PSF. I tried a few other cameras and encountered similar problems on r1, r2, and r4, though camera r3 is ok.
At this point we are effectively committed to the GPU code as being good/consistent enough. We could alter the test rules to be something like the 99.9th percentile of the pull has to be within threshold instead of requiring all values to pass.
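A percentile-based version of the assertion could look roughly like this (a sketch with synthetic pull values; the 99.9th-percentile rule is the suggestion above, not existing test code):

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic pull values: mostly tiny, with a handful of large outliers,
# mimicking the ~0.05% of mismatched elements seen in the failing test.
pull = rng.normal(scale=1e-5, size=46520)
pull[:21] = 0.03  # a few large-pull outliers

pull_threshold = 5e-4
# Instead of requiring every element to pass, require the 99.9th percentile
# of |pull| to be within threshold, tolerating a small tail of outliers.
np.testing.assert_array_less(np.percentile(np.abs(pull), 99.9), pull_threshold)
```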
Okay, I think I've traced the discrepancy between gpu and cpu down to the regularization term in the extraction, which was changed in the gpu version only, here: https://github.com/desihub/gpu_specter/commit/ca1612147f5adcbbe2d7413a601e0514d75f5376 . In the gpu version, the regularization term is being squared (I don't think the extra 1e-15 term matters much here).
specter reference: https://github.com/desihub/specter/blob/main/py/specter/extract/ex2d.py#L363-L382
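To illustrate the effect (a toy ridge-regularized least-squares sketch, not the actual gpu_specter deconvolve code): adding `regularize` versus `regularize**2` to the diagonal of the weighted normal equations gives slightly different solutions, which is enough to trip a tight cpu/gpu pull threshold:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 10))      # stand-in projection matrix
ivar = rng.uniform(0.5, 2.0, 50)   # stand-in pixel inverse variances
b = A @ rng.normal(size=10) + rng.normal(scale=0.1, size=50)

ATNA = A.T @ (ivar[:, None] * A)   # A^T N^-1 A
ATNb = A.T @ (ivar * b)

regularize = 1e-2
# cpu-style: add `regularize` to the diagonal
x_lin = np.linalg.solve(ATNA + regularize * np.eye(10), ATNb)
# gpu-style (after commit ca16121): add `regularize**2` instead
x_sq = np.linalg.solve(ATNA + regularize**2 * np.eye(10), ATNb)

# The two solutions agree closely but not exactly.
print(np.max(np.abs(x_lin - x_sq)))
```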
@dmargala good debugging. I updated both extract.cpu.deconvolve and extract.both.xp_deconvolve to match extract.gpu.deconvolve, and now all tests pass again (at NERSC) with the updated fuji-based PSF.
This PR fixes a number of issues related to changes and deprecations in dependent packages.
851b4b4 affects running the unit tests at NERSC, since those try to use a preproc input file from the "cascades" reduction. That reduction is no longer available, so I changed the path to the only other location where I could find that exposure, which is in the "daily" reductions.
Each commit fixes a distinct issue: