desihub / gpu_specter

Scratch work for porting spectroperfectionism extractions to GPUs
BSD 3-Clause "New" or "Revised" License

TestCore and TestExtract failures on Perlmutter #77

Closed marcelo-alvarez closed 1 year ago

marcelo-alvarez commented 2 years ago

The current main branch fails unit tests on an interactive node of Perlmutter, in particular gpu_specter.test.test_core.TestCore and gpu_specter.test.test_extract.TestExtract.

@dmargala do you know whether this is an actual failure in gpu_specter, or rather a problem with the way the unit tests are set up?

Commands to reproduce failures on Perlmutter:

% salloc --qos=interactive -N 1 --time=60 -C gpu -A desi --gpus-per-node=4
% source /global/common/software/desi/desi_environment.sh main
% cd $PSCRATCH; mkdir -p tmp; cd tmp
% git clone https://github.com/desihub/gpu_specter
% cd gpu_specter
% python setup.py test
.
.
.
test_compare_specter (gpu_specter.test.test_spots.TestPSFSpots) ... ok

======================================================================
FAIL: test_compare_gpu (gpu_specter.test.test_core.TestCore)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/global/cfs/cdirs/desi/users/malvarez/unit_tests/gpu_specter/py/gpu_specter/test/test_core.py", line 233, in test_compare_gpu
    self.assertTrue(np.alltrue(np.abs(pull) < pull_threshold))
AssertionError: False is not true

======================================================================
FAIL: test_compare_batch_extraction (gpu_specter.test.test_extract.TestExtract)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/global/cfs/cdirs/desi/users/malvarez/unit_tests/gpu_specter/py/gpu_specter/test/test_extract.py", line 340, in test_compare_batch_extraction
    np.testing.assert_allclose(flux0, flux1, rtol=1e5*eps_double, atol=0, err_msg=f"where: {where}")
  File "/global/common/software/desi/perlmutter/desiconda/20220119-2.0.1/conda/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 1530, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/global/common/software/desi/perlmutter/desiconda/20220119-2.0.1/conda/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 844, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=2.22045e-11, atol=0
where: (array([1]), array([2]))
Mismatched elements: 1 / 250 (0.4%)
Max absolute difference: 7.03266778e-10
Max relative difference: 2.40644863e-11
 x: array([[ 1.087746e+02,  9.138172e+01,  1.039367e+02,  1.117145e+02,
         1.130559e+02,  1.025442e+02,  9.267082e+01,  1.078135e+02,
         9.409715e+01,  2.029707e+02,  3.838210e+02,  1.821220e+02,...
 y: array([[ 1.087746e+02,  9.138172e+01,  1.039367e+02,  1.117145e+02,
         1.130559e+02,  1.025442e+02,  9.267082e+01,  1.078135e+02,
         9.409715e+01,  2.029707e+02,  3.838210e+02,  1.821220e+02,...

----------------------------------------------------------------------
Ran 32 tests in 57.178s

FAILED (failures=2)
Test failed: <unittest.runner.TextTestResult run=32 errors=0 failures=2>
error: Test failed: <unittest.runner.TextTestResult run=32 errors=0 failures=2>

sbailey commented 1 year ago

@dmargala do you have cycles to look at this? We're perhaps being too picky in how close the arrays need to match to pass the test, but presumably they passed when the tests were originally written, so it is worth a little thought and comparison to other studies where we signed off on gpu_specter as being close enough to use in production.

dmargala commented 1 year ago

@sbailey I'll take a look today. There's not too much info in the output of the first error to go on. The second failure definitely looks like it might be too picky:

np.testing.assert_allclose(flux0, flux1, rtol=1e5*eps_double, atol=0, err_msg=f"where: {where}")
...
Not equal to tolerance rtol=2.22045e-11, atol=0
where: (array([1]), array([2]))
Mismatched elements: 1 / 250 (0.4%)
Max absolute difference: 7.03266778e-10
Max relative difference: 2.40644863e-11
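
For reference, a minimal sketch of the arithmetic behind that tolerance, using only the numbers in the output above. The `desired`/`actual` pair and the `passes` helper are hypothetical stand-ins rather than code or data from the test, and the relaxed factor of 1e6 is an illustrative assumption, not necessarily what any fix uses.

```python
import numpy as np

# Machine epsilon for float64; the test sets rtol to 1e5 times this.
eps_double = np.finfo(np.float64).eps
rtol = 1e5 * eps_double

# Max relative difference reported in the failure output above.
max_rel_diff = 2.40644863e-11

print(f"rtol          = {rtol:.5e}")
print(f"max rel. diff = {max_rel_diff:.5e}")
print(f"ratio         = {max_rel_diff / rtol:.3f}")

# Hypothetical values that differ by that relative amount
# (not the actual flux values from the test).
desired = np.array([100.0])
actual = desired * (1.0 + max_rel_diff)


def passes(actual, desired, rtol):
    """Return True if np.testing.assert_allclose accepts the pair at this rtol."""
    try:
        # With atol=0 the check is |actual - desired| <= rtol * |desired|.
        np.testing.assert_allclose(actual, desired, rtol=rtol, atol=0)
    except AssertionError:
        return False
    return True


strict_ok = passes(actual, desired, rtol)                # tolerance used by the test
relaxed_ok = passes(actual, desired, 1e6 * eps_double)   # assumed looser tolerance
print(f"passes at 1e5*eps: {strict_ok}, passes at 1e6*eps: {relaxed_ok}")
```

The single mismatched element exceeds the tolerance by less than 10%, which is consistent with the threshold being slightly too tight rather than with a large numerical discrepancy.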

sbailey commented 1 year ago

Fixed by PR #79. Closing.