Closed dmargala closed 1 year ago
@dmargala thanks. The changes look good (though I haven't tried actually running them yet). The test failures are due to latest numpy dropping "asscalar" which is still used by astropy 5.0. See https://github.com/desihub/desispec/commit/19630cbdc766a602d1fd957d44cc6dc9424d8123 for an example test config update to pin the numpy version.
The code test I would (will?) do is run a GPU extraction with main and this branch and verify that in the output file, and verify that the FLUX, IVAR, WAVELENGTH, RESOLUTION, FIBERMAP HDUs are bitwise identical, that CHI2PIX did change, and spot check the chi2pix-related MASK. I'm guessing that changed masking would also change SCORES but I'd have to check more closely to know exactly what pieces.
If you have already done that, please point us to before/after files at NERSC.
I generated the following two files with gpu_specter from main desi environment and this branch:
/global/cfs/cdirs/desi/users/dmargala/frame-z5-00128680-main.fits.gz
/global/cfs/cdirs/desi/users/dmargala/frame-z5-00128680-xflux-chi2pix.fits.gz
here is code test I ran (desi_extract_spectra command based on the one @marcelo-alvarez shared with me in slack):
# checkout this branch of gpu_specter
git clone git@github.com:desihub/gpu_specter.git
cd gpu_specter
git checkout dmargala/use-xflux-for-chi2pix
# start an interactive session on a perlmutter gpu node
salloc -A nstaff -C gpu -N 1 -q interactive -t 30
# extract frame using gpu_specter from desi environment
source /global/common/software/desi/desi_environment.sh main
srun -n 20 -c 2 --cpu-bind=cores -G 4 desi_mps_wrapper desi_extract_spectra -w 7520.0,9824.0,0.8 -i /global/cfs/cdirs/desi/users/malvarez/spectro/redux/perlval-perlmutter-gpu/preproc/20220404/00128680/preproc-z5-00128680.fits.gz -p /global/cfs/cdirs/desi/users/malvarez/spectro/redux/perlval-perlmutter-gpu/exposures/20220404/00128680/psf-z5-00128680.fits -o /global/cfs/cdirs/desi/users/dmargala/frame-z5-00128680-main.fits.gz --psferr 0.1 --gpu-specter --nsubbundles 5 --mpi --use-gpu --barycentric-correction
# extract frame using this branch of gpu_specter
module unload gpu_specter
export PYTHONPATH=$(pwd)/py:$PYTHONPATH
export PATH=$(pwd)/bin:$PATH
srun -n 20 -c 2 --cpu-bind=cores -G 4 desi_mps_wrapper desi_extract_spectra -w 7520.0,9824.0,0.8 -i /global/cfs/cdirs/desi/users/malvarez/spectro/redux/perlval-perlmutter-gpu/preproc/20220404/00128680/preproc-z5-00128680.fits.gz -p /global/cfs/cdirs/desi/users/malvarez/spectro/redux/perlval-perlmutter-gpu/exposures/20220404/00128680/psf-z5-00128680.fits -o /global/cfs/cdirs/desi/users/dmargala/frame-z5-00128680-xflux-chi2pix.fits.gz --psferr 0.1 --gpu-specter --nsubbundles 5 --mpi --use-gpu --barycentric-correction
Here is the fitsdiff output:
dmargala@nid001640:~/source/desihub/gpu_specter> fitsdiff /global/cfs/cdirs/desi/users/dmargala/frame-z5-00128680-xflux-chi2pix.fits.gz /global/cfs/cdirs/desi/users/dmargala/frame-z5-00128680-main.fits.gz
fitsdiff: 5.0
a: /global/cfs/cdirs/desi/users/dmargala/frame-z5-00128680-xflux-chi2pix.fits.gz
b: /global/cfs/cdirs/desi/users/dmargala/frame-z5-00128680-main.fits.gz
Maximum number of different data values to be reported: 10
Relative tolerance: 0.0, Absolute tolerance: 0.0
Primary HDU:
Headers contain differences:
Keyword CHECKSUM has different values:
a> HRB3JQ92HQA2HQ72
b> IQD2KQ92IQA2IQ92
Keyword CHECKSUM has different comments:
a> HDU checksum updated 2022-09-19T16:47:55
? ^ ^^
b> HDU checksum updated 2022-09-19T16:45:36
? ^ ^^
Keyword DATASUM has different comments:
a> data unit checksum updated 2022-09-19T16:47:55
? ^ ^^
b> data unit checksum updated 2022-09-19T16:45:36
? ^ ^^
Extension HDU 1 (IVAR, 1):
Headers contain differences:
Keyword CHECKSUM has different values:
a> CAp6E4m6C9m6C9m6
b> DAq6F4n6D9n6D9n6
Keyword CHECKSUM has different comments:
a> HDU checksum updated 2022-09-19T16:47:56
? ^ ^
b> HDU checksum updated 2022-09-19T16:45:36
? ^ ^
Keyword DATASUM has different comments:
a> data unit checksum updated 2022-09-19T16:47:56
? ^ ^
b> data unit checksum updated 2022-09-19T16:45:36
? ^ ^
Extension HDU 2 (MASK, 1):
Headers contain differences:
Keyword CHECKSUM has different values:
a> 6EiH7BZF6BfF6BZF
b> 6Dhm8AZl6Afl6AZl
Keyword CHECKSUM has different comments:
a> HDU checksum updated 2022-09-19T16:47:56
? ^ ^
b> HDU checksum updated 2022-09-19T16:45:36
? ^ ^
Keyword DATASUM has different values:
a> 755007
b> 779839
Keyword DATASUM has different comments:
a> data unit checksum updated 2022-09-19T16:47:56
? ^ ^
b> data unit checksum updated 2022-09-19T16:45:36
? ^ ^
Data contains differences:
Data differs at [1429, 11]:
a> 0
b> 128
Data differs at [1838, 35]:
a> 0
b> 128
Data differs at [175, 91]:
a> 0
b> 128
Data differs at [176, 91]:
a> 1
b> 129
Data differs at [177, 91]:
a> 1
b> 129
Data differs at [178, 91]:
a> 1
b> 129
Data differs at [1145, 96]:
a> 129
b> 1
Data differs at [1146, 96]:
a> 129
b> 1
Data differs at [2167, 98]:
a> 1
b> 129
Data differs at [2171, 98]:
a> 0
b> 128
...
200 different pixels found (0.01% different).
Extension HDU 3 (WAVELENGTH, 1):
Headers contain differences:
Keyword CHECKSUM has different values:
a> hbaThaXRhaaRhaWR
b> ibaTiaZRiaaRiaYR
Keyword CHECKSUM has different comments:
a> HDU checksum updated 2022-09-19T16:47:56
? ^ ^
b> HDU checksum updated 2022-09-19T16:45:36
? ^ ^
Keyword DATASUM has different comments:
a> data unit checksum updated 2022-09-19T16:47:56
? ^ ^
b> data unit checksum updated 2022-09-19T16:45:36
? ^ ^
Extension HDU 4 (RESOLUTION, 1):
Headers contain differences:
Keyword CHECKSUM has different values:
a> QODJTMD9QMDGQMD9
b> RNEIUME9RMEGRME9
Keyword CHECKSUM has different comments:
a> HDU checksum updated 2022-09-19T16:47:56
? ---
b> HDU checksum updated 2022-09-19T16:45:37
? +++
Keyword DATASUM has different comments:
a> data unit checksum updated 2022-09-19T16:47:56
? ---
b> data unit checksum updated 2022-09-19T16:45:37
? +++
Extension HDU 5 (FIBERMAP, 1):
Headers contain differences:
Keyword CHECKSUM has different values:
a> 0a2a2a2W0a2a0a2W
b> 1a3a3a3W1a3a1a3W
Keyword CHECKSUM has different comments:
a> HDU checksum updated 2022-09-19T16:47:59
? ^ ^
b> HDU checksum updated 2022-09-19T16:45:39
? ^ ^
Keyword DATASUM has different comments:
a> data unit checksum updated 2022-09-19T16:47:59
? ^ ^
b> data unit checksum updated 2022-09-19T16:45:39
? ^ ^
Extension HDU 6 (CHI2PIX, 1):
Headers contain differences:
Keyword CHECKSUM has different values:
a> 3Aed61ec38ec38ec
b> BWkAEVh7BVhABVh7
Keyword CHECKSUM has different comments:
a> HDU checksum updated 2022-09-19T16:47:59
? ^ ^
b> HDU checksum updated 2022-09-19T16:45:39
? ^ ^
Keyword DATASUM has different values:
a> 4254562950
b> 2207126867
Keyword DATASUM has different comments:
a> data unit checksum updated 2022-09-19T16:47:59
? ^ ^
b> data unit checksum updated 2022-09-19T16:45:39
? ^ ^
Data contains differences:
Data differs at [1, 1]:
a> 1.5359675
b> 2.1983817
Data differs at [2, 1]:
a> 1.6271236
b> 1.8243557
Data differs at [3, 1]:
a> 1.4719399
b> 1.7377713
Data differs at [4, 1]:
a> 1.097299
b> 1.4232846
Data differs at [5, 1]:
a> 0.98081577
b> 1.3819599
Data differs at [6, 1]:
a> 0.7840894
b> 1.6679285
Data differs at [7, 1]:
a> 0.586903
b> 2.0428832
Data differs at [8, 1]:
a> 0.42576614
b> 3.7747188
Data differs at [9, 1]:
a> 0.28300866
b> 5.3522816
Data differs at [10, 1]:
a> 0.31326902
b> 4.3887677
...
1440500 different pixels found (100.00% different).
Extension HDU 7 (SCORES, 1):
Headers contain differences:
Keyword CHECKSUM has different values:
a> Eh8LFe5KEe5KEe5K
b> Fh9LGe6KFe6KFe6K
Keyword CHECKSUM has different comments:
a> HDU checksum updated 2022-09-19T16:47:59
? ^ ^
b> HDU checksum updated 2022-09-19T16:45:39
? ^ ^
Keyword DATASUM has different comments:
a> data unit checksum updated 2022-09-19T16:47:59
? ^ ^
b> data unit checksum updated 2022-09-19T16:45:39
? ^ ^
@dmargala thanks for this. I have compared both of the frame files from your test ("gp76-main" and "gp76-xflux-chi2pix"), as well as the original run I did on perlmutter GPUs with main for the same exposure ("pgpu"), to the same frame processed with CPUs on cori ("cori", which is nearly identical to that with CPUs on perlmutter).
As expected, "pgpu" and "gp76-main" agree, showing consistency between our use of gpu_specter with the main branch. Your new changes result in a chi2pix that is indeed different, but there is still at least one bin for which chi2pix>100 on cori, but not with gpu_specter.
Given this passed the tests @sbailey suggested, should we go ahead and merge?
Thanks, yes, let's merge this now as a step in the right direction and then continue debugging why gpu_specter and specter don't have closer agreement in chi2pix given that they do have close agreement in reconvolved flux and ivar.
Here is a PR to use the unconvolved extracted flux (
xflux
) instead of the reconvolved flux (flux = R * xflux
) in the calculation ofchi2pix
(see #75 for details).Summary of changes:
xflux
to the tuple returned byex2d_patch
xflux
in calculation ofchi2pix
instead offlux
ex2d_patch
throughout gpu_specter to handle change to return valuescc: @sbailey @marcelo-alvarez