desihub / redrock

Redshift fitting for spectroperfectionism
BSD 3-Clause "New" or "Revised" License
21 stars 13 forks source link

Archetype failure due to key error in trans dict and failure in deg-leg 0 case #278

Closed abhi0395 closed 2 months ago

abhi0395 commented 5 months ago

This pull request solves two separate issues on the main branch.

1) For some targetids the archetype mode fails citing keyerror in the trans dictionary. 2) The main branch fails if we want to run the archetype method without Legendre polynomials.

Example run for case (1) on `main' branch:

rrdesi -i /global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/6786/20240117/coadd-4-6786-thru20240117.fits -o arch_test-39628278872412904.fits -d arch_39628278872412904.h5 --archetypes /global/homes/a/abhijeet/software/desisoft/new-archetypes/rrarchetype-galaxy.fits --targetids 39628278872412904

fails citing following error:

MP calc_zchi2: Traceback (most recent call last):
MP calc_zchi2:   File "/global/homes/a/abhijeet/software/desisoft/redrock/py/redrock/zfind.py", line 90, in _mp_fitz
    zfit = fitz(chi2[i], t.template.redshifts, tg,
MP calc_zchi2:   File "/global/homes/a/abhijeet/software/desisoft/redrock/py/redrock/fitz.py", line 338, in fitz
    chi2min, coeff, fulltype = archetype.get_best_archetype(target,weights,flux,wflux,dwave,zbest, per_camera, n_nearest, trans=trans, use_gpu=use_gpu, prior=prior)
MP calc_zchi2:   File "/global/homes/a/abhijeet/software/desisoft/redrock/py/redrock/archetypes.py", line 321, in get_best_archetype
    if (trans[hs] is not None):
MP calc_zchi2: KeyError: -5909694164499214244

But when run on archetype_leg0_and_trans_fix, it succeeds.

Example run for case (2):

rrdesi -i /global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/11472/20240116/coadd-0-11472-thru20240116.fits -o /global/cfs/cdirs/desi/users/abhijeet/test_archetype_runs/archetype_leg0_and_trans_fix/redrock_test-39628128854739056.fits -d /global/cfs/cdirs/desi/users/abhijeet/test_archetype_runs/archetype_leg0_and_trans_fix/redrock_39628128854739056.h5 --targetids 39628128854739056 --archetypes /global/homes/a/abhijeet/software/desisoft/new-archetypes/rrarchetype-galaxy.fits

fails on `main' branch:

MP calc_zchi2:   File "/global/homes/a/abhijeet/software/desisoft/redrock/py/redrock/archetypes.py", line 334, in get_best_archetype
    (zzchi2, zzcoeff) = per_camera_coeff_with_least_square_batch(target, tdata, weights, flux, wflux, nleg, self._narch, method=solve_method, n_nbh=1, prior=prior, use_gpu=use_gpu, ncam=ncam)
MP calc_zchi2:   File "/global/homes/a/abhijeet/software/desisoft/redrock/py/redrock/zscan.py", line 329, in per_camera_coeff_with_least_square_batch
    coeff = np.concatenate(list(ret_zcoeff.values()), axis=1)
MP calc_zchi2:   File "<__array_function__ internals>", line 180, in concatenate
MP calc_zchi2: ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 2 dimension(s) and the array at index 1 has 1 dimension(s)

But succeeds on archetype_leg0_and_trans_fix branch.

Sanity checks

Comparing results from main and this branch:

- Archetype mode (main):

srun -n 4 -c 4 --gpu-bind=map_gpu:3,2,1,0 rrdesi_mpi --gpu --max-gpuprocs 4 -i rrdesi_mpi -i /global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/11472/20240116/coadd-0-11472-thru20240116.fits -o /global/cfs/cdirs/desi/users/abhijeet/test_archetype_runs/main/arch_redrock-0-11472-thru20240116.fits -d /global/cfs/cdirs/desi/users/abhijeet/test_archetype_runs/main/arch_redrock-0-11472-thru20240116.fits --archetypes /global/homes/a/abhijeet/software/desisoft/new-archetypes/rrarchetype-galaxy.fits

- Archetype mode (this branch):

srun -n 4 -c 4 --gpu-bind=map_gpu:3,2,1,0 rrdesi_mpi --gpu --max-gpuprocs 4 -i rrdesi_mpi -i /global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/11472/20240116/coadd-0-11472-thru20240116.fits -o /global/cfs/cdirs/desi/users/abhijeet/test_archetype_runs/archetype_leg0_and_trans_fix/arch_redrock-0-11472-thru20240116.fits -d /global/cfs/cdirs/desi/users/abhijeet/test_archetype_runs/archetype_leg0_and_trans_fix/arch_redrock-0-11472-thru20240116.fits --archetypes /global/homes/a/abhijeet/software/desisoft/new-archetypes/rrarchetype-galaxy.fits

- Comparison:

file1 = f'/global/cfs/cdirs/desi/users/abhijeet/test_archetype_runs/main/arch_redrock-0-11472-thru20240116.fits'
file2 = f'/global/cfs/cdirs/desi/users/abhijeet/test_archetype_runs/archetype_leg0_and_trans_fix/arch_redrock-0-11472-thru20240116.fits'
tt1 = Table.read(file1, hdu=1)
tt2 = Table.read(file2, hdu=1)
np.count_nonzero(tt1['Z']-tt2['Z']

- Non archteype mode (main):

srun -n 4 -c 4 --gpu-bind=map_gpu:3,2,1,0 rrdesi_mpi --gpu --max-gpuprocs 4 -i rrdesi_mpi -i /global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/11472/20240116/coadd-0-11472-thru20240116.fits -o /global/cfs/cdirs/desi/users/abhijeet/test_archetype_runs/main/redrock-0-11472-thru20240116.fits -d /global/cfs/cdirs/desi/users/abhijeet/test_archetype_runs/main/redrock-0-11472-thru20240116.fits

- Non archetype mode (this branch):

srun -n 4 -c 4 --gpu-bind=map_gpu:3,2,1,0 rrdesi_mpi --gpu --max-gpuprocs 4 -i rrdesi_mpi -i /global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/11472/20240116/coadd-0-11472-thru20240116.fits -o /global/cfs/cdirs/desi/users/abhijeet/test_archetype_runs/archetype_leg0_and_trans_fix/redrock-0-11472-thru20240116.fits -d /global/cfs/cdirs/desi/users/abhijeet/test_archetype_runs/archetype_leg0_and_trans_fix/redrock-0-11472-thru20240116.h5

file1 = f'/global/cfs/cdirs/desi/users/abhijeet/test_archetype_runs/main/redrock-0-11472-thru20240116.fits'
file2 = f'/global/cfs/cdirs/desi/users/abhijeet/test_archetype_runs/archetype_leg0_and_trans_fix/redrock-0-11472-thru20240116.fits'
tt1 = Table.read(file1, hdu=1)
tt2 = Table.read(file2, hdu=1)
np.count_nonzero(tt1['Z']-tt2['Z'])
coveralls commented 5 months ago

Coverage Status

coverage: 38.331% (-0.2%) from 38.566% when pulling 410de5fdcb09dc6f3ecec66a90144c2bbbad5979 on archetype_leg0_and_trans_fix into 100d5b3f043c54627c6998a6dfdf549380e964f2 on main.

sbailey commented 5 months ago

This looks good, though the fix to trans is so basic that I don't understand how archetypes ever worked without it. You mention "For some targetids the archetype mode fails..." which implies that for some it succeeds? Do you understand what is different about those other cases that succeeded? e.g. perhaps the successful cases had all of the minima z<2 so it never tried to access the IGM transmission?

I'm concerned that there is some other bookkeeping bug in addition to the one you fixed here, so I'd like to understand why the fix was so simple and what allowed the previously succeeding cases to work.

abhi0395 commented 5 months ago

I agree with you that the problem is not very well understood. I tried to dig into this more and here is what I found. The issue appears in the following part of the function fitz()

https://github.com/desihub/redrock/blob/100d5b3f043c54627c6998a6dfdf549380e964f2/py/redrock/fitz.py#L277C17-L297C26

                trans[k] = T
                if (T is None):
                    #Return value of None means that wavelenght regime
                    #does not overlap Lyman transmission - continue here
                    continue
                #Vectorize multiplication
                binned[k] *= T[:,:,None]
            #Use CPU always with one redshift
            (chi2, coeff) = calc_zchi2_batch(spectra, binned, weights, flux, wflux, 1, nbasis,
                                             solve_matrices_algorithm=template.solve_matrices_algorithm,
                                             use_gpu=False)
            coeff = coeff[0,:]
        except ValueError as err:
            if zmin<redshifts[0] or redshifts[-1]<zmin:
                #- beyond redshift range can be invalid for template
                coeff = np.zeros(template.nbasis)
                zwarn |= ZW.Z_FITLIMIT
                zwarn |= ZW.BAD_MINFIT
            else:
                #- Unknown problem; re-raise error
                raise err

The issue is that in case of ValueError, we do not deal with trans dictionary at all, which is passed in archetypes.get_best_archetype() in archetype mode. In case of PCA, it never fails because

(chi2, coeff) = calc_zchi2_batch(spectra, binned, weights, flux, wflux, 1, nbasis,
                                             solve_matrices_algorithm=template.solve_matrices_algorithm,
                                             use_gpu=False)

is never run in the case of ValueError.

for the following failure case:

rrdesi -i /global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/11472/20240116/coadd-0-11472-thru20240116.fits -o redrock_test-39628128854739056.fits -d redrock_39628128854739056.h5 --targetids 39628128854739056 --archetypes /global/homes/a/abhijeet/software/desisoft/new-archetypes/rrarchetype-galaxy.fits

I just printed

print(zz[i-1:i+2], zzchi2[i-1:i+2], zmin, redshifts[0], redshifts[-1])
[0.5109059  0.51105493 0.51120395] [7877.36377944 7870.93398968 7864.50311703] -0.3738767086571399 -0.0050000000000000044 1.6997470838899442

It appears that parabolic fit fails for this chi2 vs redshift and zmin is negative, therefore a ZWARN is raised and stored in PCA mode without modifying the trans dictionary. But in archetype, we pass all these redshifts and corresponding trans dictionary for the final fit and therefore it fails.