desihub / redrock

Redshift fitting for spectroperfectionism
BSD 3-Clause "New" or "Revised" License

Legendre nnls #287

Open craigwarner-ufastro opened 3 months ago

craigwarner-ufastro commented 3 months ago

Non-negative least squares (NNLS) is significantly faster than the general-case bounded-value least squares (BVLS). For archetypes, BVLS had been used: the archetype coefficient was bounded to [0, inf) while all of the Legendre coefficients were allowed to float freely. By adding extra terms to our basis that are the negatives of each Legendre term, we can use NNLS instead. E.g., with negative copies v3 = -v1 and v4 = -v2, the solution can have c1 > 0, c3 = 0 for a net positive coefficient, and the opposite for a net negative one.
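A minimal sketch of the negative-copy trick (not redrock's actual code; `scipy.optimize.nnls` stands in for the batched GPU solver): append `-A` to the basis, solve with NNLS, and recover each signed coefficient as the difference of its positive and negative parts.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(42)
nobs, nbasis = 50, 3
A = rng.normal(size=(nobs, nbasis))      # basis vectors as columns
x_true = np.array([1.5, -0.7, 0.3])      # note the negative coefficient
b = A @ x_true

# NNLS alone cannot recover x_true because of the negative coefficient,
# so augment the basis with negated copies of each column: [A, -A].
A_aug = np.hstack([A, -A])
coef, _ = nnls(A_aug, b)

# Fold the +/- pairs back into signed coefficients: c = c_plus - c_minus.
x = coef[:nbasis] - coef[nbasis:]
assert np.allclose(x, x_true)
```

Because the noiseless system has full column rank, the nonnegative split is free to vary but the recovered difference `c_plus - c_minus` is unique.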

This also required modifying the prior array that is added to the M vector. The NNLS results now agree with the original BVLS solution to within np.allclose, with a significant speed gain: the "Finding best redshift" section takes 45s on 4 GPUs + 4 CPUs versus 99s using BVLS. However, the speed gain only applies when using GPUs.

When using 64 CPUs, the additional negative Legendre terms + NNLS take 113s for this section versus 74s for BVLS. The cost of array operations such as dot products on the larger arrays with the added terms overwhelms the speed gain of NNLS over BVLS, so in CPU mode we do NOT use the NNLS method and keep the original BVLS.

coveralls commented 3 months ago

Coverage Status

coverage: 37.979% (-0.6%) from 38.533% when pulling fd7661725bab695cf836851360f3614c298694dd on legendre_nnls into 028848f75d1f981f4ec1f74b112147b495c20d82 on main.

craigwarner-ufastro commented 3 months ago

Note: This also includes a bug fix related to #278

sbailey commented 2 months ago

@craigwarner-ufastro thanks for this update and apologies for the late review.

First of all, a bug:

srun -n 4 --gpu-bind=map_gpu:3,2,1,0 --cpu-bind=cores rrdesi_mpi --gpu \
  -i $CFS/desi/spectro/redux/iron/tiles/cumulative/80605/20210205/coadd-6-80605-thru20210205.fits \
  --archetypes /global/cfs/cdirs/desi/users/abhijeet/new-archetypes/rrarchetype-galaxy.fits \
  -o $SCRATCH/redrock/redrock-legnnls.fits
...
--- Process 0 raised an exception ---
Proc 0: Traceback (most recent call last):
Proc 0:   File "/global/common/software/desi/users/sjbailey/redrock/py/redrock/external/desi.py", line 916, in rrdesi
    scandata, zfit = zfind(targets, dtemplates, mpprocs,
Proc 0:   File "/global/common/software/desi/users/sjbailey/redrock/py/redrock/zfind.py", line 602, in zfind
    if archetype._solver_method == 'bvls':
Proc 0: AttributeError: 'NoneType' object has no attribute '_solver_method'

MPICH Notice [Rank 0] [job id 24313752.5] [Fri Apr 12 09:37:16 2024] [nid002260] - Abort(0) (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 0) - process 0
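The traceback shows `zfind.py` dereferencing `archetype._solver_method` when `archetype` can be `None`. A sketch of the kind of guard that would avoid the crash (`uses_bvls` and `FakeArchetype` are illustrative names, not redrock code):

```python
def uses_bvls(archetype):
    """True when the archetype object exists and requests the BVLS solver."""
    return archetype is not None and archetype._solver_method == 'bvls'

class FakeArchetype:
    _solver_method = 'bvls'

assert uses_bvls(None) is False         # previously raised AttributeError
assert uses_bvls(FakeArchetype()) is True
```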

Second, this update required a lot of delicate bookkeeping about the number of legendre coefficients, setting up the prior, etc., impacting multiple functions in multiple files. I'm concerned about that from a maintenance perspective. It feels like this could/should be an internal detail of the solver, not something that requires the "user" to set up that negative/positive legendre trick themselves and then unpack the results. i.e. could you move all of this logic into per_camera_coeff_with_least_square_batch itself? If it receives method='bvls' and use_gpu=True, it would automatically switch to the +/- legendre nnls method, including padding the arrays and expanding the prior as needed. Can that be implemented with similar speed gains, while isolating all of the messy bookkeeping in a single location?

sbailey commented 2 months ago

Thanks for the updates, @craigwarner-ufastro . Please resolve the merge conflicts that have arisen here due to other merged PRs. If they aren't trivial to resolve in-place, please don't merge main into this branch, which would make the changes in this PR harder to evaluate mixed in with all the other changes. If needed, it's ok to make a new branch from this branch, rebase that off of main, and start a new PR. Ping me on Slack if that operation description is confusing.

Back to the updates: I'd like to take this one step further to avoid legendre-specific manipulations to tdata prior to calling the solver, e.g. in archetypes.nearest_neighbor_model:

                tdata[hs] = np.append(tdata[hs], legendre[hs].transpose()[None,:,:], axis=2)
                if per_camera and self._solver_method == 'bvls' and use_gpu:
                    #Using BVLS and GPU - append negative copies of legendre terms as well to use NNLS with additional bases to approximate BVLS
                    tdata[hs] = np.append(tdata[hs], -1*legendre[hs].transpose()[None,:,:], axis=2)

and in zfind.zfind:

        tot_deg_legendre = deg_legendre
        for key in archetypes:
            if archetypes[key]._solver_method == 'bvls':
                tot_deg_legendre = deg_legendre*2
...
        maxcoeff = max(np.max([t.template.nbasis for t in templates]), ncamera*(tot_deg_legendre)+n_nearest)

in the second case, I'm not even sure that is correct because we are reserving deg_legendre*2 coefficients but those extra 2x coefficients are for the BVLS -> NNLS trick and shouldn't be needed for the output arrays.

Suggestion: rather than manipulating tdata (template data) to add legendre terms and then calling per_camera_coeff_with_least_square_batch(... tdata, nleg=...), could you pass tdata and the normal legendre array, and let per_camera_coeff_with_least_square_batch derive nleg from the shape of legendre, doing the BVLS -> NNLS trick and tdata manipulation internally based upon method and use_gpu?

i.e. from a user-perspective, turn this into "if you want to include legendre terms, pass in the legendre template arrays to use" and all other details of how those are combined with tdata etc. become internal to per_camera_coeff_with_least_square_batch. Thoughts?
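A toy sketch of the proposed interface (hypothetical: the real per_camera_coeff_with_least_square_batch operates on batched, per-camera arrays on the GPU, and `scipy.optimize.nnls` stands in for the batched solvers): the caller passes tdata and the legendre array, and the function derives nleg and applies the negative-copy trick internally.

```python
import numpy as np
from scipy.optimize import nnls

def per_camera_coeff_with_least_square_batch(flux, tdata, legendre,
                                             method='bvls', use_gpu=False):
    """Toy version: solve flux ~ [tdata, legendre] coefficients.

    nleg is derived from legendre.shape rather than passed by the caller,
    and the +/- legendre NNLS trick is applied internally when
    method='bvls' and use_gpu=True.
    """
    nleg = legendre.shape[1]               # derived, not a caller argument
    basis = np.hstack([tdata, legendre])   # combine template + legendre bases

    if method == 'bvls' and use_gpu:
        # Internal BVLS -> NNLS trick: append negated legendre columns,
        # solve with NNLS, then fold the +/- pairs into signed coefficients.
        aug = np.hstack([basis, -legendre])
        c, _ = nnls(aug, flux)
        ntmpl = tdata.shape[1]
        coeff = np.concatenate([c[:ntmpl],
                                c[ntmpl:ntmpl + nleg] - c[ntmpl + nleg:]])
    else:
        coeff, *_ = np.linalg.lstsq(basis, flux, rcond=None)
    return coeff

# Tiny demonstration: both code paths recover the same signed coefficients.
rng = np.random.default_rng(0)
tdata = rng.normal(size=(40, 1))        # one archetype basis column
legendre = rng.normal(size=(40, 2))     # two legendre columns
x_true = np.array([2.0, -0.5, 0.8])     # archetype coeff >= 0, signed legendre
flux = np.hstack([tdata, legendre]) @ x_true

coeff_nnls = per_camera_coeff_with_least_square_batch(
    flux, tdata, legendre, method='bvls', use_gpu=True)
coeff_lsq = per_camera_coeff_with_least_square_batch(flux, tdata, legendre)
```

The point is that the output coefficient arrays keep the caller-visible size (nbasis + nleg); the 2x padding exists only inside the solver, which would also resolve the maxcoeff question above.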