Batch subbundle extraction on GPU

This PR makes more use of batched operations on the GPU by refactoring the extraction so that individual patches are processed as a stack. This is a little tricky because the patches on the CCD do not all have the same shape. Once the projection and weight matrices are applied, however, the patches do stack cleanly, so the heavy linear algebra operations (solve and eigh) can be performed in batch.
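The stacking idea can be sketched as follows. This is a minimal illustration, not the code in this PR: it uses NumPy on the CPU so it runs anywhere, and the names `batch_normal_equations`, `A_list`, `w_list`, and `y_list` are hypothetical. The assumption is that each patch has its own pixel count but the same number of fitted parameters, so the weighted normal-equation matrices all share one shape and can be stacked.

```python
import numpy as np

def batch_normal_equations(A_list, w_list, y_list):
    """Stack A^T W A and A^T W y for patches with differing pixel counts.

    Each A has shape (npix_i, nparam); npix_i varies per patch, nparam does not.
    """
    nparam = A_list[0].shape[1]
    ATA = np.empty((len(A_list), nparam, nparam))
    ATy = np.empty((len(A_list), nparam))
    for i, (A, w, y) in enumerate(zip(A_list, w_list, y_list)):
        AW = A.T * w          # equivalent to A.T @ diag(w), shape (nparam, npix_i)
        ATA[i] = AW @ A       # (nparam, nparam) regardless of npix_i
        ATy[i] = AW @ y       # (nparam,)
    return ATA, ATy

# Hypothetical patches: three different pixel counts, one parameter count.
rng = np.random.default_rng(0)
nparam = 6
npix = [40, 55, 48]
A_list = [rng.normal(size=(n, nparam)) for n in npix]
w_list = [rng.uniform(0.5, 2.0, size=n) for n in npix]
y_list = [rng.normal(size=n) for n in npix]

ATA, ATy = batch_normal_equations(A_list, w_list, y_list)

# One batched call each for the expensive steps.
flux = np.linalg.solve(ATA, ATy[..., None])[..., 0]   # shape (npatch, nparam)
evals, evecs = np.linalg.eigh(ATA)                    # shapes (npatch, nparam), (npatch, nparam, nparam)
```

The per-patch loop only builds the small square matrices; the solve and eigh calls then operate on the whole stack at once. CuPy provides `cupy.linalg.solve` and `cupy.linalg.eigh` with a similar interface, but whether the batched forms match this sketch on a given CuPy version is an assumption to check.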

The tables below show before and after results for the 30-frame exposure extract script on a single node with 4 GPUs, using 2 MPI ranks per GPU on corigpu and 5 MPI ranks per GPU on dgx. FPNH and FPGH are frames per node-hour and frames per GPU-hour.

Before:

| system | elapsed time (sec) | FPNH | FPGH |
| --- | --- | --- | --- |
| corigpu | 876.9 | 123.16 | 30.79 |
| dgx | 618.6 | 174.60 | 43.65 |

This PR:

| system | elapsed time (sec) | FPNH | FPGH |
| --- | --- | --- | --- |
| corigpu | 399.3 | 270.45 | 67.61 |
| dgx | 246.7 | 437.85 | 109.46 |
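For reference, the throughput columns can be approximately reproduced from the elapsed times. The sketch below assumes FPNH means frames per node-hour and FPGH means frames per GPU-hour, which is consistent with the figures above for 30 frames on one node with 4 GPUs. The function names are hypothetical, and small differences from the table are expected because the elapsed times are rounded.

```python
NFRAMES = 30        # frames in the exposure
GPUS_PER_NODE = 4   # single node with 4 GPUs

def fpnh(elapsed_sec, nframes=NFRAMES, nnodes=1):
    """Frames per node-hour (assumed meaning of FPNH)."""
    return nframes / (nnodes * elapsed_sec / 3600.0)

def fpgh(elapsed_sec, nframes=NFRAMES, ngpus=GPUS_PER_NODE):
    """Frames per GPU-hour (assumed meaning of FPGH)."""
    return nframes / (ngpus * elapsed_sec / 3600.0)

# Elapsed times in seconds, copied from the tables: (before, this PR).
elapsed = {"corigpu": (876.9, 399.3), "dgx": (618.6, 246.7)}

speedup = {}
for system, (before, after) in elapsed.items():
    speedup[system] = before / after
    print(f"{system}: FPNH {fpnh(before):.2f} -> {fpnh(after):.2f}, "
          f"FPGH {fpgh(before):.2f} -> {fpgh(after):.2f}, "
          f"speedup {speedup[system]:.2f}x")
```

By this calculation the elapsed time drops by roughly a factor of 2.2 on corigpu and 2.5 on dgx.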
