This PR attempts to leverage more batched operations on the GPU by refactoring the extraction so that it works on a stack of patches rather than on individual patches. This is a little tricky because the patches on the CCD do not all have the same shape. They do stack cleanly once the projection and weight matrices are applied, so the heavy linear algebra operations (solve and eigh) can be performed in batch.
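To illustrate the stacking idea, here is a minimal sketch, not the code in this PR. It uses NumPy so that it runs anywhere; the patch sizes, the variable names (`A`, `w`, `pix`, `icov`, `y`) and the random data are all hypothetical. It assumes each patch reduces to a weighted least-squares system, so that patches with different pixel counts all produce an `(nflux, nflux)` matrix and an `(nflux,)` vector that can be stacked.

```python
import numpy as np

rng = np.random.default_rng(0)
nflux = 4  # hypothetical number of unknowns per patch

# Patches with different pixel counts: the raw inputs cannot be stacked.
npix_per_patch = [30, 42, 36]
patches = []
for npix in npix_per_patch:
    A = rng.normal(size=(npix, nflux))    # projection matrix (illustrative)
    w = rng.uniform(0.5, 1.5, size=npix)  # per-pixel weights (illustrative)
    pix = rng.normal(size=npix)           # pixel values (illustrative)
    patches.append((A, w, pix))

# Once projection and weights are applied, every patch has the same shape.
icov = np.stack([(A.T * w) @ A for A, w, pix in patches])  # (npatch, nflux, nflux)
y = np.stack([(A.T * w) @ pix for A, w, pix in patches])   # (npatch, nflux)

# One batched call each instead of a Python loop over patches.
flux = np.linalg.solve(icov, y[..., None])[..., 0]  # (npatch, nflux)
evals, evecs = np.linalg.eigh(icov)                 # (npatch, nflux), (npatch, nflux, nflux)
```

The per-patch loop survives only in the cheap step that builds `icov` and `y`; the expensive solve and eigh calls each run once over the whole stack.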
The tables below show before and after results from the 30-frame exposure extract script, run on a single node with 4 GPUs and 2 MPI ranks per GPU on corigpu (5 MPI ranks per GPU on dgx).
Before:
This PR: