Open ajensen1234 opened 8 months ago
Implemented in #14 - linking against that branch and pulling into main.
Reopened for GPU
We can define B, C, D, and E asynchronously in the convergence pipeline.
We may need to write a BLAS-style matrix multiplication routine so that the data stays on the GPU for as long as possible.
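A rough sketch of what defining B, C, D, and E asynchronously could look like, in plain Python with a thread pool. The thread does not say how these terms are actually computed, so `compute_term` and its scale factors are hypothetical stand-ins, and `compute_terms_async` is an illustrative name rather than anything in the codebase. The only point is the shape: submit all four at once, then collect the results when the pipeline needs them.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_term(name, x):
    # Hypothetical placeholder: the real definitions of B, C, D, E
    # are not given in this thread. Each term just scales the input.
    scale = {"B": 1.0, "C": 2.0, "D": 3.0, "E": 4.0}[name]
    return [scale * v for v in x]


def compute_terms_async(x):
    names = ("B", "C", "D", "E")
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        # Submit all four terms up front so they can run concurrently.
        futures = {name: pool.submit(compute_term, name, x) for name in names}
        # result() blocks until that particular term is ready.
        return {name: fut.result() for name, fut in futures.items()}


terms = compute_terms_async([1.0, 2.0])
```

In CPython, threads only give a real speedup when the work releases the GIL (for example inside NumPy or while waiting on a GPU kernel), so this is about structure, not measured performance. On the GPU side the analogous idea would be launching the four computations on separate streams.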
One aspect of this is that it runs element by element, so it might be a good candidate for multithreading and parallelization, maybe even on the GPU.
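Since each element is handled independently, the work can be split into chunks and handed to separate workers without changing the result. A minimal sketch of that idea follows, assuming a made-up per-element operation (`elementwise_op` is hypothetical, as are the function names and the default worker count):

```python
from concurrent.futures import ThreadPoolExecutor


def elementwise_op(v):
    # Hypothetical per-element operation, standing in for the real one.
    return v * v + 1.0


def process_chunk(chunk):
    # Each worker handles one contiguous slice of the input.
    return [elementwise_op(v) for v in chunk]


def parallel_elementwise(values, n_workers=4):
    if not values:
        return []
    # Ceiling division so every element lands in some chunk.
    chunk_size = max(1, -(-len(values) // n_workers))
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map returns results in the same order as the input chunks.
        results = pool.map(process_chunk, chunks)
        return [y for chunk in results for y in chunk]


data = [float(i) for i in range(10)]
out = parallel_elementwise(data)
```

The same decomposition is what a GPU kernel would do at a finer grain, with one thread per element instead of one worker per chunk. As written, pure-Python threads will not run this faster because of the GIL; the sketch only shows that the chunked result matches the serial one.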