Open hcq9102 opened 10 months ago
Hi @mhaseeb123 Thanks for asking:
- Is this code taken/modified from lapack or blas examples or written from scratch? I see a few C-style snippets and `_WIN32` relics that seem to be coming from somewhere else.
This code is heavily modified, drawn mainly from an Intel open-source project and other online sources.
- Maybe I am not in the loop, but I thought we were implementing a `stdexec`/omp version of the Cholesky? From the last meeting, I remember you had a design ready for at least 3x3 matrices that you were implementing?
Yes, we were trying to implement a `stdexec`/omp version of the Cholesky.
If you remember, the paper "Task-Based Cholesky Decomposition on Knights Corner using OpenMP" describes an OpenMP task-based Cholesky decomposition, and there are other papers on tiled Cholesky decomposition as well.
According to the pseudocode in that paper, we first need the tiled Cholesky decomposition algorithm in order to construct the task graph (for OpenMP).
Once the tiled Cholesky decomposition algorithm works, we can implement the 'omp' version by adding OpenMP directives.
At the last meeting, I had a design working for at least 3x3 matrices for this tiled Cholesky decomposition; I was verifying its correctness and generalizing the code as much as possible.
- I don't get the point of using lapack or blas to build cholesky as those libraries already contain methods to do that directly: https://nalgebra.org/docs/user_guide/decompositions_and_lapack/
As I mentioned in Question 2, this implementation builds a tiled Cholesky out of functions from lapack or blas.
Yes, lapack/blas provide `LAPACKE_dpotrf()` to do this directly, but not in a sub-block (tiled) way.
As we have seen, NVIDIA's Cholesky implementation using their advanced library `#include <experimental/linalg>` is also a block Cholesky algorithm, but with a fixed 4 sub-blocks.
I cannot find a way to do a task-based 'omp' version if we only have the previously implemented code.
- Does this code run in serial or parallel fashion? If serial, what's the advantage of doing this over the serial version we already have?
Currently, I am running the code on a GPU. I have not compared the performance of this version against the serial one.
- Generally, I am not in favor of adding so many external dependencies (lapack/blas) in this repository unless they are absolutely needed for functionality of our code.
Totally understand. I was also worried about this external-dependency issue at the very beginning, but after two weeks of literature review I did not find a good way to support this kind of implementation without them. This algorithm has heavy data dependencies, I did not see many alternative implementation methods, and the supporting functions (`LAPACKE_dpotrf`, `cblas_dtrsm`, `cblas_dsyrk`, `cblas_dgemm`) cannot easily be written from scratch.
Of course, we can discuss whether or not to continue the 'omp' implementation based on this PR and its external dependencies (lapack/blas). I appreciate any suggestions on a task-based implementation for this algorithm.
Hi @hcq9102, thank you for taking the time to respond to questions.
That's fine but please include original licenses in files as well if required for transparency and legal reasons.
I would also recommend using `#include <experimental/linalg/..>` as it is natively available in our NVC++ compiler, instead of adding external lapack and blas libraries.
As you have a task-based parallel design ready (at least for a 3x3 matrix), I would strongly recommend implementing it with stdexec and `#include <experimental/linalg>` first. Doing so will allow you to work with `stdexec` and use CPUs/GPUs similar to other apps, without using external dependencies.
If possible, I would like to see some performance and correctness results before merging this into the `main` branch.
Hi @mhaseeb123 Thanks for your comments!
- I would also recommend using `#include <experimental/linalg/..>` as it is natively available in our NVC++ compiler, instead of adding external lapack and blas libraries.
Weile and I discussed the possibility of using `<experimental/linalg/..>` in the task-graph implementation a month ago, but one function we need is missing from `<experimental/linalg/..>`, so we had to turn to lapack, and mixing `<experimental/linalg/..>` with lapack is not ideal. But I will check the possibility of using `#include <experimental/linalg/..>` again.
- As you have a task-based parallel design ready (at least for a 3x3 matrix), I would strongly recommend implementing it with stdexec and `#include <experimental/linalg>` first. Doing so will allow you to work with `stdexec` and use CPUs/GPUs similar to other apps, without using external dependencies.
I will try to use senders/receivers if possible, but the external-library issue remains.
- If possible, I would like to see some performance and correctness results before merging this into the `main` branch.
I have verified correctness against the lapack function that computes the Cholesky decomposition directly. Yes, I will run performance tests.
General question: I noticed that we are accessing matrices via C-style indexing at several locations, for example the `matrix[row * n + col] =` syntax. Would it make more sense to use `mdspan`s for better syntax and readability?
- General question: I noticed that we are accessing matrices via C-style indexing at several locations, for example the `matrix[row * n + col] =` syntax. Would it make more sense to use `mdspan`s for better syntax and readability?
I believe the openblas library has a C-style interface, and I could not figure out how to use mdspan with it.
tiled_cholesky implementation.
BUILD NOTE: build the openblas library as needed (see the build OpenBlas script).
Include the openblas library when building, or `$ export OPENBLAS_DIR=/openblas/path`.
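For illustration, the build note might translate into a compile command like the following; the source file name and install prefix are placeholders, not the repository's actual build script:

```shell
# Placeholder paths: substitute your real OpenBLAS prefix and source file.
export OPENBLAS_DIR=/openblas/path
g++ -O2 -std=c++17 tiled_cholesky.cpp \
    -I"$OPENBLAS_DIR/include" \
    -L"$OPENBLAS_DIR/lib" -lopenblas \
    -o tiled_cholesky
```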
TODO: