Open hcq9102 opened 10 months ago
Hi @mhaseeb123 Thanks for asking:
- Is this code taken/modified from lapack or blas examples or written from scratch? I see a few C-style snippets and `_WIN32` relics that seem to be coming from somewhere else.
This code is heavily modified, drawn mainly from an Intel open-source project and other online sources.
- Maybe I am not in the loop, but I thought we were implementing a `stdexec`/omp version of the Cholesky? From the last meeting, I remember you had a design ready for at least 3x3 matrices that you were implementing?
Yes, we were trying to implement a `stdexec`/omp version of the Cholesky.
If you remember, the paper "Task-Based Cholesky Decomposition on Knights Corner using OpenMP" describes an OpenMP task-based Cholesky decomposition, and there are other papers on tiled Cholesky decomposition as well.
According to the pseudocode in that paper, we first need the tiled Cholesky decomposition algorithm in order to construct the task graph (for OpenMP).
Once the tiled Cholesky decomposition algorithm works, we can implement the 'omp' version by adding OpenMP directives.
At the last meeting, I had a design working for at least 3x3 matrices for this tiled Cholesky decomposition; I was verifying its correctness and generalizing the code as much as possible.
- I don't get the point of using lapack or blas to build cholesky as those libraries already contain methods to do that directly: https://nalgebra.org/docs/user_guide/decompositions_and_lapack/
As I mentioned in Question 2, this implementation builds a tiled Cholesky out of functions from lapack or blas.
Yes, lapack/blas provide `LAPACKE_dpotrf()` to do this directly, but not in a sub-block (tiled) way.
As we have seen, NVIDIA's Cholesky implementation using their advanced library `#include <experimental/linalg>` is also a block Cholesky algorithm, but with a fixed 4 sub-blocks.
I cannot find a way to do a task-based 'omp' version if we only have the previously implemented code.
- Does this code run in serial or parallel fashion? If serial, what's the advantage of doing this over the serial version we already have?
Currently, I am running the code on a GPU. I have not compared the performance of this version against the serial one.
- Generally, I am not in favor of adding so many external dependencies (lapack/blas) in this repository unless they are absolutely needed for functionality of our code.
Totally understand. I was also worried about this external-dependency issue at the very beginning, but after two weeks of literature review I did not find a good way to support this kind of implementation without them. This algorithm has heavy data dependencies, I did not see many alternative implementation methods, and the supporting functions (`LAPACKE_dpotrf`, `cblas_dtrsm`, `cblas_dsyrk`, `cblas_dgemm`) cannot easily be written from scratch.
Of course, we can discuss whether or not to continue the 'omp' implementation based on this PR and its external dependencies (lapack/blas). I appreciate any suggestions on a task-based implementation for this algorithm.
Hi @hcq9102, thank you for taking the time to respond to questions.
That's fine but please include original licenses in files as well if required for transparency and legal reasons.
I would also recommend using `#include <experimental/linalg/..>` as it is natively available in our NVC++ compiler, instead of adding external lapack and blas libraries.
As you have a task-based parallel design ready (at least for a 3x3 matrix), I would strongly recommend implementing it with stdexec and `#include <experimental/linalg>` first. Doing so will allow you to work with `stdexec` and use CPUs/GPUs similar to other apps, without using external dependencies.
If possible, I would like to see some performance and correctness results before merging this into the `main` branch.
Hi @mhaseeb123 Thanks for your comments!
- I would also recommend using `#include <experimental/linalg/..>` as it is natively available in our NVC++ compiler, instead of adding external lapack and blas libraries.
Weile and I discussed the possibility of using `<experimental/linalg/..>` in the task-graph implementation a month ago, but one function we need is missing from `<experimental/linalg/..>`, so we had to turn to lapack, and mixing `<experimental/linalg/..>` with lapack is not ideal. But I will check the possibility of using `#include <experimental/linalg/..>` again.
- As you have a task-based parallel design ready (at least for a 3x3 matrix), I would strongly recommend implementing it with stdexec and `#include <experimental/linalg>` first. Doing so will allow you to work with `stdexec` and use CPUs/GPUs similar to other apps, without using external dependencies.
I will try to use senders/receivers if possible, but the external-library issue remains.
- If possible, I would like to see some performance and correctness results before merging this into the `main` branch.
I have verified correctness against the lapack function that computes the Cholesky decomposition directly. Yes, I will run performance tests.
General question: I noticed that we are accessing matrices via C-style indexing at several locations, for example the `matrix[row * n + col] =` syntax. Would it make more sense to use `mdspan`s for better syntax and readability?
- General question: I noticed that we are accessing matrices via C-style indexing at several locations, for example the `matrix[row * n + col] =` syntax. Would it make more sense to use `mdspan`s for better syntax and readability?
I believe the openblas library has a C-style interface, and I could not figure out how to use mdspan with it.
tiled_cholesky implementation.
BUILD NOTE: build the openblas library as needed (see the build OpenBlas script).
Include the openblas library when building, or `$ export OPENBLAS_DIR=/openblas/path`.
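For illustration, the build note might translate into a compile command like the following; the source file name and install prefix are placeholders, not the repository's actual build script:

```shell
# Placeholder paths: substitute your real OpenBLAS prefix and source file.
export OPENBLAS_DIR=/openblas/path
g++ -O2 -std=c++17 tiled_cholesky.cpp \
    -I"$OPENBLAS_DIR/include" \
    -L"$OPENBLAS_DIR/lib" -lopenblas \
    -o tiled_cholesky
```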
TODO: