Parallel performance - Githubissues

gkovacsds commented 1 year ago

Dear Mr.Chen, we have tested your CKTSO matrix solver in our circuit simulation software and generally liked the performance. We tried it on AMD/Intel Windows platforms, with CSR formatted matrices (row-major). However we did not achieve multi-thread performance improvement over single-thread mode, only slowdown. We did transient simulations - many refactorization and solve calls. We think we are using the library as recommended in the user guide.

To check that we are doing everything properly, could you help in these:

if we send you a number of exported MTX files with RHS vectors, can you evaluate if there's a multi-thread speed gain in your test environment? Do you need anything else?
are there some statistical dll calls or output values to see if CKTSO has really decided to use multiple threads or used sequential solving?

Thank you and best regards, Gergely

chenxm1986 commented 1 year ago

You may check the following:

Check the system workload and the number of threads passed to CKTSO_Analyze. It should be guaranteed that the number of free cores is larger than or at least equal to the number of threads of CKTSO.
If the above condition holds and the situation does not change, simply use demo/benchmark.cpp to perform benchmarking test on dumped mtx file.

If the above two tests do not change the situation, please send me the mtx and rhs files and I will test them.

For the current version (20230325), oparm[27] (for row mode, i.e., Ax=b) and oparm[28] (for column mode, i.e., (A^T)x=b) reflect the solving algorithms. Their meaning is a little complex. They are two-digit numbers in decimalism. The least significant digit is the solving algorithm for L, and the other digit is for U. 0 means sequential and 1,2,3 mean the corresponding parallel algorithms. For example, if oparm[27]=20, it means the row mode solving uses sequential algorithm for Ly=b and uses the 2nd parallel algorithm for Ux=y.

gkovacsds commented 1 year ago

Thank you for the answer, it took some time to check all I could. For some reason the benchmark program behaves differently than my test program, though I pass the number of threads to CKTSO_Analyze. It never uses parallel mode (oparm[27] is always 0). The add20 matrix file also solves not parallel. iparm is the same, oparm is only different in [27] (and in the benchmark times).

My Windows test program is not in Visual Studio (Delphi), and uses the DLL through the C interface. The Benchmark program uses the ICktSo interface - could this difference be the problem? Please advise how I could enable the use of parallel solving? Is it restricted on DLL only use? Do I have to use the C lib for it?

Also may I ask if CKTSO uses custom created threads or it is based on a ready-made solution like OpenMP?

Anyway I modified the benchmark program so it can read unsorted MTX files too. It may be useful for you too. It can also read more mtx files (with the same matrix structure), and solves them all repeatedly. Can also read RHS files, so I could also check solution validity compared with other solvers. CKTSO was solving well of course.

On this page you can find the modified benchmark VS project, if interested. I also collected the files documenting my test: CSR matrix dumps, Screenshot that 7 additional threads are spawn (for 8 threads run), but solving is then not parallel (oparm[27] is 0) Some logs with iparm, oparm and VS debug screens.

Thank you for any suggestions.

chenxm1986 commented 1 year ago

Thank you for providing the detailed information. I first provide some quick answers. After I try your cases, I will answer other questions. The C interface and C++-like interface behave exactly the same. However, linux and windows libraries may behave differently. Whether to use parallel solving depends on many factors and CKTSO has some automatic determination methods. I will provide more information later. CKTSO uses native API, i.e., pthread/Windows API, for thread management. No openmp is used. From your reply, it looks that the issues are focused on parallel solving? Do you have performance issues in parallel factorization or refactorization?

gkovacsds commented 1 year ago

My main problem is, that checking oparm[27] after CKTSO_Solve, it is zero even with the example matrix add20.mtx I don't know why it chooses sequential solve when I pass the number of threads to CKTSO_Analyze. I use the 14th May DLL. With the benchmark cpp it is OK, but my test program is always 0 for oparm[27]

gkovacsds commented 1 year ago

If you need my test program , I attached it. Maybe you can inspect the dll calls and how cktso.dll works with that. Just click on "Solve!". matrix solver test.zip

gkovacsds commented 1 year ago

Dear Mr.Chen, I send 3 matrix files then, please look at the zip file attached here. I also attach the matrix structure drawings. We only have success with "20knodes...", it switches to parallel solving and it's faster. With the others the solver seems to choose sequential solving. The "NMOSx8..." example seems quite structured, still not possible to switch to parallel solving? The "TPS40140..." is a more realistic circuit matrix, quite dispersed though. Can we have manual control on which solver to choose, or will it always be automatic? I see speedup in matrix analysis when enabling multiple threads, but not in factorization or solving. mtx.zip nmosx8 20k

chenxm1986 commented 1 year ago

Thank you so much for providing your test cases. I have tried your cases. Here are some notes for the observations.

CKTSO has automatic methods to decide whether to use parallel factorization, refactorization and solving or not. They generally depend on the matrix dimension and the sparsity of L and U. If the matrix is small, or L and U are too sparse, CKTSO tends to use sequential algorithms. The reason is that small matrix or sparse factors lead to too little relative computation, with respect to other overheads like parallel scheduling and memory access. Using parallel algorithms can lead to lower performance. In your cases, TPS is a small case. NMOS is very sparse, it only has on average 1.53 floating-point operations per L/U nonzero element, while this metric is 96 for 20knodes. This is why CKTSO uses parallel algorithms only for 20knodes.
iparm[9] controls the automatic methods. You can simply disable iparm[9] to test parallel algorithms.
The first time parallel solving includes other overheads to build the parallel scheduler, so please test subsequent solvings. The same is for factorization and refactorization.
Even for sequential solving, the solving time following parallel factorization/refactorization may be a little longer than that following sequential factorization/refactorization. This is because parallel factorization/refactorization uses different order to store L and U which lowers cache hit.
Solving is much more difficult to parallelize than refactorization. So there are some cases for which refactorization is parallel but solving is sequential, according to the auto-selected methods of CKTSO. Let me know if you have any further questions.

chenxm1986 / cktso

Parallel performance #12