Dhondtguido / PaStiX4CalculiX


Will PaStiX4CalculiX work with MPI across multiple hosts? #10

Open feacluster opened 2 years ago

feacluster commented 2 years ago

Or is it only meant to be run on a single multi-core machine? I do see there are instructions for how to build Pastix with MPI support here:

https://solverstack.gitlabpages.inria.fr/pastix/md_docs_doxygen_chapters_Pastix_MPI.html
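
For anyone following along, a minimal build sketch based on my reading of that linked PaStiX documentation is below. The repository URL, the `PASTIX_WITH_MPI` option name, and the install prefix are assumptions to be checked against the version you actually build; this exact sequence has not been tested in this thread.

```sh
# Sketch (untested here): configure PaStiX 6.x with MPI enabled.
# Option names are taken from the linked PaStiX docs -- verify for your version.
git clone --recursive https://gitlab.inria.fr/solverstack/pastix.git
cd pastix && mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release \
         -DPASTIX_WITH_MPI=ON \
         -DCMAKE_INSTALL_PREFIX="$HOME/pastix"
make -j"$(nproc)" install
```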

Kabbone commented 2 years ago

At this point this is a PaStiX-only feature, meaning that in theory you could run the solving part through MPI, but the CalculiX part is only parallelized with pthreads/OpenMP. So it's not going to be worth it in this state, because the solving part is only around 50% of the total computation time (depending on the model).
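
A rough Amdahl's-law illustration of this point (my own numbers, assuming the solve is exactly half of the total wall time and everything outside the solve sees no benefit from MPI):

$$ S_{\text{total}} = \frac{1}{(1-p) + p/s}\,, \qquad p = 0.5 \;\Rightarrow\; S_{\text{total}} < \frac{1}{1-0.5} = 2 $$

So even a perfectly scaling MPI solve cannot cut the overall run time by more than a factor of two while the rest of CalculiX stays on a single node.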

feacluster commented 2 years ago

Thanks for clarifying! I am going to play around with the MPI feature in PaStiX and see what kind of speedups I get on various systems/networks.

A few years ago I managed to integrate the MPI version of Pardiso into CalculiX, but there was not much interest in it, so I abandoned that project. If I recall correctly, it was not too complex; I don't think I added more than 50 lines of code to the CalculiX source.

Kabbone commented 2 years ago

By the way, this PaStiX version is still based on 6.0.2, which I think does not have the complete MPI integration. I was planning to clean up our commits at some point and rebase onto the latest version, but I don't have a timetable for this yet.

feacluster commented 2 years ago

Thanks, understood. Integrating the MPI version is not something I would attempt until I am very comfortable with how PaStiX works. I was browsing the 2.18 source code and am very impressed by all the changes you made to integrate PaStiX! Was this part of some university project or a personal effort?

Anyway, I did some tests of the MPI PaStiX on a mini cluster of Raspberry Pis. Running this example (simple -s 4 -9 "100:100:20"), it took:

| Time to factorize (seconds) | CPUs | Pis |
| --- | --- | --- |
| 115 | 1 | 1 |
| 28 | 4 | 1 |
| 24 | 8 | 2 |
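
Reading approximate speedups off the table (my own arithmetic, not from the original measurements):

$$ \frac{115}{28} \approx 4.1\times \ \text{(4 cores, 1 Pi)}, \qquad \frac{115}{24} \approx 4.8\times \ \text{(8 cores, 2 Pis)} $$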

The command I used was:

```sh
mpiexec -np 8 --map-by core -hostfile ./hostfile ./examples/simple -s 4 -9 100:100:20
```
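
For reference, a hostfile in the Open MPI format for this kind of run might look like the sketch below; the hostnames and slot counts are placeholders, not taken from the actual setup.

```sh
# Create ./hostfile for a two-Pi, four-cores-each setup (placeholder names).
cat > ./hostfile <<'EOF'
pi1 slots=4
pi2 slots=4
EOF
```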

There was no speedup going to 12 CPUs (3 Pis); in fact, the factorization time was worse than with 4 CPUs.

Unfortunately, I am having some issues installing it on a CentOS 7 system. The instructions call for liblapacke-dev, but there is no such package for CentOS; I can only install lapack-devel, which I believe may be different. The code compiles but gives errors about OpenBLAS memory, and the results are totally wrong.
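
Not a confirmed fix, but one workaround worth trying on CentOS 7, where no liblapacke-dev package exists: build OpenBLAS from source (it bundles LAPACK/LAPACKE) and point the PaStiX build at that install. The paths, repository URL, and CMake hint below are assumptions, not something tested in this thread. Also, if the OpenBLAS error is the usual "too many memory regions" message, it often indicates thread oversubscription, so limiting OpenBLAS to one thread is a cheap experiment.

```sh
# Sketch (untested here): build OpenBLAS with its bundled LAPACKE on CentOS 7
# and install it into the home directory.
git clone https://github.com/OpenMathLib/OpenBLAS.git
cd OpenBLAS
make -j"$(nproc)"
make PREFIX="$HOME/openblas" install

# Then point the PaStiX CMake configure step at it, e.g.:
#   cmake .. -DCMAKE_PREFIX_PATH="$HOME/openblas" ...
# If the run aborts with OpenBLAS memory-region errors, try:
export OPENBLAS_NUM_THREADS=1
```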

I presume I will need to contact the PaStiX developers. It seems I have to email the main author, as there doesn't appear to be a way for the public to open an issue.

Kabbone commented 2 years ago

The project to implement it was combined with a Master's thesis (by Peter Wauligmann, as seen in the commits) which I supervised. We mainly had to make a lot of changes to cache the memory allocation and the reordering.

In my tests (with the CalculiX integration already in place) I also tried a 64-core CPU (one node) with OpenBLAS, and the scaling was very limited above 16 cores. I assume in your example you probably also hit a bottleneck on the network side. Did you check whether all the nodes are fully utilized in this example? I have no experience with the MPI version of PaStiX yet, but perhaps the decomposition doesn't split into equal parts.
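
One quick way to answer the utilization question, sketched under the assumption that the nodes are reachable over ssh and have the sysstat package installed (hostnames are placeholders):

```sh
# Spot-check average CPU utilization on each Pi while the factorization runs.
for host in pi1 pi2 pi3; do
    echo "== $host =="
    ssh "$host" "mpstat 1 3 | tail -n 1"   # prints the 'Average:' line
done
```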

Opening PaStiX issues was at least possible on their GitLab page at Inria during our project.