feacluster opened this issue 2 years ago
After a lot of trial and error and emails with the Pastix developers, I got something working. I have v2.18 built on an Ubuntu VM with one Tesla K80 GPU. Using nvidia-smi I can confirm the GPU does appear to be working with Pastix.
Unfortunately, the solve times and memory usage are double what I expect compared to Pardiso or Spooles. If anyone has it built on a Linux machine with a GPU, I would like to send you a 1 million dof model to benchmark.
The reason for most of my build errors was that the nvidia drivers and cuda toolkit were not installed properly. The mistake I was making was not doing the pre-installation steps before running the cuda_10.1.243_418.87.00_linux.run script. I was under the impression the *.run file would do everything for me.
My next hurdle was installing standalone pastix. I could not get the older version used by Calculix to install, but I was able to get the latest version ( 6.2.1 ) to build.
I modified the simple.c example to read a matrix in ijv format from Calculix. From that I confirmed the GPU was being used in the factorization. It sped things up by adding around 30-40 gflops/second of computation power. For comparison, the 2 cpus on the machine totaled about 30 gflops/second, so the GPU made the machine perform roughly as though it had four to six cpus instead of just two.
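A minimal sketch of that kind of driver, loosely following the simple.c example shipped with PaStiX 6.2; the file name calculix_matrix.ijv and the single-GPU setting are placeholders, and exact signatures can differ between PaStiX versions:

#include <pastix.h>
#include <spm.h>

int main( void )
{
    pastix_data_t *pastix_data = NULL;
    pastix_int_t   iparm[IPARM_SIZE];
    double         dparm[DPARM_SIZE];
    spmatrix_t     spm;

    pastixInitParam( iparm, dparm );
    iparm[IPARM_SCHEDULER] = PastixSchedStarPU; /* GPU kernels run through the runtime scheduler */
    iparm[IPARM_GPU_NBR]   = 1;                 /* hand one GPU to the scheduler */
    pastixInit( &pastix_data, MPI_COMM_WORLD, iparm, dparm );

    /* read the matrix dumped from Calculix in ijv (coordinate) format */
    spmReadDriver( SpmDriverIJV, "calculix_matrix.ijv", &spm );

    pastix_task_analyze( pastix_data, &spm );   /* ordering + symbolic factorization */
    pastix_task_numfact( pastix_data, &spm );   /* numerical factorization, partly offloaded to the GPU */

    spmExit( &spm );
    pastixFinalize( &pastix_data );
    return 0;
}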
The last challenge was to modify the pastix.c file in the Calculix source to use the newest version of the pastix solver. I made some minor edits and was able to run some models. Unfortunately, I noticed the models took longer and used more memory than Pardiso. For some reason the matrices are being defined as "General" instead of "Symmetric", which results in double the expected memory usage and long solve times. Another strange thing is that the Pastix implementation solves the models a few times instead of once. It seems to do it first in mixed precision and then tries double precision.
I am not sure why they did this and will need to research further. There are some unsymmetric matrices in Calculix, but I believe they are a minority of cases.
Do I understand correctly that you are using the vanilla PaStiX version now? This won't work as expected, because we added things like keeping the reordering and merging matrix patterns to increase the chance of getting rid of the reordering, and we also keep the memory claimed on the GPU, because that actually takes a lot of time. We used General because LU factorisation seemed much more stable for us than Cholesky. Solving multiple times can happen if you can't get a decent residual in mixed precision with double-precision iterative refinement; in that case we try a double-precision factorisation.
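In vanilla PaStiX 6.x terms, the two setups being discussed look roughly like the sketch below; the helper and its flag are only for illustration:

#include <pastix.h>
#include <spm.h>

/* Illustrative helper (not part of PaStiX or CalculiX): pick one of the two setups. */
static void select_factorization( spmatrix_t *spm, pastix_int_t *iparm, int use_symmetric )
{
    if ( use_symmetric ) {
        /* symmetric storage + Cholesky (LL^t): one triangle stored, roughly half the memory */
        spm->mtxtype               = SpmSymmetric;
        iparm[IPARM_FACTORIZATION] = PastixFactLLT;
    }
    else {
        /* general storage + LU: what PaStiX4CalculiX uses, reported here as more stable */
        spm->mtxtype               = SpmGeneral;
        iparm[IPARM_FACTORIZATION] = PastixFactLU;
    }
}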
What I can already say is that frequency analysis will not be faster than pardiso.
But looking at the numbers, your GPU seems too slow; this must be FP64, correct? 30-40 GFLOPs/sec is basically not even one core on a new system.
Thanks for the information! Now some things make more sense. Yes, I did use the current Pastix with Calculix. I had to modify the pastix.c file by commenting out some functions that are not in the current pastix. You can see the changes below. With these changes I can run basic linear and non-linear models, but it gives a segmentation fault when trying to run an unsymmetric matrix, namely this one:
https://feacluster.com/CalculiX/ccx_2.18/doc/ccx/input_deck_viewer.php?input_deck=axrad.inp
You are right, the GPU is old and slow :-) It's a Tesla K80. I will try to re-run my comparisons on some newer hardware now that it is working well enough. However, I notice the 120 refinement steps take a long time to complete. My 1 million equation cantilever beam model has a total run time of 60 seconds with Pardiso, but 300+ seconds with Pastix (using the same number of cpus and the old gpu).
feacluster@gpu2:~/CalculiX/ccx_2.18/src$ diff pastix.c /home/feacluster/CalculiX/CalculiX_Source_Original/ccx_2.18/src/pastix.c
78a79
>
156a158
>
162d163
<
165,167c166,168
< // pastixResetSteps(pastix_data);
< // if(spm->values != aupastix && spm->values != NULL ) free(spm->values);
< // spm->values = aupastix;
---
> pastixResetSteps(pastix_data);
> if(spm->values != aupastix && spm->values != NULL ) free(spm->values);
> spm->values = aupastix;
210,211c211
< // iparm[IPARM_THREAD_NBR] = nthread_mkl;
< iparm[IPARM_THREAD_NBR] = 4;
---
> iparm[IPARM_THREAD_NBR] = nthread_mkl;
221,222c221,222
< // iparm[IPARM_REUSE_LU] = firstIter ? 0 : 1;
< // iparm[IPARM_REUSE_LU] = forceRedo ? 2 : 1;
---
> iparm[IPARM_REUSE_LU] = firstIter ? 0 : 1;
> iparm[IPARM_REUSE_LU] = forceRedo ? 2 : 1;
242d241
< // spm->mtxtype = SpmSymmetric;
250a250
>
442a443
>
444d444
<
472a473
>
486,487c487
< // pastixAllocMemory((void**)&aupastix, sizeof(double) * 1.1 * (nzsTotal * 2 + *neq), gpu);
< aupastix = calloc ( 1.1 * (nzsTotal * 2 + *neq), sizeof ( double ) );
---
> pastixAllocMemory((void**)&aupastix, sizeof(double) * 1.1 * (nzsTotal * 2 + *neq), gpu);
514a515,516
>
>
622,623c624
< // pastixAllocMemory((void**)&aupastix, sizeof(double) * 1.1 * (nzsTotal + *neq), gpu);
< aupastix = calloc ( 1.1 * (nzsTotal + *neq), sizeof ( double ) );
---
> pastixAllocMemory((void**)&aupastix, sizeof(double) * 1.1 * (nzsTotal + *neq), gpu);
720a722
>
895c897
< // if(spm->values == spm->valuesGPU) spm->valuesGPU = NULL;
---
> if(spm->values == spm->valuesGPU) spm->valuesGPU = NULL;
Hello,
I am jumping into the conversation since @feacluster told me about it.
For Cholesky, have you tried playing with DPARM_EPSILON_REFINEMENT? The default value may be too big for your problems, and that may be the reason for your instabilities.
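For illustration, a minimal sketch of setting it; the 1e-8 value is only an example to play with, and IPARM_ITERMAX similarly caps the number of refinement steps:

#include <pastix.h>

int main( void )
{
    pastix_data_t *pastix_data = NULL;
    pastix_int_t   iparm[IPARM_SIZE];
    double         dparm[DPARM_SIZE];

    pastixInitParam( iparm, dparm );
    dparm[DPARM_EPSILON_REFINEMENT] = 1e-8; /* example stopping tolerance for iterative refinement */
    iparm[IPARM_ITERMAX]            = 20;   /* cap on the number of refinement iterations */
    pastixInit( &pastix_data, MPI_COMM_WORLD, iparm, dparm );

    /* ... analyze / numfact / solve / refine as usual ... */

    pastixFinalize( &pastix_data );
    return 0;
}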
To make sure you don't redo the reordering all the time, I have started to look at your code, but it really looks like you have redone something that we already provide with the subtasks; I haven't found the time yet to look at it in more detail, sorry. (As I said to @feacluster, we are currently looking for an intern student to work on it, if you have candidates.)
Now, for the solve part, I'm curious: which one are you using? Are you doing SMP computations, or MPI/SMP?
If SMP, I would recommend switching from StarPU to dynamic for the solve (see the sketch after these two cases). Right now we have an overhead with StarPU that we have to work on. However, if you do that, you lose the GPUs.
If MPI/SMP, we currently do not have full support for the RHS; we are working on it, and the solve is actually done with MPI only, which slows it down a lot :(.
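In code, the SMP trade-off could look like the sketch below, using vanilla PaStiX 6.2 names; the helper itself is hypothetical:

#include <pastix.h>

/* Hypothetical helper showing the trade-off: the dynamic scheduler avoids the
 * StarPU overhead in pure SMP runs, but GPU offload is only reached through
 * the runtime schedulers (StarPU/PaRSEC). */
static void pick_scheduler( pastix_int_t *iparm, int want_gpu )
{
    if ( want_gpu ) {
        iparm[IPARM_SCHEDULER] = PastixSchedStarPU;
        iparm[IPARM_GPU_NBR]   = 1;
    }
    else {
        iparm[IPARM_SCHEDULER] = PastixSchedDynamic; /* CPU only */
    }
}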
I hope some of these issues will be improved in the next release, but for now we do not plan to fix all of them yet. Too many things on our plates.
Hi @mfaverge,
After going through some issues and the code over the last months, I honestly already assumed that we had missed some subtasks back then. My plan is to rework it and stick as closely as possible to your vanilla code, but I don't have a timeline yet for when I'll find the motivation. In short, what we want is to keep all memory allocations and be able to update only the values of the same matrix structure. We also keep using the same permutation for the bcsc.
For Cholesky I don't remember exactly what we did back then, but I assume we didn't touch DPARM_EPSILON_REFINEMENT; I can check the commit history at work.
We are only using SMP, because MPI was still work in progress back then, and our use case typically doesn't scale very well, so normally we don't go above 8 CPUs (this also has to do with the CalculiX side). Until now my company has been using your parsec fork because it performed better than StarPU, Dynamic and Static; perhaps this was affected by the overhead you just mentioned. For flexibility (e.g. AMD GPUs) I wanted to look at StarPU again anyway, and I assume you don't have time to keep your parsec fork up to date, if that is even still necessary, since you mentioned before that there was a lot going on on the upstream parsec side.
@feacluster The 120 refinement steps mean that it doesn't meet the convergence criterion and just keeps trying until it hits the maximum number of refinement steps. Normally you should converge within 20 steps, often in as few as 7, so I assume the much longer runtime comes from this.
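For what it's worth, whether that cap was hit can be checked after the refinement step; the output fields used below (IPARM_NBITER, DPARM_RELATIVE_ERROR) are assumed from pastix/api.h:

#include <stdio.h>
#include <pastix.h>

/* Hypothetical check to run after pastix_task_refine(). */
static void report_refinement( const pastix_int_t *iparm, const double *dparm )
{
    if ( iparm[IPARM_NBITER] >= iparm[IPARM_ITERMAX] ) {
        fprintf( stderr, "refinement stopped at the %ld-step cap, residual %.3e\n",
                 (long)iparm[IPARM_NBITER], dparm[DPARM_RELATIVE_ERROR] );
    }
}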
Hello @Kabbone, From what you describe, normally you should be able to do it by applying what is done in the step-by-step example.
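Roughly, the pattern is to run the analysis once and then repeat only the numerical steps when the values change but the structure stays the same, something like this sketch; update_values() is a placeholder for refreshing the values from the application, and the signatures follow the 6.x examples, so they may differ between releases:

#include <string.h>
#include <pastix.h>
#include <spm.h>

void update_values( spmatrix_t *spm ); /* placeholder provided by the application */

/* Analysis is done once; only the numerical steps are repeated when the values
 * change but the sparsity pattern (and therefore the permutation) stays the same. */
void factor_and_solve_series( pastix_data_t *pastix_data, spmatrix_t *spm,
                              double *b, double *x, int nsteps )
{
    pastix_task_analyze( pastix_data, spm );        /* ordering + symbolic factorization, once */

    for ( int step = 0; step < nsteps; step++ ) {
        update_values( spm );                       /* new values, same sparsity pattern */
        pastix_task_numfact( pastix_data, spm );    /* reuses the existing analysis */

        memcpy( x, b, spm->n * sizeof(double) );
        pastix_task_solve( pastix_data, 1, x, spm->n );
        pastix_task_refine( pastix_data, spm->n, 1, b, spm->n, x, spm->n );
    }
}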
For PaRSEC, I actually started to look at updating the branch in pastix :). The compilation was not such a big issue and I solved it quickly. But then I encountered many problems with the new descriptions of the data transfers in the edges, which gave me a headache to understand, so I gave up for now: it works on a shared-memory system, but as soon as you have transfers (distributed or heterogeneous), it fails. I still hope I'll be able to do it in the near future :(. I'll let you know as soon as it is done.
Hello! I would still like to know this. Were you able to fix this error "sopalin/parsec/CMakeFiles/parsec_headers_tgt.dir/build.make:64: *** missing separator."? If so, can you tell me how?
Sorry, updating the code to match the new PaRSEC compiler was too much effort to be done quickly. It has moved to the list of pending issues. As a matter of fact, we proposed an internship subject today to work on this. Maybe we will have a student look into it.
Thank you for the answer! I planned to use StarPU, but first I would like to make the library work with CUDA. I turned PaRSEC off when building, using the CMake option "-DPASTIX_WITH_PARSEC=OFF". Maybe there is a way to build the library for CalculiX with only CUDA? Forgive me if I don't understand something, I'm new to this :)
I attached my make_pastix.sh, the CMake build log file and the GNU make log file. Also, I am using CMake 3.22.1, GNU Make 4.3 and Ubuntu 22.04.2 LTS. logs.zip
Hello,
I'm not sure what you want to do here; there are many parameters given to the cmake command that are contradictory or useless.
For -DPASTIX_WITH_EXTERNAL_SPM=OFF -DSPM_DIR=/usr/local/PaStix/spm/bin: OFF is the default for the external SPM, so there is no need to specify it, and if it is off, there is no need to specify the SPM directory either. Maybe you can define the BLAS to use through the BLA_VENDOR variable and make sure that the libraries are in your environment.
One thing that may create issues is that the library found for BLAS is not the same as the one found for LAPACK and TMG:
-- BLAS_LIBRARIES /home/guido/OpenBLAS_i8/lib/libopenblas.a
...
-- LAPACK_LIBRARIES /usr/local/lib/libopenblas.a
...
-- Found TMG: /usr/local/lib/libopenblas.a
Finally, the make error you have seems weird. It looks like you have some issues with the python script that generates the files. I know that this problem existed at some point, but it was supposed to be fixed.
Hello, I am still getting the missing separator error when running ./make_pastix.sh, and I haven't seen any fixes online. I'm trying to create a Singularity container to run on our HPC server. There is a reference to switching "python3 to python2.7" but I don't know where one would do that.
Here is the definition file I have so far:
Bootstrap: docker
From: nvidia/cuda:12.6.0-devel-ubuntu20.04
%environment
TZ=Etc/UTC
%post
apt-get -y update
DEBIAN_FRONTEND=noninteractive apt-get install -y cmake wget intel-mkl pkg-config build-essential git openmpi-bin openmpi-doc libopenmpi-dev python3-dev python3-pip python2.7 flex bison 2to3 vim
wget https://bootstrap.pypa.io/pip/2.7/get-pip.py
python2.7 get-pip.py
python2.7 -m pip install cython
python3 -m pip install cython
wget http://www.dhondt.de/ccx_2.22.src.tar.bz2
cp ccx_2.22.src.tar.bz2 /usr/local
cd /usr/local
bzip2 -d ccx_2.22.src.tar.bz2
tar xf ccx_2.22.src.tar
mkdir ~/PaStiX
wget https://download.open-mpi.org/release/hwloc/v2.1/hwloc-2.1.0.tar.bz2
mv hwloc-2.1.0.tar.bz2 ~/PaStiX
cd ~/PaStiX
bzip2 -d hwloc-2.1.0.tar.bz2
tar xf hwloc-2.1.0.tar
cp /usr/local/CalculiX/ccx_2.22/src/make_hwloc.sh ~/PaStiX/hwloc-2.1.0
cd ~/PaStiX
git clone https://bitbucket.org/mfaverge/parsec/src/pastix-6.0.2/
mv pastix-6.0.2 parsec
cp /usr/local/CalculiX/ccx_2.22/src/make_parsec.sh ~/PaStiX/parsec
git clone https://gitlab.inria.fr/scotch/scotch.git
cp /usr/local/CalculiX/ccx_2.22/src/make_scotch.sh ~/PaStiX/scotch
git clone https://github.com/Dhondtguido/PaStiX4CalculiX
mv PaStiX4CalculiX pastix_src
cp /usr/local/CalculiX/ccx_2.22/src/make_pastix.sh ~/PaStiX/pastix_src
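# Build each dependency in turn, patching the bundled make_*.sh helpers to use
# this home directory and CUDA 12.6 instead of the hard-coded /home/guido and 10.2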
cd ~/PaStiX/hwloc-2.1.0
sed -i 's/\/home\/guido/$HOME/' make_hwloc.sh
sed -i 's/10\.2/12.6/' make_hwloc.sh
./make_hwloc.sh
cd ~/PaStiX/parsec
sed -i 's/\/home\/guido/$HOME/' make_parsec.sh
sed -i 's/10\.2/12.6/' make_parsec.sh
mkdir -p ~/PaStiX/parsec/build/tools/profiling/python
ln -s ~/PaStiX/parsec/build/tools/profiling/python/pbt2ptt.cpython-38-x86_64-linux-gnu.so ~/PaStiX/parsec/build/tools/profiling/python/pbt2ptt.so
./make_parsec.sh
cd ~/PaStiX/scotch
sed -i 's/\/home\/guido/$HOME/' make_scotch.sh
sed -i 's/10\.2/12.6/' make_scotch.sh
./make_scotch.sh
cd ~/PaStiX/pastix_src
sed -i '23i\ -DCMAKE_LIBRARY_PATH=/usr/local/cuda/lib64/stubs \\' make_pastix.sh
sed -i 's/\/home\/guido/$HOME/' make_pastix.sh
sed -i 's/10\.2/12.6/' make_pastix.sh
./make_pastix.sh
Indeed, everything should normally have been updated to python3 on our side. Maybe there are some remaining leftovers in the old parsec tag that cause trouble.
When running make_pastix.sh, I am receiving the following error: