JuliaFEM / JuliaFEM.jl

The JuliaFEM software library is a framework that allows for the distributed processing of large Finite Element Models across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
http://juliafem.github.io/JuliaFEM.jl/latest/
MIT License

Parallel computing #15

Closed: ahojukka5 closed this issue 8 years ago

ahojukka5 commented 9 years ago

Like almost all the others, we could start with Trilinos. It has Python bindings, so I guess it's usable from Julia too with some effort. Unfortunately nothing compiles for me, but maybe someone else has better luck. (If you succeed, please share.)

Links

[1] http://trilinos.org/oldsite/packages/pytrilinos/UsersGuide.pdf (1D Poisson problem on p. 14)

[2] http://www.researchgate.net/publication/231168298_Parallel_FEM_Application_Development_in_Python

[3] http://www.sciencedirect.com/science/article/pii/S0309170811000777

[4] http://www.scientificpython.net/pyblog/domain-decomposition-alternating-schwarz

[5] http://calcul.math.cnrs.fr/Documents/Ecoles/ET2011DD/VDolean1.pdf

[6] https://github.com/dpo/MUMPS.jl

[7] https://github.com/JuliaSparse/Metis.jl

[8] https://github.com/JuliaGraphs/LightGraphs.jl

[9] http://www.scs.leeds.ac.uk/pkj/Papers/Books/JT98.pdf

[10] http://www.mcs.anl.gov/petsc/documentation/tutorials/jhouchins_writeup_revised.pdf (p. 20)

[11] http://www.math.tamu.edu/%7Ebangerth/publications/2010-parsim2010.pdf (1000-core results)

[12] http://p4est.github.io/papers/BangerthBursteddeHeisterEtAl11.pdf

[13] http://tu-dresden.de/die_tu_dresden/fakultaeten/fakultaet_mathematik_und_naturwissenschaften/fachrichtung_mathematik/institute/wir/dissertation/thomas_witkowski

[14] https://courses.engr.illinois.edu/cs554/fa2013/notes/10_iterative_8up.pdf

[15] https://github.com/JuliaParallel

[16] http://libelemental.org

[17] http://www.math.tamu.edu/~bangerth/videos/676/slides.41.pdf

[18] http://www.cats.rwth-aachen.de/teaching/para-2013 (Parallel computing for computational mechanics, lecture notes)

[19] http://mat.fsv.cvut.cz/komisevstez/13/prispevky/sousedik.pdf (Finite Element Solution Algorithm for Nonlinear Elasticity Problems by Domain Decomposition Method)

[20] https://en.wikipedia.org/wiki/BDDC [21] https://en.wikipedia.org/wiki/FETI-DP [22] https://en.wikipedia.org/wiki/Balancing_domain_decomposition_method

[23] http://www.supercomp.org/sc2003/paperpdfs/pap238.pdf (IPSAP : A High-performance Parallel Finite Element Code for Large-scale Structural Analysis Based on Domain-wise Multifrontal Technique)

[24] http://am2013.math.cas.cz/proceedings/contributions/vejchodsky.pdf (A DIRECT SOLVER FOR FINITE ELEMENT MATRICES REQUIRING O(N log N ) MEMORY PLACES)

[25] http://ocw.mit.edu/resources/res-2-002-finite-element-procedures-for-solids-and-structures-spring-2010/linear/lecture-9/MITRES2_002S10_lec09.pdf (Solution of finite element equilibrium equations in static analysis)

[26] http://infolab.stanford.edu/pub/cstr/reports/na/m/80/06/NA-M-80-06.pdf (A NEW IMPLEMENTATION OF SPARSE GAUSSIAN ELIMINATION)

TeroFrondelius commented 9 years ago

:+1:

ovainola commented 9 years ago

I also tried to build Trilinos but had no luck. My build command failed at this point:

[ 38%] Building CXX object packages/seacas/libraries/ioss/src/transform/CMakeFiles/Iotr.dir/Iotr_VectorMagnitude.C.o
[ 38%] Building CXX object packages/seacas/libraries/ioss/src/transform/CMakeFiles/Iotr.dir/Iotr_Initializer.C.o
[ 38%] Building CXX object packages/seacas/libraries/ioss/src/transform/CMakeFiles/Iotr.dir/Iotr_MinMax.C.o
Linking CXX static library libIotr.a
[ 38%] Built target Iotr
Scanning dependencies of target Ionit
[ 38%] Building CXX object packages/seacas/libraries/ioss/src/init/CMakeFiles/Ionit.dir/Ionit_Initializer.C.o
Linking CXX static library libIonit.a
[ 38%] Built target Ionit
Scanning dependencies of target cth_pressure_map
[ 38%] Building CXX object packages/seacas/libraries/ioss/src/main/CMakeFiles/cth_pressure_map.dir/cth_pressure_map.C.o
[ 38%] Building CXX object packages/seacas/libraries/ioss/src/main/CMakeFiles/cth_pressure_map.dir/vector3d.C.o
Linking CXX executable cth_pressure_map
/usr/bin/ld: cannot find -lIoexo_fac
collect2: error: ld returned 1 exit status
make[2]: *** [packages/seacas/libraries/ioss/src/main/cth_pressure_map] Error 1
make[1]: *** [packages/seacas/libraries/ioss/src/main/CMakeFiles/cth_pressure_map.dir/all] Error 2
make: *** [all] Error 2

Still no clue what the Ioexo_fac library is, but I'll have to google it...

ahojukka5 commented 9 years ago

My attempt, with these options (cloned from the GitHub repo):

cmake -DTPL_ENABLE_MPI=ON -DTrilinos_ENABLE_ALL_PACKAGES=ON -DCMAKE_INSTALL_PREFIX=/opt/trilinos -DTrilinos_ASSERT_MISSING_PACKAGES=OFF -DTPL_ENABLE_Matio=OFF -DTPL_ENABLE_GLM=OFF ../trilinos

...

[ 16%] Building C object packages/seacas/libraries/exodus/cbind/CMakeFiles/exodus.dir/src/ex_create_par.c.o
/home/jukka/dev/trilinos/packages/seacas/libraries/exodus/cbind/src/ex_create_par.c:130:24: fatal error: netcdf_par.h: No such file or directory

 #include "netcdf_par.h"
                        ^

compilation terminated.
make[2]: *** [packages/seacas/libraries/exodus/cbind/CMakeFiles/exodus.dir/src/ex_create_par.c.o] Error 1
make[1]: *** [packages/seacas/libraries/exodus/cbind/CMakeFiles/exodus.dir/all] Error 2
make: *** [all] Error 2

ovainola commented 9 years ago

I also cloned from the git repo (https://github.com/trilinos/trilinos).

All the commands I did:

> git clone https://github.com/trilinos/trilinos.git
> mkdir trilinos_build && cd trilinos_build
> cmake -DCMAKE_C_COMPILER=/usr/bin/gcc -DCMAKE_CXX_COMPILER=/usr/bin/g++ \
>   -DCMAKE_Fortran_COMPILER=/usr/bin/gfortran -DTrilinos_ENABLE_ALL_PACKAGES=ON \
>   -DCMAKE_INSTALL_PATH=/home/olli/programming/trilinos_build \
>   -DTPL_ENABLE_Netcdf=OFF -DTrilinos_ASSERT_MISSING_PACKAGES=OFF \
>   /home/olli/programming/trilinos
> make

I was missing this Netcdf library, which I silenced with the flag -DTPL_ENABLE_Netcdf=OFF, so I guess that might be the source of my problem.

ahojukka5 commented 9 years ago

[ 24%] Building CXX object packages/seacas/libraries/ioss/src/main/CMakeFiles/cth_pressure_map.dir/vector3d.C.o
Linking CXX executable cth_pressure_map
/usr/bin/ld: cannot find -lIoexo_fac
collect2: error: ld returned 1 exit status

TeroFrondelius commented 9 years ago

Library worth looking: https://github.com/dpo/MUMPS.jl

ahojukka5 commented 9 years ago

https://github.com/JuliaSparse/Metis.jl

Metis is already wrapped for Julia.

ahojukka5 commented 9 years ago

I'm not a parallel computing expert, so I don't really know how silly this idea is, but after reading the VDolean slides [5] it seems that if we don't really care about convergence speed and just want to distribute the computation, the classical block Jacobi algorithm does the job and might be quite easy to implement (a minimal sketch follows below). There are some implementation examples on Wikipedia, https://en.wikipedia.org/wiki/Jacobi_method. There are also some recent developments that could accelerate the convergence while keeping the method simple (http://phys.org/news/2014-06-19th-century-math-tactic-makeoverand.html, http://engineering.jhu.edu/fsag/wp-content/uploads/sites/23/2013/10/JCP_revised_WebPost.pdf). After a quick read, the proposed method seems especially suitable for elliptic PDEs, but I still have no idea whether the convergence is comparable to "modern" algorithms. Anyway, this could be our naive strategy for distributing the computation to several cores for the first release, unless we have better ideas.

"Simple is better than complex."

TeroFrondelius commented 9 years ago

Here is also some Metis functionality: https://github.com/JuliaGraphs/LightGraphs.jl

TeroFrondelius commented 9 years ago

Maybe this is a helpful article on the topic: http://www.scs.leeds.ac.uk/pkj/Papers/Books/JT98.pdf

TeroFrondelius commented 9 years ago

Here is code starting from page 20: http://www.mcs.anl.gov/petsc/documentation/tutorials/jhouchins_writeup_revised.pdf

TeroFrondelius commented 9 years ago

This is a deal.II paper with 1000-core results: http://www.math.tamu.edu/~bangerth/publications/2010-parsim2010.pdf

TeroFrondelius commented 9 years ago

This seems to be a relevant paper as well: http://p4est.github.io/papers/BangerthBursteddeHeisterEtAl11.pdf

TeroFrondelius commented 9 years ago

This could be relevant: http://tu-dresden.de/die_tu_dresden/fakultaeten/fakultaet_mathematik_und_naturwissenschaften/fachrichtung_mathematik/institute/wir/dissertation/thomas_witkowski

ahojukka5 commented 9 years ago

Morning

I made a simple example describing one possible method which could be used to parallelize solution, see https://github.com/JuliaFEM/JuliaFEM.jl/blob/master/notebooks/2015-06-30-parallel-solution-using-substructuring.ipynb

It's not realistic because it uses matrix inversions, but I think it captures the idea of how it could be done: we first form the boundary system by eliminating the interior nodes and then solve a much smaller system. It's a variant of the approach described in paper [19]. I haven't finished with this yet; it seems that in the paper the author ends up with a more elegant solution.
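For reference, here is a minimal sketch of that elimination step. It is not the notebook's code, just the usual interior/boundary block partitioning written out.

```julia
# Static condensation of one subdomain: eliminate the interior dofs (i) and keep
# only the boundary/interface dofs (b). K = [K_ii K_ib; K_bi K_bb], f = [f_i; f_b].
using LinearAlgebra

function condense(K_ii, K_ib, K_bi, K_bb, f_i, f_b)
    F = lu(K_ii)                      # local factorization, kept for back substitution
    S = K_bb - K_bi * (F \ K_ib)      # Schur complement (interface stiffness)
    g = f_b  - K_bi * (F \ f_i)       # condensed interface load
    return S, g, F
end

# The host assembles the interface problem S_total * u_b = g_total from all
# subdomains, solves it, and each subdomain recovers its interior dofs locally:
recover_interior(F, K_ib, f_i, u_b) = F \ (f_i - K_ib * u_b)
```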

ahojukka5 commented 9 years ago

One very important requirement came to my mind a second ago: do we want to develop a parallel implementation that can later be ported to GPGPU computation easily? I bet that in the next 10 years we will be seeing a lot of GPGPU computation.

http://www.rdi.uwa.edu.au/__data/assets/pdf_file/0005/1524848/Finite-Element-Method.pdf http://web.stanford.edu/group/lavxm/publications/journal-papers/CeLeDa11.pdf

etc..

ahojukka5 commented 9 years ago

Instead of an iterative parallel solver we could also do a direct parallel solver; see [23] for results. Interestingly, it seems that both ABAQUS and Code Aster, which are both capable of solving very large systems, rely on direct solvers as their main solver. A direct solver of course requires more memory, but it can also handle the stiff problems typical in computational mechanics. Recursive / hierarchical multifrontal methods seem to be a very promising approach.

From http://www.mathworks.com/matlabcentral/newsreader/view_thread/170254:

"is to use a factored matrix object for K_bb^-1 (very critical if you want to have 100 000 DOFs). Then, you solve for Kbb^-1*Kba block by block. Then you project the matrices.

You have to realize that 1e5*5000*8/1024^3 = 3.7 GB. If you want that to happen on a 32-bit machine you need to go out of core (like SDT and ANSYS do)."

According to our roadmap we specifically want to solve 100 million DOF using 1000 cores, i.e. 100 000 DOF/core. The question is: how much memory per core do we need then? Anyway, I think that both network speed and the amount of memory per core are increasing quickly in cloud-based clusters.
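A back-of-the-envelope check, applying the rule of thumb from the quote above to the roadmap numbers (the 5 000 interface DOFs per subdomain is only an assumed figure, it depends on the decomposition):

```julia
dofs_per_core   = 100_000
interface_dofs  = 5_000      # assumption; depends on the decomposition
bytes_per_entry = 8          # Float64
coupling_block  = dofs_per_core * interface_dofs * bytes_per_entry / 1024^3
println(coupling_block, " GiB")   # ≈ 3.7 GiB per core just for the dense coupling block
```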

TeroFrondelius commented 9 years ago

I made a MUMPS download request here: http://mumps.enseeiht.fr/index.php?page=dwnld#form

ahojukka5 commented 9 years ago

Let's start investigating more closely the possibility of using MUMPS to parallelize our stuff. At least it seems to work for static solid mechanics cases.

TeroFrondelius commented 9 years ago

I'm looking to get the MUMPS build.jl up to date. Some success already.

TeroFrondelius commented 9 years ago

What I have been thinking is something like this: we would do the domain decomposition as described above and try to use pure Julia parallelism for that. Then each of those areas is solved with the parallel MUMPS direct solver in shared memory. What do you think?

ahojukka5 commented 9 years ago

What do you mean by shared memory? After the decomposition the "boundary parts" are on different machines, which cannot communicate with each other, only with the head node. But yeah, we could try to use Julia's native parallel implementation and try to make our parallel solution so general that it could be used by other projects too.

TeroFrondelius commented 9 years ago

Inside one area you can have shared memory; I mean using all the cores of one machine for the direct solver. In my proposal above the direct solver is MUMPS. Currently I don't know if it works like that, but the idea is to get the best of both worlds: message passing between the area boundaries and OpenMP-style parallelism inside the area.

TeroFrondelius commented 9 years ago

Something very interesting to follow: https://github.com/JuliaParallel/Elemental.jl

See the original C++ version with documentation: https://github.com/elemental/Elemental

ahojukka5 commented 9 years ago

Yes, using OpenMP + MPI sounds reasonable; I think that's called hybrid parallelism. If we want to assemble the global stiffness matrix in shared memory we need to "color" our degrees of freedom so that concurrent element contributions never write to the same entries, which is a bit more complicated than a pure MPI implementation. But not too much more complicated compared to the benefits. A sketch of the idea is below.
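A minimal sketch of colored shared-memory assembly, assuming a hypothetical local_stiffness(element) function that returns the element dof indices and the local matrix (this is not JuliaFEM's API):

```julia
# Elements with the same color touch disjoint dofs, so the threads inside one color
# can add their contributions without locking; colors are processed one after another.
# (For a sparse K the sparsity pattern would have to be pre-allocated so that only
# stored values are updated; a dense K is used here to keep the sketch short.)
using Base.Threads

function assemble_colored!(K::Matrix{Float64}, elements, colors, local_stiffness)
    for c in sort(unique(colors))
        ids = findall(==(c), colors)              # elements of this color
        @threads for k in eachindex(ids)
            dofs, Ke = local_stiffness(elements[ids[k]])
            for (a, ga) in enumerate(dofs), (b, gb) in enumerate(dofs)
                K[ga, gb] += Ke[a, b]             # disjoint writes within one color
            end
        end
    end
    return K
end
```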

ahojukka5 commented 9 years ago

A little update here. It seems that ABAQUS is using sparse Gaussian elimination for direct solving; I added [26], which describes the ideas behind the algorithm. Another little update is that Julia's `\` operator has threading support, it supports sparse matrices, and it's actually quite fast. (To be more specific, it's the `factorize` command which has threading support, http://julia.readthedocs.org/en/latest/stdlib/linalg/.) Solving a 140 000 DOF problem took about 30 seconds and 7.1 GB of memory.
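A small example of that pattern, with a tiny illustrative problem size (the threading comes from the underlying SuiteSparse/BLAS libraries):

```julia
using LinearAlgebra, SparseArrays

n = 1_000                    # bump towards ~140_000 to reproduce the timing above
A = spdiagm(-1 => -ones(n - 1), 0 => fill(2.0, n), 1 => -ones(n - 1))
b = rand(n)

F = factorize(A)             # picks a suitable sparse factorization for A
x = F \ b                    # the factorization can be reused for further right-hand sides
@assert norm(A * x - b) <= 1e-8 * norm(b)
```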

TeroFrondelius commented 9 years ago

Possibly interesting topic at julia-users: https://groups.google.com/forum/#!topic/julia-users/a7uk5Sg_cSs

ahojukka5 commented 9 years ago

http://scicomp.stackexchange.com/questions/540/what-is-the-advantage-of-multigrid-over-domain-decomposition-preconditioners-an/546

TeroFrondelius commented 9 years ago

Viral B Shah - The many ways of parallel computing with Julia: https://www.youtube.com/watch?v=HCcO-715acM&feature=youtu.be

ahojukka5 commented 9 years ago

After studying this subject I think I now have a clear idea for our first attempt. We basically distribute the subdomains to nodes using Julia's parallel implementation, do the LU factorization locally on the clients, fetch the interface problems back to the host, and solve them there directly. This should actually be quite simple to do.

Any ideas how to do the following using Julia's parallel implementation:

  1. store a custom type MyModel on every calculation node, each with different data
  2. manipulate this (local) data somehow and leave the modified result on the node
  3. fetch the results

The key point here is that the data, in this case the factorized matrices, must be preserved on the client machines between operations, because they are needed later for back substitution. One possible pattern is sketched below.
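A sketch of one possible pattern using the standard Distributed primitives; MyModel and its fields here are hypothetical placeholders, not JuliaFEM types:

```julia
using Distributed
addprocs(4)

@everywhere begin
    using LinearAlgebra

    struct MyModel
        K::Matrix{Float64}
        f::Vector{Float64}
    end

    # Worker-local storage: survives between remote calls on the same worker.
    const LOCAL = Dict{Symbol,Any}()

    function setup!(model::MyModel)
        LOCAL[:model]  = model
        LOCAL[:factor] = lu(model.K)   # keep the factorization for later back substitution
        return nothing
    end

    solve_local(rhs) = LOCAL[:factor] \ rhs
end

# Host side: push different data to each worker, then reuse the stored factorizations.
for (i, w) in enumerate(workers())
    K = Matrix{Float64}(I, 3, 3) * i
    remotecall_wait(setup!, w, MyModel(K, ones(3)))
end
results = [remotecall_fetch(solve_local, w, ones(3)) for w in workers()]
```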

TeroFrondelius commented 9 years ago

Maybe relevant discussion: https://groups.google.com/forum/#!topic/julia-users/E8fGIiDwckc

ahojukka5 commented 8 years ago

http://crd-legacy.lbl.gov/~xiaoye/vecpar10.pdf

TeroFrondelius commented 8 years ago

This is now working through PETSc.