design 2D experiments - Githubissues

davydden commented 6 years ago

Once I clean up https://github.com/davydden/large-strain-matrix-free/pull/34, I will start doing some calculations.

The question to you, @masterleinad and @jppelteret, is how we want to design the meshes for this study. We probably want to plot time/dof/core as well as total time of MF vs MB. In https://github.com/CEED/Forum/issues/1#issuecomment-408285040 they did memory calculations based on a more or less constant number of DoFs (3e6). Do we want to do the same: start with, say, elements of degree p=8, see which coarse mesh gets us to this number, and then go down to p=1 while adding global mesh refinements to keep the total DoFs around the same mark?
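To make the constant-DoF idea concrete, here is a minimal sketch that picks, for each degree, the number of global refinements that lands closest to a 3e6 target. It assumes a square 2D domain meshed from a single coarse cell with a continuous vector-valued Q_p space (2 displacement components); the actual coarse mesh of the study will differ, so `n_dofs` and `refinements_for_target` are hypothetical helpers, not code from the repository.

```python
import math


def n_dofs(p, refinements, dim=2, coarse_cells_per_dir=1):
    """DoFs of a continuous, vector-valued (dim components) Q_p space on a box
    of coarse_cells_per_dir^dim cells after the given number of global refinements."""
    cells_per_dir = coarse_cells_per_dir * 2 ** refinements
    return dim * (p * cells_per_dir + 1) ** dim


def refinements_for_target(p, target, dim=2, max_refinements=20):
    """Refinement level whose DoF count is closest to the target on a log scale."""
    return min(
        range(max_refinements + 1),
        key=lambda r: abs(math.log(n_dofs(p, r, dim) / target)),
    )


if __name__ == "__main__":
    target = 3e6  # DoF target mentioned in the CEED forum comment
    for p in range(8, 0, -1):
        r = refinements_for_target(p, target)
        print(f"p={p}  refinements={r}  DoFs={n_dofs(p, r)}")
```

Because each global refinement multiplies the 2D DoF count by roughly four, the counts can only be matched to within about a factor of two across degrees, which is probably what "around the same mark" has to mean in practice.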

I plan to run it on 1 node (2 x Xeon 2660v2 Ivy Bridge, 25 MB shared cache per chip and 64 GB of RAM) with 20 MPI processes and without TBB, so that the MF/MB comparison is fairer. That's the emmy cluster here in Erlangen.

I would assume we want the number of DoFs to be more or less constant and make sure the sparse matrix never fits into the L3 cache.
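A rough way to check the cache condition is to bound the sparse matrix size from below. The sketch below assumes CSR-like storage (8-byte values, 4-byte column indices, 8-byte row pointers), that all displacement components couple, and that constrained rows are ignored; the function names are hypothetical. Each DoF then couples at least with all DoFs of one cell, i.e. `dim * (p+1)^dim` entries per row.

```python
L3_BYTES_PER_SOCKET = 25 * 1024 ** 2  # 25 MB shared cache per chip, as stated above
N_SOCKETS = 2


def min_entries_per_row(p, dim=2):
    """Lower bound on nonzeros per row: a DoF couples at least with all
    dim * (p+1)^dim DoFs of one cell it belongs to."""
    return dim * (p + 1) ** dim


def csr_bytes_lower_bound(n_dofs, p, dim=2, value_bytes=8, index_bytes=4,
                          row_ptr_bytes=8):
    """Lower bound on the memory of a CSR matrix with n_dofs rows."""
    nnz = n_dofs * min_entries_per_row(p, dim)
    return nnz * (value_bytes + index_bytes) + (n_dofs + 1) * row_ptr_bytes


def exceeds_l3(n_dofs, p, dim=2):
    """True if even the lower bound is larger than the L3 cache of both sockets."""
    return csr_bytes_lower_bound(n_dofs, p, dim) > N_SOCKETS * L3_BYTES_PER_SOCKET


if __name__ == "__main__":
    for p in (1, 2, 4, 8):
        mb = csr_bytes_lower_bound(3_000_000, p) / 1024 ** 2
        print(f"p={p}: at least {mb:.0f} MB, exceeds L3: {exceeds_l3(3_000_000, p)}")
```

Under these assumptions a problem with about 3e6 DoFs is far outside the combined L3 cache even for p=1, so the condition mainly matters if smaller problems are added to the study.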

One of our students did such studies for small strain:

> Runtime comparisons of the Cook membrane example were carried out for three to five global refinement steps and finite element degrees of one to four. This resulted in problem sizes ranging from 2,187 DoFs (FE-degree one, three refinements) to 6,440,067 DoFs (FE-degree four, five refinements).

For the 3D case of a head model he went up to degree p=3,

> resulting in a problem with 9,143,718 DoFs
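As a sanity check, the two Cook membrane counts quoted above are reproduced by a simple node-counting formula. This is an inference from the numbers only: it assumes a 3D mesh refined from a single coarse cell and three displacement components per support point, neither of which is stated in the quote.

```python
def n_dofs(p, refinements, dim=3, components=3):
    """DoFs of a continuous Q_p vector field on one coarse cell
    after the given number of global refinements (assumed setup)."""
    nodes_per_dir = p * 2 ** refinements + 1
    return components * nodes_per_dir ** dim


if __name__ == "__main__":
    print(n_dofs(1, 3))  # degree one, three refinements
    print(n_dofs(4, 5))  # degree four, five refinements
```

With these assumptions the formula gives 3 * 9^3 = 2,187 and 3 * 129^3 = 6,440,067, matching the quoted range.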

likwid-topology -g gives the following ASCII picture of the L1-L3 caches:

$ likwid-topology -g
--------------------------------------------------------------------------------
CPU name:   Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
CPU type:   Intel Xeon IvyBridge EN/EP/EX processor
CPU stepping:   4
********************************************************************************
Hardware Thread Topology
********************************************************************************
Sockets:        2
Cores per socket:   10
Threads per core:   2
....
********************************************************************************
Graphical Topology
********************************************************************************
Socket 0:
+---------------------------------------------------------------------------------------------------------------+
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| |  0 20  | |  1 21  | |  2 22  | |  3 23  | |  4 24  | |  5 25  | |  6 26  | |  7 27  | |  8 28  | |  9 29  | |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| +-----------------------------------------------------------------------------------------------------------+ |
| |                                                   25 MB                                                   | |
| +-----------------------------------------------------------------------------------------------------------+ |
+---------------------------------------------------------------------------------------------------------------+
Socket 1:
+---------------------------------------------------------------------------------------------------------------+
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| | 10 30  | | 11 31  | | 12 32  | | 13 33  | | 14 34  | | 15 35  | | 16 36  | | 17 37  | | 18 38  | | 19 39  | |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| +-----------------------------------------------------------------------------------------------------------+ |
| |                                                   25 MB                                                   | |
| +-----------------------------------------------------------------------------------------------------------+ |
+---------------------------------------------------------------------------------------------------------------+

masterleinad commented 6 years ago

I could offer some Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz and Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz nodes. Unfortunately, we don't have anything more than AVX2. With respect to https://github.com/CEED/Forum/issues/1 single-node performance is more important than MPI-scalability.

We would probably also see more gain from MatrixFree in the 3D case. So we should definitely do that as well.

> We probably want to plot time/dof/core as well as total time of MF vs MB.

That sounds reasonable. We should be able to produce graphs similar to those in http://www.sppexa.de/fileadmin/user_upload/EXADG.pdf (slide 7). I would definitely try to test with problems as large as we can so they still fit on a node.

davydden commented 6 years ago

> I would definitely try to test with problems as large as we can so they still fit on a node.

OK, then let me start working from that side with, say, degree p=8.

> I could offer some Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz and Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz nodes.

Great! I will ping you as soon as I have some input files ready.

> Unfortunately, we don't have anything more than AVX2.

Same here.

> We would probably also see more gain from MatrixFree in the 3D case. So we should definitely do that as well.

Absolutely. For now I just wanted to start playing with 2D. Then we would need a 3D counterpart of https://github.com/davydden/large-strain-matrix-free/pull/32

masterleinad commented 6 years ago

> For now I just wanted to start playing with 2D. Then we would need a 3D counterpart of #32

I guess you want spherical and not cylindrical holes?

davydden commented 6 years ago

> I guess you want spherical and not cylindrical holes?

We can probably just extrude the 2D mesh and not bother with spherical holes/inclusions. @jppelteret, what do you say?

jppelteret commented 6 years ago

If it's only for benchmarking, then I think that 3D cylindrical inclusions may be OK. To add an interesting material response without resorting to a more complex mesh, we could take each cylindrical extrusion and change the material ID over a part of it, so that each cylindrical particle does not extend the whole way through the domain.
