ddemidov / amgcl

C++ library for solving large sparse linear systems with algebraic multigrid method
http://amgcl.readthedocs.org/
MIT License

Question: Using amgcl with MPI #71

Closed. NOhs closed this issue 3 years ago.

NOhs commented 6 years ago

So I am reading through the distributed Poisson equation benchmark implementation to figure out how to run my linear elasticity problem with amgcl on several MPI nodes. I still have some questions I was hoping you might answer:

ddemidov commented 6 years ago

I do need to find time to provide better documentation for the MPI functionality.

How do I split up my matrix on each node, assuming I have hexahedral elements? Do I create smaller matrices for sub-volumes of the full image?

You should use a library like Metis or Scotch to partition your system. The matrix then needs to be reordered according to the partitioning. AMGCL assumes that the system matrix has already been reordered, and that each MPI domain owns a contiguous chunk of matrix rows. The column numbers in the matrix are global (each row belongs completely to a single MPI process).

Do I have one node in each direction of overlap?

I am sorry, I do not understand this question.

Are the matrices on each node stored with local indices, or do I store the global index for each element?

Global column numbers are used.

Do I need to tell amgcl the overlap somewhere? I couldn't find anything hinting at that.

Each MPI process sends its own chunk of matrix rows to AMGCL. Since AMGCL assumes the matrix has been reordered already, it can deduce the global matrix size. Not sure though if this was the question?

Is it also possible for the MPI version to use my custom matrix class as input (which is working for the shared memory version)?

Yes, as long as the custom matrix class returns rows according to the above points.
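To make the calling sequence concrete, here is a minimal sketch of what each MPI process would do with its chunk of rows. The names (builtin backend, smoothed aggregation, SPAI0 relaxation, BiCGStab) follow the amgcl MPI examples; treat this as an outline and check the headers against your amgcl version:

#include <cstddef>
#include <mpi.h>
#include <tuple>
#include <vector>

#include <amgcl/backend/builtin.hpp>
#include <amgcl/adapter/crs_tuple.hpp>
#include <amgcl/mpi/distributed_matrix.hpp>
#include <amgcl/mpi/make_solver.hpp>
#include <amgcl/mpi/amg.hpp>
#include <amgcl/mpi/coarsening/smoothed_aggregation.hpp>
#include <amgcl/mpi/relaxation/spai0.hpp>
#include <amgcl/mpi/solver/bicgstab.hpp>

typedef amgcl::backend::builtin<double> Backend;

typedef amgcl::mpi::make_solver<
    amgcl::mpi::amg<
        Backend,
        amgcl::mpi::coarsening::smoothed_aggregation<Backend>,
        amgcl::mpi::relaxation::spai0<Backend>
        >,
    amgcl::mpi::solver::bicgstab<Backend>
    > Solver;

// Each rank calls this with its own rows only. The CSR arrays ptr/col/val
// describe the local chunk of the (already renumbered) matrix, `chunk` is
// the number of rows owned by this rank, and the column indices in `col`
// are global.
void solve_local_chunk(
        MPI_Comm mpi_comm, ptrdiff_t chunk,
        std::vector<ptrdiff_t> const &ptr,
        std::vector<ptrdiff_t> const &col,
        std::vector<double>    const &val,
        std::vector<double>    const &rhs, // local part of the rhs
        std::vector<double>          &x)   // local part of the solution
{
    amgcl::mpi::communicator comm(mpi_comm);

    // amgcl deduces the global problem size from the chunk sizes.
    Solver solve(comm, std::tie(chunk, ptr, col, val));

    int    iters;
    double error;
    std::tie(iters, error) = solve(rhs, x);
}

A custom matrix class can take the place of the std::tie() adapter here, as long as it exposes its rows in the same way.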

NOhs commented 6 years ago

I see, I'll have a look at it. Thank you!

NOhs commented 6 years ago

Ok, so using a partitioner seems like overkill for the simple domain I want to partition (a cube of hexahedral elements). Let's say I have divided my domain, that is, each dof of my system belongs to one sub-domain, except for the ones at the border between sub-domains, which belong to all sub-domains sharing that border. Now I have created a sparse matrix for each MPI node. Each MPI node owns one sub-domain, so its sparse matrix holds the rows of every dof belonging to that node's sub-domain. That means certain rows appear on multiple MPI nodes. The column numbers stored in the sparse matrix are the global ones, i.e. they correspond to the case without MPI.

My question: you said that AMGCL assumes the matrix has already been reordered. So what else do I have to do? Can I distribute my partitions to the different MPI ranks arbitrarily?

Thanks for your help so far. Even though I am asking a lot of questions, I am really impressed by how much complexity AMGCL hides from me; e.g. switching between OpenMP and OpenCL is trivial!

ddemidov commented 6 years ago

Maybe a simple example would help here. Let's say you have a 4x4 global matrix:

|  2 -1  0  0 |
| -1  2 -1  0 |
|  0 -1  2 -1 |
|  0  0 -1  2 |

If you have two MPI processes, then the matrix would be split between them as follows:

Process 0

|  2 -1  0  0 |
| -1  2 -1  0 |

Process 1

|  0 -1  2 -1 |
|  0  0 -1  2 |

Now, let's say you have a 4x4 2D grid, on which you want to solve a Poisson problem. The grid points (dofs) are numbered like this originally:

 0  1  2  3
 4  5  6  7
 8  9 10 11
12 13 14 15

If you have 4 MPI domains, you could partition the problem as

 0  1 |  2  3
 4  5 |  6  7
--------------
 8  9 | 10 11
12 13 | 14 15

and then renumber the dofs so that each block has a contiguous chunk of dofs:

 0  1 |  4  5
 2  3 |  6  7
--------------
 8  9 | 12 13
10 11 | 14 15

The parts of the finite-difference-discretized 16x16 matrix owned by each of the MPI processes may then be written as follows (sparse notation is used, where (c: v) means there is a nonzero value v at column c):

Process 0

row 0: (0: 4) (1: -1) (2: -1)
row 1: (0: -1) (1: 4) (4: -1)* (3: -1)
row 2: (0: -1) (2: 4) (3: -1) (8: -1)*
row 3: (1: -1) (2: -1) (3: 4) (6: -1)* (9: -1)*

Process 1

row 0: (1: -1)* (4: 4) (5: -1) (6: -1)
row 1: (4: -1) (5: 4) (7: -1)
row 2: (4: -1) (3: -1)* (6: 4) (7: -1) (12: -1)*
row 3: (5: -1) (6: -1) (7: 4) (13: -1)*

Process 2

row 0: (2: -1)* (8: 4) (9: -1) (10: -1)
row 1: (3: -1)* (8: -1) (9: 4) (12: -1)* (11: -1)
row 2: (8: -1) (10: 4) (11: -1)
row 3: (9: -1) (10: -1) (11: 4) (14: -1)*

Process 3

row 0: (6: -1)* (9: -1)* (12: 4) (13: -1) (14: -1)
row 1: (7: -1)* (12: -1) (13: 4) (15: -1)
row 2: (12: -1) (11: -1)* (14: 4) (15: -1)
row 3: (13: -1) (14: -1) (15: 4)

And this is exactly how you would feed this matrix to amgcl. Stars here mean that a column belongs to a remote process, but the nonzero matrix value is still stored on the process that owns the row.
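For example, the four rows owned by process 0 above could be stored in CSR form like this (a sketch reusing the Solver typedef and headers from my earlier comment; only the locally owned rows are stored, while the column numbers stay global):

// Process 0 of the "4 MPI domains" example, in CSR form (columns sorted
// per row).
std::vector<ptrdiff_t> ptr = {0, 3, 7, 11, 16};
std::vector<ptrdiff_t> col = {
    0, 1, 2,          // row 0: (0: 4) (1: -1) (2: -1)
    0, 1, 3, 4,       // row 1: (0: -1) (1: 4) (3: -1) (4: -1)*
    0, 2, 3, 8,       // row 2: (0: -1) (2: 4) (3: -1) (8: -1)*
    1, 2, 3, 6, 9     // row 3: (1: -1) (2: -1) (3: 4) (6: -1)* (9: -1)*
};
std::vector<double> val = {
     4, -1, -1,
    -1,  4, -1, -1,
    -1,  4, -1, -1,
    -1, -1,  4, -1, -1
};
ptrdiff_t chunk = 4; // number of rows owned by this rank

amgcl::mpi::communicator comm(MPI_COMM_WORLD);
Solver solve(comm, std::tie(chunk, ptr, col, val));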

I hope this helps.

NOhs commented 6 years ago

Wow, thank you so much! I think everything is clear now. I think this answer could go directly into the documentation; it really explains all the aspects needed to use amgcl with MPI.

klausbu commented 6 years ago

Please include a best-practice example of how to distribute a global NxN matrix available in CSR format to the individual MPI processes when you update the manual.

What is the related multi-node backend concept? Hybrid MPI (nodes) + OpenMP (cores/GPUs of each node), or MPI managing all cores and/or GPUs across all nodes?

ddemidov commented 6 years ago

Thank you for reminding me I need to finish the new documentation :).

Re multinode configuration:

With the builtin (OpenMP) backend you can use either a hybrid (MPI+OpenMP) or a uniform (MPI-only, with export OMP_NUM_THREADS=1) setup. From experience, the MPI-only setup scales slightly better. There is some data supporting this in the unfinished version of the benchmarks page here: https://amgcl.readthedocs.io/en/update-docs/benchmarks.html#id1

When using GPU backends, it is normal to use one MPI process per GPU. The VexCL backend supports using multiple GPUs per MPI process, but again, I think scalability will suffer in this case.

klausbu commented 5 years ago

What do you mean by "reordering"?

My starting point is the global matrix in (multi-section) COO format (this has nothing to do with the decomposition; it is just not stored as a whole), and I am wondering how to pass it to multiple MPI processes.

ddemidov commented 5 years ago

Take a look at https://github.com/ddemidov/amgcl/issues/71#issuecomment-386783451 above, specifically at the "4 MPI domains" example. Reordering or renumbering makes sure that each of your MPI domains owns a contiguous set of unknown indices. There is a good chance your assembly process already satisfies this condition, so you don't have to do anything.
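If you do need to renumber, a plain helper like the following (my own sketch, not part of amgcl) shows the idea: given a partition vector that assigns each global unknown to a rank, it produces new global indices such that every rank owns a contiguous range:

#include <cstddef>
#include <vector>

// part[i] = rank that owns global dof i (e.g. as returned by a partitioner).
// Returns perm[i] = new global index of dof i; rank 0 gets the first
// contiguous range of indices, rank 1 the next one, and so on.
std::vector<ptrdiff_t> renumber(std::vector<int> const &part, int nranks) {
    ptrdiff_t n = part.size();

    // Count the dofs owned by each rank and turn the counts into offsets.
    std::vector<ptrdiff_t> start(nranks + 1, 0);
    for (ptrdiff_t i = 0; i < n; ++i) ++start[part[i] + 1];
    for (int r = 0; r < nranks; ++r) start[r + 1] += start[r];

    // Hand out new indices in rank order.
    std::vector<ptrdiff_t> perm(n);
    for (ptrdiff_t i = 0; i < n; ++i) perm[i] = start[part[i]]++;
    return perm;
}

The matrix rows and columns (and the rhs) then need to be permuted with perm before the local chunks are passed to amgcl.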

zhuminjie commented 4 years ago

What if the number of rows is not a multiple of the number of processors? Then what is the local size? Thank you!

zhuminjie commented 4 years ago

Say I have 15 rows and 4 processors; how many rows does each processor get?

ddemidov commented 4 years ago

Usually, that decision is left to a decomposition library like Metis or Scotch. But in general, you try to split the problem as evenly as possible, so maybe 3 + 3 + 3 + 4?

Of course, when you have just 15 rows in the matrix, it does not make sense to use MPI (and when you have more, the differences between the numbers of rows in each domain become relatively negligible).
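For what it's worth, a trivial way to compute such a split (plain arithmetic, nothing amgcl-specific) is to give rank r the row range [n*r/nproc, n*(r+1)/nproc); for n = 15 and nproc = 4 this yields chunks of 3, 4, 4 and 4 rows, which is just as balanced as 3 + 3 + 3 + 4:

#include <cstddef>

// Contiguous row range [row_beg, row_end) owned by rank r when n rows are
// split as evenly as possible over nproc ranks.
void local_row_range(ptrdiff_t n, int nproc, int r,
                     ptrdiff_t &row_beg, ptrdiff_t &row_end)
{
    row_beg = n *  r      / nproc;
    row_end = n * (r + 1) / nproc;
}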

zhuminjie commented 4 years ago

Thank you for the quick response! What concerns me is that if, for example, the problem is split into 3+3+3+4, I have to fill the first 3 rows on processor 0, then the next 3 on processor 1, and so on. Does that mean I have to know the exact splitting? Otherwise I may place a row on the wrong processor.

The same question can be asked another way: how does amgcl decide which rows belong to which processors? I hope this makes sense to you. Thank you so much!

ddemidov commented 4 years ago

You, as the amgcl user, are the one responsible for the splitting (as I said, this task is usually delegated to a specialized library, but that happens outside of amgcl in any case). On each MPI process, you send to amgcl only the rows that belong to that process; amgcl does not have access to the full matrix.

zhuminjie commented 4 years ago

Yes, thank you! I understand that amgcl is not responsible for the splitting. My question is whether amgcl assumes the rows on different processors are contiguous. By that I mean: if processor 0 has 10 rows, processor 1 has 8 rows, and processor 2 has 9 rows, then amgcl assumes the full matrix has 27 rows, with the first 10 on P0, the next 8 on P1, and the last 9 on P2, regardless of how many rows each processor has. Is that correct?

ddemidov commented 4 years ago

Yes, that is correct: amgcl assumes that the rows given to it are contiguous and, when stacked together, form the full matrix. In order to satisfy this requirement, you have to apply the renumbering procedure (described in the comment above) to the original matrix.
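If it helps, each rank can compute the global index of its first row from the local row counts alone, for example with MPI_Exscan (a small sketch, not an amgcl call):

#include <mpi.h>

// Returns the global index of this rank's first row, assuming the chunks
// are stacked in rank order; local row i then corresponds to global row
// row_start + i.
long long global_row_start(MPI_Comm comm, long long local_rows) {
    long long row_start = 0;
    MPI_Exscan(&local_rows, &row_start, 1, MPI_LONG_LONG, MPI_SUM, comm);

    // MPI_Exscan leaves the result on rank 0 undefined; rank 0 starts at 0.
    int rank;
    MPI_Comm_rank(comm, &rank);
    if (rank == 0) row_start = 0;

    return row_start;
}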

zhuminjie commented 4 years ago

That's cool! I was wondering how amgcl knows the row ids. I got the renumbering part; just make sure the numbering goes from P0 through Pn-1. It may require some calls to move rows around. There are no shared rows, right? All processors should own their rows exclusively.

zhuminjie commented 4 years ago

Thank you for all your work! I have tested it with my problems; amgcl is much faster than PETSc.

ddemidov commented 4 years ago

I was wondering how amgcl knows the row ids.

https://github.com/ddemidov/amgcl/blob/461a66ce6d197a3816218bf94ffd114a367c1ef1/amgcl/mpi/distributed_matrix.hpp#L334-L341

It may require some calls to move rows around

This function in examples/mpi/mpi_amg.cpp does the partitioning and renumbering; you can peek into it (or just use the code). It takes the rows of the original matrix (contiguous chunks of rows), calls the partitioning library to obtain a better partitioning (parmetis or scotch, depending on what is available during compilation and on the user's choice), and returns the renumbered matrix (replacing the rhs vector with the renumbered one as well):

https://github.com/ddemidov/amgcl/blob/461a66ce6d197a3816218bf94ffd114a367c1ef1/examples/mpi/mpi_amg.cpp#L188-L231

zhuminjie commented 4 years ago

Sounds good! May I ask one more question: what's the difference between solve_block and solve_scalar in mpi_amg.cpp?

Currently, I am using the examples from the benchmark repo.

ddemidov commented 4 years ago

solve_block solves systems where the matrix has a block structure (such as elasticity problems).
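Roughly, the block version treats the matrix values as small dense BxB blocks (e.g. 3x3 for 3D elasticity, where every node carries three displacement dofs). The typedefs below are a sketch based on the shared-memory setup for structural problems, not a copy of mpi_amg.cpp, so check them against the code:

#include <amgcl/backend/builtin.hpp>
#include <amgcl/value_type/static_matrix.hpp>
#include <amgcl/adapter/block_matrix.hpp>

typedef amgcl::static_matrix<double, 3, 3> mat3; // 3x3 matrix value blocks
typedef amgcl::static_matrix<double, 3, 1> vec3; // 3x1 rhs/solution blocks
typedef amgcl::backend::builtin<mat3> BlockBackend;

// A scalar CSR matrix with the dofs of each node grouped together can be
// viewed as a block matrix without copying, e.g.:
//   auto Ab = amgcl::adapter::block_matrix<mat3>(std::tie(n, ptr, col, val));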

ddemidov commented 3 years ago

See https://amgcl.readthedocs.io/en/latest/tutorial.html