erdc / proteus

A computational methods and simulation toolkit
http://proteustoolkit.org
MIT License

new parallel mesh partitioning is not scaling #122

Closed: cekees closed this issue 10 years ago

cekees commented 10 years ago

this sucks, but it's true.

ahmadia commented 10 years ago

At what number of processes is this showing up? Or is it not scaling at a given mesh size?

cekees commented 10 years ago

I saw it at 2048 processes. Since Garnet has < 2 GB per core when running with the maximum 32 cores per node, I've been planning runs based on 1000-2000 mesh vertices per node. Since I was trying to get results, I hadn't submitted any jobs where I reduced the core count for the same mesh and compute node count.


ahmadia commented 10 years ago

And the symptom is "did not complete", not "took a really long time", right?


cekees commented 10 years ago

I'm checking that as well. I have a 2048-core job running right now. It had been "partitioning" since 6 AM. I've only got 8 hours reserved because it's so hard to find any time on the machine right now. The previous level of mesh refinement completed on a smaller core count, so we should have two good data points by the end of today. I can try the lower level of mesh refinement on a higher core count and the higher level of mesh refinement on a lower core count with the same memory resources.


cekees commented 10 years ago

Specifically, the job running now has been going for 3:45 on 2048 cores, and the mesh statistics are:

Mesh points: 4069257
Mesh tetrahedra: 25233022
Mesh faces: 50653507
Mesh edges: 29489740
Mesh boundary faces: 374926
Mesh boundary edges: 2916

So that's a little over the roughly 20M element ceiling that Matthew and I were hitting with memory usage on the old partitioning scheme. I'm guessing we're going to need to switch to hdf5 for the mesh so each proc can just directly pull its chunk.
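
A minimal, hypothetical sketch of what that could look like with an h5py/mpi4py stack (requires a parallel HDF5 build); the file name and the "elements" dataset layout are assumptions for illustration, not anything proteus currently writes:

```python
# Hypothetical sketch: each rank opens the HDF5 mesh collectively and reads
# only its contiguous slice of the element connectivity, instead of every
# rank scanning the whole ASCII tetgen output.  Names are illustrative.
from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD
rank, size = comm.rank, comm.size

with h5py.File("mesh.h5", "r", driver="mpio", comm=comm) as f:
    elements = f["elements"]                 # assumed shape: (nElements_global, 4)
    nElements_global = elements.shape[0]
    # simple block decomposition of the global element range
    chunk = nElements_global // size
    start = rank * chunk
    end = nElements_global if rank == size - 1 else start + chunk
    # hyperslab read: only this rank's rows come off disk
    elementNodes_local = elements[start:end, :]
```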


ahmadia commented 10 years ago

I'm guessing we're going to need to switch to hdf5 for the mesh so each proc can just directly pull its chunk.

How does it work now?

cekees commented 10 years ago

It reads the tetgen ASCII output files directly. Each core opens the files and reads them line by line, storing information in arrays in memory only if it's needed for the default (dumb) partitioning.

https://github.com/erdc-cm/proteus/blob/master/src/flcbdfWrappersModule.cpp#L2773
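
A rough Python sketch of that read pattern (the real code is the C++ linked above; the block-partition logic and names here are illustrative only):

```python
# Illustrative only: every rank opens the same tetgen .ele file and walks it
# line by line, keeping an element's connectivity only if the naive initial
# partition assigns that element to this rank.
def read_owned_elements(ele_filename, rank, size):
    owned = {}
    with open(ele_filename) as f:
        # tetgen .ele header: <#elements> <nodes per element> <#attributes>
        nElements, nodesPerElement, nAttributes = map(int, f.readline().split()[:3])
        # naive block partition of the global element numbering
        chunk = nElements // size
        start = rank * chunk
        end = nElements if rank == size - 1 else start + chunk
        for line in f:
            fields = line.split()
            if not fields or fields[0].startswith("#"):
                continue
            eN = int(fields[0])
            if start <= eN < end:
                owned[eN] = [int(n) for n in fields[1:1 + nodesPerElement]]
    return owned
```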


cekees commented 10 years ago

I appended some PETSc profiling information in commit aa1f13b3ca12. I'm doing runs with the latest commit on that branch, which has finer-grained profiling in the slow section.
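
For reference, a sketch of the kind of log-stage instrumentation involved, shown with petsc4py for brevity (the stage names here are illustrative, not taken from that commit):

```python
# Illustrative petsc4py log stages; run with -log_view (-log_summary on
# older PETSc) to see per-stage times in the profiling output.
import sys
import petsc4py
petsc4py.init(sys.argv)
from petsc4py import PETSc

readStage = PETSc.Log.Stage("readMesh")
partitionStage = PETSc.Log.Stage("partitionNodes")

readStage.push()
# ... read the tetgen mesh files ...
readStage.pop()

partitionStage.push()
# ... compute ownership sets and redistribute mesh entities ...
partitionStage.pop()
```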

cekees commented 10 years ago

Here's a more detailed profile for a smaller run on 8 cores:

https://gist.github.com/cekees/a7007a4b37189526382c

jedbrown commented 10 years ago

I haven't been following this thread, but are you using @knepley's partitioning or something custom? Are you testing with a real parallel file system and using MPI-IO collective reads?

knepley commented 10 years ago

Nope, that is something different. I think you can get to 1000s of procs using the strategy we use in PyLith: distribute a coarse mesh and regularly refine it in parallel. For anything above that, you need to naively partition, distribute, and repartition in parallel.

jedbrown commented 10 years ago

Few engineering meshes are a refinement of a coarser mesh.

knepley commented 10 years ago

I am not sure I agree with that sweeping statement; however, in the case that you cannot use this, you are stuck with the second strategy.


cekees commented 10 years ago

Hey, thanks for the help. I'm already using the naive-partition-then-repartition-in-parallel approach, so it's no big deal. It's from one of the PETSc examples that bsmith wrote, but I don't think it's included as an example in recent versions of PETSc, or I'd give you the number. That example uses a global bitmap to assign ownership to each mesh entity and then passes the updated bitmap on to the next rank, so it's inherently unscalable, but I thought it would be OK for the size of domains we're looking at.
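
A tiny mpi4py sketch of that mark-and-pass scheme as described (not the actual PETSc example code): the bitmap hop from rank to rank serializes the whole ownership computation, which is the source of the unscalability.

```python
# Sketch of the described mark-and-pass ownership scheme: a global bitmask
# travels rank 0 -> 1 -> ... -> size-1; each rank claims the unmarked
# entities it wants, then forwards the updated mask.  Inherently serial.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.rank, comm.size
nEntities_global = 1000                      # illustrative size
owned = np.zeros(nEntities_global, dtype=np.int8)

def candidate_entities(rank, size, n):
    # illustrative: the block of entity numbers this rank would try to claim
    chunk = n // size
    start = rank * chunk
    end = n if rank == size - 1 else start + chunk
    return range(start, end)

if rank > 0:
    comm.Recv(owned, source=rank - 1)        # wait for the previous rank's mask
for eN in candidate_entities(rank, size, nEntities_global):
    if not owned[eN]:
        owned[eN] = 1                        # claim this entity
if rank < size - 1:
    comm.Send(owned, dest=rank + 1)          # pass the updated mask along
```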

Because that part of the algorithm just builds ownership sets, I'm trying something fairly simple to compute them instead of the mark-and-pass approach: take the average of an entity's new node numbers and have a processor take ownership of the entity if that average falls within its range of owned nodes.
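
A minimal sketch of that averaging rule (illustrative; the names are not from the proteus code):

```python
# An entity (element, face, or edge) is claimed by this rank if the mean of
# its new (post-partition) node numbers lands in the contiguous block of
# node numbers this rank owns.
def owns_entity(entity_node_numbers, owned_node_start, owned_node_end):
    avg = sum(entity_node_numbers) / float(len(entity_node_numbers))
    return owned_node_start <= avg < owned_node_end
```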

cekees commented 10 years ago

Also, I'm not using MPI-IO collective reads. The file system is Lustre.
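
For context, a collective read would look something like this hypothetical mpi4py sketch; it assumes a flat binary mesh file and a per-rank byte-range split, neither of which exists in proteus today:

```python
# Hypothetical MPI-IO collective read: all ranks open the same file and each
# pulls a contiguous byte range in one collective call.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
fh = MPI.File.Open(comm, "mesh.ele.bin", MPI.MODE_RDONLY)
filesize = fh.Get_size()
blocksize = filesize // comm.size
offset = comm.rank * blocksize
if comm.rank == comm.size - 1:              # last rank takes the remainder
    blocksize = filesize - offset
buf = np.empty(blocksize, dtype=np.uint8)
fh.Read_at_all(offset, buf)                 # collective: all ranks participate
fh.Close()
```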

cekees commented 10 years ago

OK, replacing the communication of mesh entity ownership with a local ownership calculation seems to have resolved this issue. The stages that populated and sent/received the global bit array were not scaling. The only thing that has changed is that ownership of elements, faces, and edges is now set based on the (already computed) ownership of the "median" node in the entity, where the median for even-length entities is just the lower of the two middle entries (e.g., 1 is the median in each of [1,2], [0,1,3], and [0,1,3,4]). Here's a profile for the 25M-tet mesh above running on 512 cores.

https://gist.github.com/cekees/b70f826ae2bf32769c93
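
A tiny sketch of the median rule described above (illustrative only):

```python
# For an entity's new node numbers, take the lower-middle entry of the
# sorted list; the entity is owned by whichever rank owns that node.
def median_node(entity_node_numbers):
    s = sorted(entity_node_numbers)
    return s[(len(s) - 1) // 2]    # lower of the two middle entries when even

assert median_node([1, 2]) == 1
assert median_node([0, 1, 3]) == 1
assert median_node([0, 1, 3, 4]) == 1
```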