adamantine-sim / adamantine

Software to simulate heat transfer for additive manufacturing
https://adamantine-sim.github.io/adamantine/

Hourglass demonstration case fails on multiple MPI processes #277

Open stvdwtt opened 2 months ago

stvdwtt commented 2 months ago

The hourglass demonstration case hits a divide-by-zero error when run on multiple MPI processes. Evidently this has been a problem for a while and is independent of the problem in #278.

Screenshot of the error from @AshGannon: image

masterleinad commented 2 months ago

Does it also fail when running with only one MPI process?

Rombur commented 2 months ago

No, it doesn't.

masterleinad commented 2 months ago

The last deal.II function is https://github.com/dealii/dealii/blob/e9eb5ab491aab6b0e57e9b552a4e5d64e20077a6/source/base/mpi_compute_index_owner_internal.cc#L432-L454.

So owned_indices.size() is likely zero, which is only checked in Debug mode.

Rombur commented 2 months ago

Yes, the error message is misleading, but Ashley doesn't have a debug version of the code.

stvdwtt commented 2 months ago

Ashley and I sorted out the source of this problem this morning (and I helped Ashley build a debug version). The problem is that there is no substrate for the hourglass print and so at the start of the simulation there are no activated cells. This turns out to be fine in serial, but at some point for multiple MPI processes there's a division by the number of DOFs.

There is nothing wrong with the code; this is just an odd use case. My plan is to add a check so that adamantine will fail gracefully if this happens. I don't expect users to purposefully run simulations with no active elements initially, but I can see this happening accidentally (e.g. a user sets the material_height parameter incorrectly).
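A minimal sketch of what such a check could look like, assuming the number of active cells has already been summed over all MPI ranks. The function name check_initial_activation and its parameter are hypothetical and are not part of adamantine's API; the real check may live elsewhere and be worded differently.

```cpp
#include <stdexcept>

// Hypothetical helper, not part of adamantine: reject a setup in which no
// cell is active when the simulation starts.
// Assumption: n_global_active_cells is the count summed over all MPI ranks,
// so every rank takes the same branch and no rank is left waiting.
inline void check_initial_activation(unsigned long long n_global_active_cells)
{
  if (n_global_active_cells == 0)
  {
    throw std::runtime_error(
        "No cells are active at the start of the simulation. Check that the "
        "geometry includes a substrate and that material_height is set "
        "correctly.");
  }
}
```

Throwing with a message that names material_height points the user at the most likely cause instead of surfacing a divide-by-zero from deep inside the partitioning code.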

Rombur commented 2 months ago

> but at some point for multiple MPI processes there's a division by the number of DOFs.

It's probably because we partition the mesh in such a way that each processor gets the same number of DOFs. Can you try removing these lines https://github.com/adamantine-sim/adamantine/blob/8c28e5937fcd727bbb358580ed9b88e8fa5da3d4/source/ThermalPhysics.templates.hh#L276-L278 and tell me if that fixes the issue? Without this function, the partitioning will ignore the number of DOFs. If that fixes the issue, we could check whether the number of DOFs is greater than zero to decide which type of load balancing to use.
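The idea of switching load-balancing strategy on the DOF count could be sketched as follows. This is plain C++ for illustration only: partition_weights is a hypothetical function, and the sketch does not use the deal.II cell-weighting interface that the linked lines rely on. It also assumes all cells are visible in one vector; in an MPI run the total would have to be a global sum.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical sketch, not adamantine or deal.II code: choose the per-cell
// weights used to balance the partition.
// If at least one cell carries DOFs, weight each cell by its DOF count.
// If no cell carries any DOF, fall back to uniform weights so that the
// partitioner never has to normalize by a total of zero.
inline std::vector<unsigned int>
partition_weights(std::vector<unsigned int> const &dofs_per_cell)
{
  unsigned long long const total_dofs = std::accumulate(
      dofs_per_cell.begin(), dofs_per_cell.end(), 0ull);

  if (total_dofs > 0)
    return dofs_per_cell;

  // No active DOFs anywhere: balance by cell count instead.
  return std::vector<unsigned int>(dofs_per_cell.size(), 1u);
}
```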

AshGannon commented 2 months ago

I will look into this more when I finish my SLUG talk. This is the output after commenting out lines 276-278, @Rombur:

image

Rombur commented 2 months ago

You probably have the same issue in serial, but because the checks are disabled in release mode, the code kept running. We should probably just skip the initialization if no cell is activated.
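Skipping the initialization could look roughly like the guard below. It is only a sketch under assumptions: initialize_if_any_cell_active is a hypothetical name, the initialization step is passed in as a callback rather than being the actual adamantine routine, and the cell count is assumed to be the global count so that all ranks agree.

```cpp
#include <cstddef>
#include <functional>

// Hypothetical guard, not adamantine code: run the initialization step only
// when at least one cell is active. Returns true if the step was run.
// Assumption: n_global_active_cells is identical on every MPI rank.
inline bool
initialize_if_any_cell_active(std::size_t n_global_active_cells,
                              std::function<void()> const &initialize)
{
  if (n_global_active_cells == 0)
    return false; // Nothing to initialize yet; retry after cells are added.

  initialize();
  return true;
}
```

Returning a flag lets the caller remember that the initialization is still pending and run it once the first cells are activated.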