Tpetra::CrsGraph<>::fillComplete runtime error when using periodic with overset

sayerhs commented 7 years ago

The TpetraLinearSystem::fillComplete throws an error when attempting to use periodic BC alongside overset BC. Steps to reproduce, replace inflow/outflow BC in reg_tests/test_files/oversetHybrid with periodic BC. The following error message is shown below:

Throw number = 1

Throw test that evaluated to true: (makeIndicesLocalResult.first != 0)

Tpetra::CrsGraph<int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false>::fillComplete: When converting column indices from global to local, we encountered 387 indices that do not live in the column Map on this process.  That's too many to print.

libc++abi.dylib: terminating with uncaught exception of type std::runtime_error: Trilinos/packages/tpetra/core/src/Tpetra_CrsGraph_def.hpp:3725:

Throw number = 1

Throw test that evaluated to true: (makeIndicesLocalResult.first != 0)

Tpetra::CrsGraph<int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false>::fillComplete: When converting column indices from global to local, we encountered 72 indices that does not live in the column Map on this process.
Global column indices not in the column Map:
Local row -1: [1605787,1605788,1605789,1635190,1635191,1635192,1634893,1634894,1634895,1605490,1605491,1605492,1605787,1605788,1605789,1635190,1635191,1635192,1634893,1634894,1634895,1605490,1605491,1605492,1605787,1605788,1605789,1635190,1635191,1635192,1634893,1634894,1634895,1605490,1605491,1605492,1635190,1635191,1635192,1605787,1605788,1605789,1605490,1605491,1605492,1634893,1634894,1634895,1635190,1635191,1635192,1605787,1605788,1605789,1605490,1605491,1605492,1634893,1634894,1634895,1635190,1635191,1635192,1605787,1605788,1605789,1605490,1605491,1605492,1634893,1634894,1634895]

Cc: @spdomin, @alanw0

sayerhs commented 7 years ago

A brief update: I notice that if I disable buildOversetNodeGraph then throw doesn't occur. Looks like addConnections attempts to not just modify the rows for constraint equations, but also adds off-diagonal terms for the donor element to this fringe (orphan node in STK overset lingo). The only place where periodic is checked is in getDOFStatus, so I am unsure why it appears to behave properly for beginLinearSystemConstruction but not for compute_graph_row_lengths where it is used again.

spdomin commented 7 years ago

This code provides the stencil initialization that is needed to add the fringe point column entry from the owning element. I think that we need it for correctness. The current code also provides a symmetric contribution as do all matrix connectivity algorithms. In this case, the fringe point has a column added to all rows of the owning element. However, these are always zero.

sayerhs commented 7 years ago

Any pointers on why this segment of the code is mapping the wrong global to local ID when periodic is on?

mhoemmen commented 7 years ago

Tpetra is helpfully telling users that they must have put the wrong row or column indices into the graph before first fillComplete. I will look at the error processing more closely, but the issue almost certainly arises as a result of input error.

mhoemmen commented 7 years ago

As for "local row -1", I am working on fixing the exception message output. Regardless, this exception indicates user error, for the particular case of a DynamicProfile CrsGraph, most likely created with a column Map.

mhoemmen commented 7 years ago

I'll repeat here what I just wrote to @sthomas61 , in case it helps:

If rowMap != rangeMap or rowMap != domainMap, then you MUST give the domain and range Map to fillComplete.

The domain and range Maps MUST be nonoverlapping (one to one).

globallyOwnedMatrix_ almost certainly must have an overlapping row Map.

This implies that you MUST give the domain and range Maps (in that order) to globallyOwnedMatrix_->fillComplete.

alanw0 commented 7 years ago

It looks like we are not passing those maps. It is probably a simple matter to pass those maps, which might fix this issue. But perhaps this is really arguing that we should construct our own local graph structure...

mhoemmen commented 7 years ago

@alanw0 wrote:

But perhaps this is really arguing that we should construct our own local graph structure...

Not necessarily. Epetra's FillComplete had and has the same requirement as Tpetra's fillComplete, namely that it needs explicit domain and range Maps if the row, domain, and range Maps are not all the same. This is always true for the overlapping ("shared") graph.

sayerhs commented 7 years ago

@mhoemmen This issue issn't implying that the problem was on the Tpetra side, it is a problem with Nalu's TpetraLinearSystem class that is creating the Graphs and calling fillComplete on it, and manifests in a very specific use case, namely combining overset and periodic BCs. I am just trying to understand what is different about this specific use case in Nalu that messes up the graph. Note that this is with Nalu master branch with no modifications to the source code.

It would be useful if we could somehow figure out the offending row's Global ID so that we can trace it through in a debugger.

mhoemmen commented 7 years ago

Hi @sayerhs ! @sthomas61 shared the code with me. It looks like the code needs to give the domain and range Map to the overlapping graph's constructor. I'm surprised that it wasn't; it must be that overset + periodic BCs had a separate code path.

sayerhs commented 7 years ago

Thanks @mhoemmen. I'll check with @sthomas61 what code he shared with you. I've been on a flight today and am just catching up with correspondence. I don't immediately see that the CrsGraph constructors are called differently in TpetraLinearSystem for this particular use case compared to all other boundary conditions. But it is likely that I have overlooked something that people more experienced with this class can comment on.

Looking at your comments above, it appears that we should always pass domain and range Maps which we haven't been doing in Nalu code. Did I understand you correctly?

mhoemmen commented 7 years ago

@sayerhs wrote:

Looking at your comments above, it appears that we should always pass domain and range Maps which we haven't been doing in Nalu code. Did I understand you correctly?

Yes. You need to use the correct domain and range Maps too; don't just pass in the row Map if it is not one to one. https://github.com/trilinos/Trilinos/issues/1925 proposes to check the input Maps in a debug build, but that doesn't change the semantics of fillComplete.

alanw0 commented 7 years ago

@sayerhs @mhoemmen In nalu I don't think we currently have a 1-to-1 map for our range maps. our row map is 1-to-1 for the owned graph/matrix, but not for the globallyOwned. We can use the Tpetra function for creating a 1-to-1 map, but that probably involves another communication step right?

alanw0 commented 7 years ago

And as I recall, the domainMap corresponds to the row distribution, while the rangeMap corresponds to the column distribution.

sayerhs commented 7 years ago

@alanw0 Except for the case of periodic + overset, all other Matrix setups pass this assertion

  TEUCHOS_TEST_FOR_EXCEPTION(!A_->getDomainMap()->isSameAs(*A_->getRangeMap()), std::runtime_error,
      Teuchos::typeName (*this) << "::apply(): A's domain and range map must be the same");

This would indicate that Domain and Range Maps are the same?

alanw0 commented 7 years ago

@sayerhs That's probably true. It may simply mean that our matrices are globally square, which is true. But it may also mean that we can pass our owned-rows row-map as the domain and range maps for all cases. That would be great, if it saves us the expense of creating a new 1-to-1 map.

sayerhs commented 7 years ago

@alanw0 If it is not too much trouble, can you explain the distinction between ownedRow vs. a globallyOwnedRow? I realize now that I am not sure of what this really means from a physical sense with regard to a STK node on a computational mesh.

alanw0 commented 7 years ago

@sayerhs ownedRow corresponds to a node that is owned by the local processor. globallyOwnedRow (not the best name) corresponds to a node that is on a processor boundary and may be owned by the other processor that shares it. In stk terminology those are just called shared nodes. A shared node exists on multiple processors (mpi ranks) and is owned by one of them.

mhoemmen commented 7 years ago

@alanw0 Would it make sense to call these matrices "owned" and "shared," so that they correspond exactly to STK terms?

alanw0 commented 7 years ago

@mhoemmen Perhaps "owned" and "ownedPlusShared" would be good names. I need to get deeper into the class to make sure my understanding of nalu's maps and graphs is completely correct.

sayerhs commented 7 years ago

@alanw0 Does that mean that globallyOwned indicates that it is a superset containing both owned (also in ownedRows) as well as the shared nodes?

alanw0 commented 7 years ago

@sayerhs I was just looking in the code and I'm having trouble figuring that out. It seems to be determined by the getDofStatus function, which unfortunately appears to have gotten out of control...

sayerhs commented 7 years ago

@alanw0 Yes, that is a big part of why I had to create this issue since I can't seem to fully grok the behavior of getDOFStatus especially when it is called from compute_graph_row_lengths.

alanw0 commented 7 years ago

@sayerhs I think it would be worthwhile to break getDofStatus out into a function that can be unit-tested. I see a potential bug in there, where it defines ghosted to be !owned && !shared. Technically a node could be both shared and ghosted, though that may be a rare scenario... But getDofStatus is complicated enough that nailing down its behavior in some unit-tests would probably be valuable.

mhoemmen commented 7 years ago

I have a patch that checks input Maps in fillComplete: https://github.com/trilinos/Trilinos/pull/1930

The patch causes some downstream tests to fail in Trilinos, so I won't apply it quite yet, but you're welcome to try it out.

spdomin commented 7 years ago

I agree that a name change would be useful as these questions keep re-surfacing.

dofStatus was not really designed for aura. As such, I feel that !shared and !owned is, in fact, ghosted:)

mhoemmen commented 7 years ago

I prefer the following terms for three sets that are disjoint on each MPI process:

Owned: DOF which my process computes (e.g., in linear solves)
Shared: DOF to which my process contributes, but which my process does not compute
Reachable: DOF on a neighboring process of mine, neither owned nor shared, but connected to an owned or shared DOF on my process

Here is how that maps to Tpetra data structures:

The nonoverlapping graph's / matrix's row Map has the owned DOFs.
The overlapping graph's / matrix's row Map has the shared DOFs.
The overlapping graph's / matrix's column Map includes owned DOFs, shared DOFs, and reachable DOFs.
The nonoverlapping graph's / matrix's column Map is a subset of the above, that may (generally does) include reachable DOFs.

(This reminds me of a conversation I had with Matt Bettencourt back in 2011 or so.)

Note that if the decomposition of elements over processes (for a node-based discretization) is unique -- that is, with no overlap -- then a process does not know the reachable DOFs a priori. That's what makes it hard to construct the graph with local indices. (This is equivalent to constructing the overlapping graph's column Map before actually having the overlapping graph.) With a sufficiently overlapping element decomposition -- something like Aura -- each process can know its reachable DOFs a priori.

There are other ways to discover reachable DOFs besides Aura. For example, you could use global indices to construct the overlapping and nonoverlapping graphs, then Export from the overlapping graph to the nonoverlapping graph, then let fillComplete on the latter create the column Map. However, that might not give you the column Map ordering that you want. Ideally, you would want owned first, then shared, then reachable. It is possible to reindex (reorder) an existing graph -- that is, to use the column Map you have to create the column Map you want, then reassign all the indices. Tpetra has a method for that.

Other approach is to use knowledge of neighboring processes (that typically comes from the mesh). You can figure out from your part of the mesh, what DOFs you have that might be reachable for your neighbors. You may then use boundary exchange (either STK's or four-argument Tpetra::Distributor::doPosts) to tell your neighbors their reachable DOFs. If you're clever, you can use this approach to bypass column Map and Tpetra::Import construction in Tpetra::CrsGraph::fillComplete. This will almost certainly avoid the issues Nalu noticed at extreme MPI process counts with arbitrary DOF ownership discovery (Tpetra::Map::getRemoteIndexList).

@tjfulle and @william76 might find this interesting.

sayerhs commented 7 years ago

@mhoemmen I like the proposed nomenclature! It is intuitive and the intent is clear.

alanw0 commented 7 years ago

@mhoemmen I agree with your description of owned or shared. I'm not as sure about reachable. I think that gets a little complicated with periodic and/or contact search ghosting. For a mesh with no ghosting but "regular" shared nodes on mpi-proc boundaries, reachable only includes "nodes on other procs that are connected to shared nodes that I own". But in the presence of ghosting, I think reachable also includes the ghosts. Stk-mesh allows you to identify "send ghosts" and "recv ghosts". A send ghost is a node that I own, which has been ghosted to another proc. A recv ghost is a remotely-owned node that has been ghosted to me. For periodic I'm even less sure... would reachable include remotely-owned nodes that are connected to periodic nodes that I own?

alanw0 commented 7 years ago

@spdomin In fact stk-mesh defines the 'aura' ghosting to be a 1-element-thick layer around shared, so by definition no node that is ghosted by the aura can also be shared (i.e., your definition is perfectly correct for aura). But for "custom" ghostings, nodes to ghost are purely chosen by the caller (i.e. nalu), and so a shared node could also be a ghosted node (to the same proc that shares the node, or to a different proc entirely). Perhaps that is a very rare scenario...

sthomas61 commented 7 years ago

In the case of our V27 mesh, with possibly hanging nodes there is a problem after assembly. When looping through the local rows on a rank, and pulling out the global column indices - and use these to look up a "node" in the STK bulk data - but somehow the aura/ghosting or hanging nodes leads to is_valid(node) = INVALID for some of these nodes.

mhoemmen commented 7 years ago

@alanw0 Do ghost DOFs have the same global index on both processes (owner and "ghoster"), or different global indices? For example, if my process ("owner") owns a DOF and it has been ghosted to a different process ("ghoster"), then does the "ghoster" see the same global index for that ghosted DOF that my "owner" process sees?

alanw0 commented 7 years ago

@mhoemmen for stk-mesh nodes, the global id is the same for the scenario you describe. However, nalu has a field 'naluGlobalID_' which allows for giving Tpetra global-ids that are different than the stk-mesh global-id. This happens in the case of periodic, I don't know of any other scenario that should cause that to happen.

alanw0 commented 7 years ago

@sthomas61 Does your case have periodic boundary conditions? If so, that might explain why there is a discrepancy like you describe. If not, then I'm not sure. Sounds like a bug somewhere....

sayerhs commented 7 years ago

@alanw0 You are correct, the only situation where stkID != naluGlobalID is for periodic nodes (and that too only on the slave part, the master part will have stkID == naluGlobalID.

Currently we don't have a situation where any custom ghosting ghosts nodes/elements if they are already shared on that particular rank.

mhoemmen commented 7 years ago

@alanw0 wrote:

This happens in the case of periodic, I don't know of any other scenario that should cause that to happen.

If STK already gives the owned and ghosted versions of a periodic BC the same GID, then why would the solver want them to have different GIDs?

I guess you could handle periodic BCs like Dirichlet BCs, by treating owned and ghosted versions as separate DOFs and using the matrix to constrain them to be equal. However, that could lead to trouble, because linear solvers only promise accuracy in the global L2 norm. You really need a stronger norm guarantee to get boundary conditions right that way.

I would prefer to see the ghosted DOFs excluded, then use an Export with REPLACE CombineMode to copy from owned to ghosted. (The Export's source Map would be the owned DOFs that are ghosted, and the target Map would be the corresponding ghosted DOFs.)

sayerhs commented 7 years ago

@mhoemmen The ghosted nodes are never part of the Tpetra Matrix, they are always excluded in the MPI rank they are ghosted onto. So are the slave nodes of a periodic boundary pair.

mhoemmen commented 7 years ago

@sayerhs Are "ghosted" and "slave" just different words / PDE concepts that describe the same mechanism?

alanw0 commented 7 years ago

@mhoemmen I think periodic involves creating pairs of nodes that are to be considered the same. If each pair contains a 'master' and a 'slave', then nalu associates the global-id of the master with the slave. Thus stkId != naluGlobalID_ for the slave. So it's not that nalu is giving a different id to the ghosted vs the owned node, but rather nalu is giving a different id to the slave node. I don't know if I made that more clear or less clear...

alanw0 commented 7 years ago

@mhoemmen "ghosted" is a stk concept which means take a node (or other mesh entity) owned on 1 mpi-proc, and make a copy of it on an arbitrary other mpi-proc. "slave" is a nalu concept which says "consider 1 node to be the same as another (master) node, regardless of whether both nodes are on the same mpi-proc or not". stk-mesh knows nothing about the 'slaving', and considers the nodes to still be distinct and different. nalu associates the global-id of the master node with the slave node for the purposes of Tpetra objects.

sayerhs commented 7 years ago

The notion of a slave node only occurs in periodic BC. The two BC planes considered periodic have coincident nodes (if one of the planes in the pair was translated to match the other plane). As we try to sumInto matrix, Nalu intercepts every access of slave node's row ID and instead passes the master node's row... so all elements connected to the slave node populate the row for the master row. And the elements connected to the master node access the master node's row ID as usual.

Ghosted elements occur when dealing with non-conformal or overset BC where a particular row (for a node) actually has entries from an element belonging to a different mesh block. Ghosting is necessary here to determine contributions to a particular node on MPI rank X when the element it needs contribution from exists on MPI rank Y. The non-conformal BC description in the theory manual might be useful in visualizing this.

spdomin commented 7 years ago

late to this...

I guess you could handle periodic BCs like Dirichlet BCs, by treating owned and ghosted versions as separate DOFs and using the matrix to constrain them to be equal. However, that could lead to trouble, because linear solvers only promise accuracy in the global L2 norm. You really need a stronger norm guarantee to get boundary conditions right that way.

Past codes at SNL managed periodic this way. However, the easiest manner to manage periodic is to manage ids such that they periodic pairs/triplets are the same.

This thread is getting rather long. It may be better to attack the problem with a short meeting and design documents.

For the record, we never wanted dofStatus. We always wanted a design that would remove load_complete() at the price of serving up multiple elements to a locally owned node. This is the Aura-based approach in STK that would ghost all elemental contributions to a locally owned row. This removes dofStatus and pushes a design to manage ghosted entities. In fact, we implemented this design a year+ ago. It was simple and well performing in absence of mesh modification.

The integrated solution will likely be to push towards an aura-based approach with ghosting management once aura is not expensive in STK. Alternatively, we can seek an approach where aura is off while an application custom-ghosting is used for defining all elements connected to a node.

spdomin commented 7 years ago

The more I review this thread, the more I feel that there is a need to meet and talk. There are too many concepts floating around. In some cases, I question some of the statements made. This may simply mean my understanding is lacking, however, I would like to close the gap.

@sthomas61, do you feel that this comment is related to the Tpetra overset/periodic use case as sliding mesh currently works fine. The comment below is related to a Hypre interface - at least that was my understanding.

In the case of our V27 mesh, with possibly hanging nodes there is a problem after assembly. When looping through the local rows on a rank, and pulling out the global column indices - and use these to look up a "node" in the STK bulk data - but somehow the aura/ghosting or hanging nodes leads to is_valid(node) = INVALID for some of these nodes.

mhoemmen commented 7 years ago

@spdomin It would make sense for us to write up a document. That would go along with efforts by @tjfulle and @william76 to codify best practices for using Tpetra. We haven't yet codified best practices for things like periodic BCs, oversets, and moving meshes, so now would be a good time :-) .

spdomin commented 7 years ago

Okay, later next week, I will add a story that speaks to the set of matrix contribution use cases. The theory document shows some good images for overset and I have a paper that speaks to sliding mesh.

When we first implemented the dofStatus, we excluded periodic being active with sliding-mesh/overset. Very recently, we extended support to include sliding mesh and periodic without too much effort in dofStatus. I thought looking back at that time the modifications required to support periodic/overset were straight forward. The full combination of all three active was the line I was not willing to cross with dofStatus and worked through a high level ghosting manager to get the job done. The issue with the ghosting management is the new ghosting connectivities that need to be made to ensure that one locally owned row in fact has all of the connectivity it needs. For periodic and sliding mesh, this becomes more complex - nothing is for free in this business.

Finally, we know that the ghosting design is what we need for high performance. Other PSAAP-II implicit projects have demonstrated this already. As such, we may want to make that turn sooner than later.

NaluCFD / Nalu

Tpetra::CrsGraph<>::fillComplete runtime error when using periodic with overset #211