A brief update: I notice that if I disable `buildOversetNodeGraph`, the throw does not occur. It looks like `addConnections` not only modifies the rows for constraint equations, but also adds off-diagonal terms for the donor element to this fringe (orphan node, in STK overset lingo). The only place where periodic is checked is in `getDOFStatus`, so I am unsure why it appears to behave properly for `beginLinearSystemConstruction` but not for `compute_graph_row_lengths`, where it is used again.
This code provides the stencil initialization that is needed to add the fringe point column entry from the owning element. I think that we need it for correctness. The current code also provides a symmetric contribution as do all matrix connectivity algorithms. In this case, the fringe point has a column added to all rows of the owning element. However, these are always zero.
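A hypothetical sketch of that symmetric connectivity insertion (the helper and the names `sharedGraph`, `fringeGid`, and `donorNodeGids` are placeholders, not Nalu's actual code): the fringe row gets columns for every donor-element node, and each donor-node row gets the fringe column (the entries that stay zero).

```cpp
#include <Tpetra_CrsGraph.hpp>
#include <Teuchos_ArrayView.hpp>
#include <Teuchos_Tuple.hpp>
#include <vector>

using graph_type = Tpetra::CrsGraph<>;
using GO         = graph_type::global_ordinal_type;

// Hypothetical helper: add fringe<->donor connectivity symmetrically.
void addFringeDonorConnections(graph_type& sharedGraph,
                               const GO fringeGid,
                               const std::vector<GO>& donorNodeGids)
{
  // The fringe row sees every node of the donor (owning) element.
  sharedGraph.insertGlobalIndices(
      fringeGid,
      Teuchos::ArrayView<const GO>(donorNodeGids.data(), donorNodeGids.size()));

  // Symmetric part: each donor-node row gets a column for the fringe node.
  // These entries remain zero, but they keep the graph structurally symmetric.
  for (const GO donorGid : donorNodeGids) {
    sharedGraph.insertGlobalIndices(donorGid, Teuchos::tuple<GO>(fringeGid));
  }
}
```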
Any pointers on why this segment of the code is mapping the wrong global to local ID when periodic is on?
Tpetra is helpfully telling users that they must have put the wrong row or column indices into the graph before first fillComplete. I will look at the error processing more closely, but the issue almost certainly arises as a result of input error.
As for "local row -1", I am working on fixing the exception message output. Regardless, this exception indicates user error, for the particular case of a DynamicProfile CrsGraph, most likely created with a column Map.
I'll repeat here what I just wrote to @sthomas61 , in case it helps:
- If rowMap != rangeMap or rowMap != domainMap, then you MUST give the domain and range Map to `fillComplete`.
- The domain and range Maps MUST be nonoverlapping (one to one).
- `globallyOwnedMatrix_` almost certainly must have an overlapping row Map.
- This implies that you MUST give the domain and range Maps (in that order) to `globallyOwnedMatrix_->fillComplete` (see the sketch below).
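Here is a minimal, self-contained sketch of that rule (not Nalu's code; the Maps, sizes, and names are made up): an overlapping row Map is built by duplicating one global row on every rank, and the one-to-one Map is then passed explicitly as both domain and range Map at fillComplete.

```cpp
#include <Tpetra_Core.hpp>
#include <Tpetra_CrsGraph.hpp>
#include <Tpetra_Map.hpp>
#include <Teuchos_ArrayView.hpp>
#include <Teuchos_OrdinalTraits.hpp>
#include <Teuchos_Tuple.hpp>
#include <vector>

int main(int argc, char* argv[]) {
  Tpetra::ScopeGuard tpetraScope(&argc, &argv);
  {
    using map_type   = Tpetra::Map<>;
    using graph_type = Tpetra::CrsGraph<>;
    using GO         = map_type::global_ordinal_type;
    auto comm = Tpetra::getDefaultComm();

    // One-to-one Map of owned rows (plays the role of Nalu's owned row Map).
    const Tpetra::global_size_t numGlobalRows = 10 * comm->getSize();
    auto ownedMap = Teuchos::rcp(new map_type(numGlobalRows, 0, comm));

    // Overlapping ("shared") row Map: every rank also lists global row 0,
    // so this Map is NOT one to one.
    std::vector<GO> myRows;
    for (GO g = ownedMap->getMinGlobalIndex(); g <= ownedMap->getMaxGlobalIndex(); ++g) {
      myRows.push_back(g);
    }
    if (comm->getRank() != 0) {
      myRows.push_back(0);
    }
    const Tpetra::global_size_t INV =
        Teuchos::OrdinalTraits<Tpetra::global_size_t>::invalid();
    auto sharedMap = Teuchos::rcp(new map_type(
        INV, Teuchos::ArrayView<const GO>(myRows.data(), myRows.size()), 0, comm));

    // Diagonal-only graph, just to have something to fill.
    graph_type sharedGraph(sharedMap, 1);
    for (const GO row : myRows) {
      sharedGraph.insertGlobalIndices(row, Teuchos::tuple<GO>(row));
    }

    // Because the row Map is overlapping, the one-to-one domain and range
    // Maps must be passed explicitly, in that order.
    sharedGraph.fillComplete(ownedMap, ownedMap);
  }
  return 0;
}
```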
It looks like we are not passing those maps. It is probably a simple matter to pass those maps, which might fix this issue. But perhaps this is really arguing that we should construct our own local graph structure...
@alanw0 wrote:
> But perhaps this is really arguing that we should construct our own local graph structure...
Not necessarily. Epetra's FillComplete had and has the same requirement as Tpetra's fillComplete, namely that it needs explicit domain and range Maps if the row, domain, and range Maps are not all the same. This is always true for the overlapping ("shared") graph.
@mhoemmen This issue isn't implying that the problem is on the Tpetra side; it is a problem with Nalu's `TpetraLinearSystem` class, which creates the graphs and calls fillComplete on them, and it manifests in a very specific use case, namely combining overset and periodic BCs. I am just trying to understand what is different about this specific use case in Nalu that messes up the graph. Note that this is with the Nalu `master` branch with no modifications to the source code.
It would be useful if we could somehow figure out the offending row's Global ID so that we can trace it through in a debugger.
Hi @sayerhs ! @sthomas61 shared the code with me. It looks like the code needs to give the domain and range Map to the overlapping graph's constructor. I'm surprised that it wasn't; it must be that overset + periodic BCs had a separate code path.
Thanks @mhoemmen. I'll check with @sthomas61 what code he shared with you. I've been on a flight today and am just catching up with correspondence. I don't immediately see that the CrsGraph constructors are called differently in TpetraLinearSystem for this particular use case compared to all other boundary conditions. But it is likely that I have overlooked something that people more experienced with this class can comment on.
Looking at your comments above, it appears that we should always pass domain and range Maps which we haven't been doing in Nalu code. Did I understand you correctly?
@sayerhs wrote:
> Looking at your comments above, it appears that we should always pass domain and range Maps which we haven't been doing in Nalu code. Did I understand you correctly?
Yes. You need to use the correct domain and range Maps too; don't just pass in the row Map if it is not one to one. https://github.com/trilinos/Trilinos/issues/1925 proposes to check the input Maps in a debug build, but that doesn't change the semantics of fillComplete.
@sayerhs @mhoemmen In Nalu I don't think we currently have a 1-to-1 map for our range maps. Our row map is 1-to-1 for the owned graph/matrix, but not for the globallyOwned one. We can use the Tpetra function for creating a 1-to-1 map, but that probably involves another communication step, right?
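For reference, a minimal sketch of that Tpetra helper (the wrapper and the name `sharedRowsMap` are illustrative, not Nalu code); `Tpetra::createOneToOne` does involve communication to resolve ownership:

```cpp
#include <Tpetra_Map.hpp>
#include <Teuchos_RCP.hpp>

// Given an overlapping Map, let Tpetra pick a unique owner per global index.
// This performs communication, as noted above.
Teuchos::RCP<const Tpetra::Map<>>
makeOneToOneMap(const Teuchos::RCP<const Tpetra::Map<>>& sharedRowsMap)
{
  return Tpetra::createOneToOne(sharedRowsMap);
}
```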
And as I recall, the rangeMap corresponds to the row distribution, while the domainMap corresponds to the column distribution (i.e., the Map of the vector the matrix is applied to).
@alanw0 Except for the case of periodic + overset, all other matrix setups pass this assertion:

```cpp
TEUCHOS_TEST_FOR_EXCEPTION(!A_->getDomainMap()->isSameAs(*A_->getRangeMap()), std::runtime_error,
  Teuchos::typeName (*this) << "::apply(): A's domain and range map must be the same");
```

Wouldn't this indicate that the domain and range Maps are the same?
@sayerhs That's probably true. It may simply mean that our matrices are globally square, which is true. But it may also mean that we can pass our owned-rows row-map as the domain and range maps for all cases. That would be great, if it saves us the expense of creating a new 1-to-1 map.
@alanw0 If it is not too much trouble, can you explain the distinction between an ownedRow and a globallyOwnedRow? I realize now that I am not sure what this really means in a physical sense with regard to an STK node on a computational mesh.
@sayerhs ownedRow corresponds to a node that is owned by the local processor. globallyOwnedRow (not the best name) corresponds to a node that is on a processor boundary and may be owned by the other processor that shares it. In stk terminology those are just called shared nodes. A shared node exists on multiple processors (mpi ranks) and is owned by one of them.
@alanw0 Would it make sense to call these matrices "owned" and "shared," so that they correspond exactly to STK terms?
@mhoemmen Perhaps "owned" and "ownedPlusShared" would be good names. I need to get deeper into the class to make sure my understanding of nalu's maps and graphs is completely correct.
@alanw0 Does that mean that `globallyOwned` indicates a superset containing both owned (also in ownedRows) and the shared nodes?
@sayerhs I was just looking in the code and I'm having trouble figuring that out. It seems to be determined by the getDofStatus function, which unfortunately appears to have gotten out of control...
@alanw0 Yes, that is a big part of why I had to create this issue, since I can't seem to fully grok the behavior of `getDOFStatus`, especially when it is called from `compute_graph_row_lengths`.
@sayerhs I think it would be worthwhile to break getDofStatus out into a function that can be unit-tested. I see a potential bug in there, where it defines ghosted to be !owned && !shared. Technically a node could be both shared and ghosted, though that may be a rare scenario... But getDofStatus is complicated enough that nailing down its behavior in some unit-tests would probably be valuable.
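As a rough illustration of that suggestion (all names here are hypothetical, not Nalu's actual getDofStatus signature), a free function returning bit flags makes the shared-and-ghosted case representable and easy to pin down in a unit test:

```cpp
#include <cassert>

// Hypothetical DOF status as bit flags, so "shared AND ghosted" is representable.
enum DofStatus : unsigned {
  DS_NONE    = 0,
  DS_OWNED   = 1u << 0,
  DS_SHARED  = 1u << 1,
  DS_GHOSTED = 1u << 2
};

// Hypothetical, unit-testable helper: classify a node from simple predicates
// instead of deriving "ghosted" as !owned && !shared.
unsigned dofStatus(bool locallyOwned, bool shared, bool recvGhosted)
{
  unsigned status = DS_NONE;
  if (locallyOwned) status |= DS_OWNED;
  if (shared)       status |= DS_SHARED;
  if (recvGhosted)  status |= DS_GHOSTED;
  return status;
}

int main()
{
  // A node can be both shared and ghosted (custom ghosting), which a
  // three-way owned/shared/ghosted classification cannot express.
  assert(dofStatus(false, true, true) == (DS_SHARED | DS_GHOSTED));
  assert(dofStatus(true, false, false) == DS_OWNED);
  return 0;
}
```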
I have a patch that checks input Maps in fillComplete: https://github.com/trilinos/Trilinos/pull/1930
The patch causes some downstream tests to fail in Trilinos, so I won't apply it quite yet, but you're welcome to try it out.
I agree that a name change would be useful as these questions keep re-surfacing.
dofStatus was not really designed for aura. As such, I feel that !shared && !owned is, in fact, ghosted :)
I prefer the following terms for three sets that are disjoint on each MPI process: owned, shared, and reachable. Here is how that maps to Tpetra data structures (see the sketch below):
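One plausible way to express that mapping (illustrative only; the struct, function, and Map names below are assumptions, following the usual convention that owned DOFs form the one-to-one row/domain/range Map, owned+shared form the overlapping row Map, and owned+shared+reachable form the column Map):

```cpp
#include <Tpetra_Map.hpp>
#include <Teuchos_ArrayView.hpp>
#include <Teuchos_OrdinalTraits.hpp>
#include <vector>

using map_type = Tpetra::Map<>;
using GO       = map_type::global_ordinal_type;

// Illustrative only: build the three Maps from per-process GID lists.
//   ownedGids:     DOFs this process owns (one to one across processes)
//   sharedGids:    DOFs this process sees but does not own
//   reachableGids: off-process DOFs coupled to DOFs this process owns/shares
struct DofMaps {
  Teuchos::RCP<const map_type> ownedMap;        // row / domain / range Map
  Teuchos::RCP<const map_type> ownedPlusShared; // overlapping row Map
  Teuchos::RCP<const map_type> columnMap;       // owned + shared + reachable
};

DofMaps buildDofMaps(const std::vector<GO>& ownedGids,
                     const std::vector<GO>& sharedGids,
                     const std::vector<GO>& reachableGids,
                     const Teuchos::RCP<const Teuchos::Comm<int>>& comm)
{
  const Tpetra::global_size_t INV =
      Teuchos::OrdinalTraits<Tpetra::global_size_t>::invalid();
  auto makeMap = [&](const std::vector<GO>& gids) {
    return Teuchos::rcp(new map_type(
        INV, Teuchos::ArrayView<const GO>(gids.data(), gids.size()), 0, comm));
  };

  std::vector<GO> ownedShared(ownedGids);
  ownedShared.insert(ownedShared.end(), sharedGids.begin(), sharedGids.end());

  std::vector<GO> all(ownedShared);
  all.insert(all.end(), reachableGids.begin(), reachableGids.end());

  return DofMaps{ makeMap(ownedGids), makeMap(ownedShared), makeMap(all) };
}
```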
(This reminds me of a conversation I had with Matt Bettencourt back in 2011 or so.)
Note that if the decomposition of elements over processes (for a node-based discretization) is unique -- that is, with no overlap -- then a process does not know the reachable DOFs a priori. That's what makes it hard to construct the graph with local indices. (This is equivalent to constructing the overlapping graph's column Map before actually having the overlapping graph.) With a sufficiently overlapping element decomposition -- something like Aura -- each process can know its reachable DOFs a priori.
There are other ways to discover reachable DOFs besides Aura. For example, you could use global indices to construct the overlapping and nonoverlapping graphs, then Export from the overlapping graph to the nonoverlapping graph, then let fillComplete on the latter create the column Map. However, that might not give you the column Map ordering that you want. Ideally, you would want owned first, then shared, then reachable. It is possible to reindex (reorder) an existing graph -- that is, to use the column Map you have to create the column Map you want, then reassign all the indices. Tpetra has a method for that.
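A minimal sketch of one variant of that flow (illustrative names; it assumes both graphs have already been populated with global indices, and it calls fillComplete on the overlapping graph first, as discussed earlier in this thread):

```cpp
#include <Tpetra_CrsGraph.hpp>
#include <Tpetra_Export.hpp>
#include <Tpetra_Map.hpp>

using map_type    = Tpetra::Map<>;
using graph_type  = Tpetra::CrsGraph<>;
using export_type = Tpetra::Export<>;

// Illustrative: finish the overlapping ("shared") graph, ship its off-process
// rows to their owners, then let the owned graph's fillComplete build the
// column Map.
void finishGraphs(graph_type& sharedGraph,
                  graph_type& ownedGraph,
                  const Teuchos::RCP<const map_type>& ownedMap)
{
  // Overlapping row Map, so explicit one-to-one domain and range Maps.
  sharedGraph.fillComplete(ownedMap, ownedMap);

  // Export rows from the overlapping graph to the nonoverlapping one.
  export_type exporter(sharedGraph.getRowMap(), ownedGraph.getRowMap());
  ownedGraph.doExport(sharedGraph, exporter, Tpetra::INSERT);

  // The owned graph's row Map is one to one; this fillComplete creates its
  // column Map (whose ordering may not be what you want; see above).
  ownedGraph.fillComplete();
}
```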
Another approach is to use knowledge of neighboring processes (which typically comes from the mesh). From your part of the mesh, you can figure out which of your DOFs might be reachable for your neighbors. You may then use a boundary exchange (either STK's, or the four-argument `Tpetra::Distributor::doPosts`) to tell your neighbors their reachable DOFs. If you're clever, you can use this approach to bypass column Map and `Tpetra::Import` construction in `Tpetra::CrsGraph::fillComplete`. This will almost certainly avoid the issues Nalu noticed at extreme MPI process counts with arbitrary DOF ownership discovery (`Tpetra::Map::getRemoteIndexList`).
@tjfulle and @william76 might find this interesting.
@mhoemmen I like the proposed nomenclature! It is intuitive and the intent is clear.
@mhoemmen I agree with your description of owned or shared. I'm not as sure about reachable. I think that gets a little complicated with periodic and/or contact search ghosting. For a mesh with no ghosting but "regular" shared nodes on mpi-proc boundaries, reachable only includes "nodes on other procs that are connected to shared nodes that I own". But in the presence of ghosting, I think reachable also includes the ghosts. Stk-mesh allows you to identify "send ghosts" and "recv ghosts". A send ghost is a node that I own, which has been ghosted to another proc. A recv ghost is a remotely-owned node that has been ghosted to me. For periodic I'm even less sure... would reachable include remotely-owned nodes that are connected to periodic nodes that I own?
@spdomin In fact stk-mesh defines the 'aura' ghosting to be a 1-element-thick layer around shared, so by definition no node that is ghosted by the aura can also be shared (i.e., your definition is perfectly correct for aura). But for "custom" ghostings, nodes to ghost are purely chosen by the caller (i.e. nalu), and so a shared node could also be a ghosted node (to the same proc that shares the node, or to a different proc entirely). Perhaps that is a very rare scenario...
In the case of our V27 mesh, with possibly hanging nodes, there is a problem after assembly. When we loop through the local rows on a rank, pull out the global column indices, and use these to look up a node in the STK bulk data, somehow the aura/ghosting or the hanging nodes lead to is_valid(node) = INVALID for some of these nodes.
@alanw0 Do ghost DOFs have the same global index on both processes (owner and "ghoster"), or different global indices? For example, if my process ("owner") owns a DOF and it has been ghosted to a different process ("ghoster"), then does the "ghoster" see the same global index for that ghosted DOF that my "owner" process sees?
@mhoemmen for stk-mesh nodes, the global id is the same for the scenario you describe. However, nalu has a field 'naluGlobalID_' which allows for giving Tpetra global-ids that are different than the stk-mesh global-id. This happens in the case of periodic, I don't know of any other scenario that should cause that to happen.
@sthomas61 Does your case have periodic boundary conditions? If so, that might explain why there is a discrepancy like you describe. If not, then I'm not sure. Sounds like a bug somewhere....
@alanw0 You are correct; the only situation where `stkID != naluGlobalID` is for periodic nodes (and that too only on the slave part; the master part will have `stkID == naluGlobalID`).
Currently we don't have a situation where any custom ghosting ghosts nodes/elements if they are already shared on that particular rank.
@alanw0 wrote:
> This happens in the case of periodic, I don't know of any other scenario that should cause that to happen.
If STK already gives the owned and ghosted versions of a periodic BC the same GID, then why would the solver want them to have different GIDs?
I guess you could handle periodic BCs like Dirichlet BCs, by treating owned and ghosted versions as separate DOFs and using the matrix to constrain them to be equal. However, that could lead to trouble, because linear solvers only promise accuracy in the global L2 norm. You really need a stronger norm guarantee to get boundary conditions right that way.
I would prefer to see the ghosted DOFs excluded, then use an Export with REPLACE CombineMode to copy from owned to ghosted. (The Export's source Map would be the owned DOFs that are ghosted, and the target Map would be the corresponding ghosted DOFs.)
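A hedged sketch of that Export-with-REPLACE idea (the function and vector names are placeholders; it assumes `ownedValues` lives on the Map of owned DOFs that get ghosted and `ghostedValues` on the Map of the corresponding ghosted DOFs):

```cpp
#include <Tpetra_Export.hpp>
#include <Tpetra_Map.hpp>
#include <Tpetra_Vector.hpp>

using vector_type = Tpetra::Vector<>;
using export_type = Tpetra::Export<>;

// Illustrative: copy solution values from owned DOFs to their ghosted copies,
// overwriting (REPLACE) whatever the ghosted copies held before.
void copyOwnedToGhosted(const vector_type& ownedValues,  // Map: owned DOFs that are ghosted
                        vector_type& ghostedValues)      // Map: corresponding ghosted DOFs
{
  export_type exporter(ownedValues.getMap(), ghostedValues.getMap());
  ghostedValues.doExport(ownedValues, exporter, Tpetra::REPLACE);
}
```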
@mhoemmen The ghosted nodes are never part of the Tpetra Matrix, they are always excluded in the MPI rank they are ghosted onto. So are the slave nodes of a periodic boundary pair.
@sayerhs Are "ghosted" and "slave" just different words / PDE concepts that describe the same mechanism?
@mhoemmen I think periodic involves creating pairs of nodes that are to be considered the same. If each pair contains a 'master' and a 'slave', then nalu associates the global-id of the master with the slave. Thus stkId != naluGlobalID_ for the slave. So it's not that nalu is giving a different id to the ghosted vs the owned node, but rather nalu is giving a different id to the slave node. I don't know if I made that more clear or less clear...
@mhoemmen "ghosted" is a stk concept which means take a node (or other mesh entity) owned on 1 mpi-proc, and make a copy of it on an arbitrary other mpi-proc. "slave" is a nalu concept which says "consider 1 node to be the same as another (master) node, regardless of whether both nodes are on the same mpi-proc or not". stk-mesh knows nothing about the 'slaving', and considers the nodes to still be distinct and different. nalu associates the global-id of the master node with the slave node for the purposes of Tpetra objects.
The notion of a slave node only occurs in the periodic BC. The two BC planes considered periodic have coincident nodes (if one of the planes in the pair were translated to match the other plane). As we sumInto the matrix, Nalu intercepts every access of a slave node's row ID and instead passes the master node's row, so all elements connected to the slave node populate the master node's row. The elements connected to the master node access the master node's row ID as usual.
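A rough, hypothetical sketch of that interception (the map `slaveToMasterGid` and both helpers are made up, not Nalu's actual code): every row and column GID is remapped before the sumInto call, so slave contributions land in the master's row and column.

```cpp
#include <Tpetra_CrsMatrix.hpp>
#include <Teuchos_ArrayView.hpp>
#include <unordered_map>
#include <vector>

using matrix_type = Tpetra::CrsMatrix<>;
using GO     = matrix_type::global_ordinal_type;
using Scalar = matrix_type::scalar_type;

// Hypothetical: redirect a slave node's GID to its periodic master's GID.
GO effectiveGid(const std::unordered_map<GO, GO>& slaveToMasterGid, GO gid)
{
  const auto it = slaveToMasterGid.find(gid);
  return (it == slaveToMasterGid.end()) ? gid : it->second;
}

// Hypothetical sumInto wrapper: remap both the row and the column GIDs.
void sumIntoRemapped(matrix_type& A,
                     const std::unordered_map<GO, GO>& slaveToMasterGid,
                     GO rowGid,
                     std::vector<GO> colGids,
                     const std::vector<Scalar>& values)
{
  for (GO& c : colGids) c = effectiveGid(slaveToMasterGid, c);
  A.sumIntoGlobalValues(effectiveGid(slaveToMasterGid, rowGid),
                        Teuchos::ArrayView<const GO>(colGids.data(), colGids.size()),
                        Teuchos::ArrayView<const Scalar>(values.data(), values.size()));
}
```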
Ghosted elements occur when dealing with non-conformal or overset BC where a particular row (for a node) actually has entries from an element belonging to a different mesh block. Ghosting is necessary here to determine contributions to a particular node on MPI rank X when the element it needs contribution from exists on MPI rank Y. The non-conformal BC description in the theory manual might be useful in visualizing this.
> I guess you could handle periodic BCs like Dirichlet BCs, by treating owned and ghosted versions as separate DOFs and using the matrix to constrain them to be equal. However, that could lead to trouble, because linear solvers only promise accuracy in the global L2 norm. You really need a stronger norm guarantee to get boundary conditions right that way.
Past codes at SNL managed periodic this way. However, the easiest way to manage periodic is to manage IDs such that the periodic pairs/triplets are the same.
This thread is getting rather long. It may be better to attack the problem with a short meeting and design documents.
For the record, we never wanted dofStatus. We always wanted a design that would remove load_complete() at the price of serving up multiple elements to a locally owned node. This is the Aura-based approach in STK that would ghost all elemental contributions to a locally owned row. This removes dofStatus and pushes a design to manage ghosted entities. In fact, we implemented this design a year+ ago. It was simple and well performing in absence of mesh modification.
The integrated solution will likely be to push towards an aura-based approach with ghosting management once aura is not expensive in STK. Alternatively, we can seek an approach where aura is off while an application custom-ghosting is used for defining all elements connected to a node.
The more I review this thread, the more I feel that there is a need to meet and talk. There are too many concepts floating around. In some cases, I question some of the statements made. This may simply mean my understanding is lacking, however, I would like to close the gap.
@sthomas61, do you feel that this comment is related to the Tpetra overset/periodic use case, since sliding mesh currently works fine? The comment below is related to a Hypre interface - at least, that was my understanding.
> In the case of our V27 mesh, with possibly hanging nodes, there is a problem after assembly. When we loop through the local rows on a rank, pull out the global column indices, and use these to look up a node in the STK bulk data, somehow the aura/ghosting or the hanging nodes lead to is_valid(node) = INVALID for some of these nodes.
@spdomin It would make sense for us to write up a document. That would go along with efforts by @tjfulle and @william76 to codify best practices for using Tpetra. We haven't yet codified best practices for things like periodic BCs, oversets, and moving meshes, so now would be a good time :-) .
Okay, later next week, I will add a story that speaks to the set of matrix contribution use cases. The theory document shows some good images for overset and I have a paper that speaks to sliding mesh.
When we first implemented the dofStatus, we excluded periodic being active with sliding-mesh/overset. Very recently, we extended support to include sliding mesh and periodic without too much effort in dofStatus. Looking back, I thought at the time that the modifications required to support periodic/overset were straightforward. The full combination of all three active was the line I was not willing to cross with dofStatus, and I worked through a high-level ghosting manager to get the job done. The issue with the ghosting management is the new ghosting connectivities that need to be made to ensure that one locally owned row in fact has all of the connectivity it needs. For periodic and sliding mesh, this becomes more complex - nothing is for free in this business.
Finally, we know that the ghosting design is what we need for high performance. Other PSAAP-II implicit projects have demonstrated this already. As such, we may want to make that turn sooner than later.
The `TpetraLinearSystem::fillComplete` throws an error when attempting to use a periodic BC alongside an overset BC. Steps to reproduce: replace the inflow/outflow BC in `reg_tests/test_files/oversetHybrid` with a periodic BC. The error message is shown below.

Cc: @spdomin, @alanw0