Closed — jrood-nrel closed this 4 months ago
I do not want to revert this again. We need to get this in to move forward with Trilinos. Let's debug it together.
I think fixing this without a revert is a pill we have to swallow. @djglaze let's loop you into the conversation.
Dave is out on vacation until the 10th, and then after that he has jury duty so his availability is going to be unknown...
Of course... Murphy and his ubiquitous law.... Well I am happy to try to track it down.
Arrrgh! Yeah, I'm on the road now between destinations. Sorry about that. I ran all the GPU tests successfully, so I guess our coverage might not be great.
I think I'd attack this with a CPU build using the STK_USE_DEVICE_MESH define. Then, I'd see if the GPU tests reveal anything. We can also then attack some tests with valgrind to see if anything comes up.
Dave
It segfaults on the CPU.
@jrood-nrel @psakievich I'm back in town and available to help out with debugging this issue. I don't have access to Frontier anymore, which hinders things a bit. Is there a stack trace from the seg-fault? The last one was figure-outable from just the stack trace, so hopefully that will be enough here.
I'm assuming this failure was on a huge problem. Is there any way you can post the input deck somewhere, so that I can see what models were running? The mesh would be great, too, although I'm guessing it's too big to easily share.
@djglaze last I checked with @jrood-nrel, the failure was only observed so far when running the exawind driver. I was not able to reproduce it locally. It had to do with field accessors, so I don't think it is related to the size of the problem.
Thanks, @psakievich. I'm really out of the loop on this project, apparently. I don't know what "exawind driver" means. Is this a particular simulation configuration?
That makes sense that it might not be directly due to the size of the problem. I'm betting it's something tricky with a Field that's registered with different sizes on different parts of the domain, maybe due to mixed element topologies. It's probably a combination of things that we don't have representation for in our unit/regression tests. I've got enough experience with these simple_fields changes over the last couple years that there's a slim chance I can think it through if I can get my hands on a stack trace.
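To make the hypothesis above concrete, here is a minimal, STK-free sketch (the `Bucket` struct and function names are illustrative, not the STK API) of how a field registered with different component counts on different parts can blow up when downstream code assumes one uniform size:

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for an STK bucket: contiguous field data for a
// group of entities that all share the same number of components.
struct Bucket {
  std::size_t numEntities;
  std::size_t numComponents;      // can differ between parts of the domain
  std::vector<double> fieldData;  // size = numEntities * numComponents
};

// Correct pattern: query each bucket's own component count.
double sum_field(const std::vector<Bucket>& buckets) {
  double total = 0.0;
  for (const auto& b : buckets)
    for (std::size_t i = 0; i < b.numEntities * b.numComponents; ++i)
      total += b.fieldData[i];
  return total;
}

// Buggy pattern: assume a single, global component count. If one part
// registered the field with fewer components, this indexes past the end
// of that bucket's allocation (caught here by at(); a segfault with raw
// pointers). Returns true only if every access stayed in bounds.
bool assumes_uniform_size(const std::vector<Bucket>& buckets,
                          std::size_t assumedComponents) {
  try {
    double total = 0.0;
    for (const auto& b : buckets)
      for (std::size_t i = 0; i < b.numEntities * assumedComponents; ++i)
        total += b.fieldData.at(i);
    (void)total;
    return true;
  } catch (const std::out_of_range&) {
    return false;  // read past the bucket's real extent
  }
}
```

With mixed topologies, the undersized part works fine in isolation; the crash only appears when an algorithm iterates every bucket with the larger assumed stride, which is exactly the kind of combination unit tests tend to miss.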
#0 0x00007fffe7abc60f in void stk::mesh::field_fill<double>(double, stk::mesh::FieldBase const&) ()
from /lustre/orion/cfd162/proj-shared/jrood/exawind-manager/opt/linux-sles15-zen3/2024-03-29/linux-sles15-zen3/clang-15.0.0/nalu-wind-master-r4x5mc4j5ecgltvibvpdt76leudodj3b/lib/libnalu.so
#1 0x00007fffe8210951 in sierra::nalu::NodalGradAlgDriver<stk::mesh::Field<double, void, void, void, void, void, void, void> >::pre_work() ()
from /lustre/orion/cfd162/proj-shared/jrood/exawind-manager/opt/linux-sles15-zen3/2024-03-29/linux-sles15-zen3/clang-15.0.0/nalu-wind-master-r4x5mc4j5ecgltvibvpdt76leudodj3b/lib/libnalu.so
#2 0x00007fffe820890e in sierra::nalu::NgpAlgDriver::execute() ()
from /lustre/orion/cfd162/proj-shared/jrood/exawind-manager/opt/linux-sles15-zen3/2024-03-29/linux-sles15-zen3/clang-15.0.0/nalu-wind-master-r4x5mc4j5ecgltvibvpdt76leudodj3b/lib/libnalu.so
#3 0x00007fffe7bd3527 in sierra::nalu::SpecificDissipationRateEquationSystem::assemble_nodal_gradient() ()
from /lustre/orion/cfd162/proj-shared/jrood/exawind-manager/opt/linux-sles15-zen3/2024-03-29/linux-sles15-zen3/clang-15.0.0/nalu-wind-master-r4x5mc4j5ecgltvibvpdt76leudodj3b/lib/libnalu.so
#4 0x00007fffe7b99aa8 in sierra::nalu::ShearStressTransportEquationSystem::solve_and_update() ()
from /lustre/orion/cfd162/proj-shared/jrood/exawind-manager/opt/linux-sles15-zen3/2024-03-29/linux-sles15-zen3/clang-15.0.0/nalu-wind-master-r4x5mc4j5ecgltvibvpdt76leudodj3b/lib/libnalu.so
#5 0x00007fffe7a623f4 in sierra::nalu::EquationSystems::solve_and_update() ()
from /lustre/orion/cfd162/proj-shared/jrood/exawind-manager/opt/linux-sles15-zen3/2024-03-29/linux-sles15-zen3/clang-15.0.0/nalu-wind-master-r4x5mc4j5ecgltvibvpdt76leudodj3b/lib/libnalu.so
#6 0x00007fffe7b75f2f in sierra::nalu::Realm::nonlinear_iterations(int) ()
from /lustre/orion/cfd162/proj-shared/jrood/exawind-manager/opt/linux-sles15-zen3/2024-03-29/linux-sles15-zen3/clang-15.0.0/nalu-wind-master-r4x5mc4j5ecgltvibvpdt76leudodj3b/lib/libnalu.so
#7 0x00007fffe7b7491b in sierra::nalu::Realm::advance_time_step() ()
from /lustre/orion/cfd162/proj-shared/jrood/exawind-manager/opt/linux-sles15-zen3/2024-03-29/linux-sles15-zen3/clang-15.0.0/nalu-wind-master-r4x5mc4j5ecgltvibvpdt76leudodj3b/lib/libnalu.so
#8 0x0000000000219f6f in exawind::NaluWind::advance_timestep(unsigned long) ()
#9 0x0000000000224ac4 in exawind::ExawindSolver::call_advance_timestep(unsigned long, bool) ()
#10 0x0000000000222a14 in exawind::OversetSimulation::run_timesteps(int, int, int) ()
#11 0x000000000022c40e in main ()
@djglaze the exawind-driver is the top-level code that is used to couple nalu-wind and amr-wind. That code links against libnalu.so and reconstructs the guts of a Nalu::Simulation object's operations so they can be interwoven with amr-wind's solve process. In the stack trace above you can see the transition based on where the leading namespace switches from exawind to sierra.
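A minimal sketch of that coupling arrangement, with hypothetical names (only the shape mirrors the stack trace above; none of these types are the real driver's API):

```cpp
#include <cassert>
#include <vector>

// Each solver exposes a single-step interface, analogous to
// NaluWind::advance_timestep in the stack trace.
struct Solver {
  virtual ~Solver() = default;
  virtual void advance_timestep() = 0;
};

struct NaluLike : Solver {
  int stepsTaken = 0;
  // In the real code this would descend into Realm::advance_time_step().
  void advance_timestep() override { ++stepsTaken; }
};

struct AmrLike : Solver {
  int stepsTaken = 0;
  void advance_timestep() override { ++stepsTaken; }
};

// Analogous to OversetSimulation::run_timesteps: the driver owns the
// outer time loop and interleaves the solvers' steps, rather than
// letting either code drive the other.
void run_timesteps(std::vector<Solver*>& solvers, int nSteps) {
  for (int step = 0; step < nSteps; ++step)
    for (Solver* s : solvers)
      s->advance_timestep();  // the real driver also exchanges overset data here
}
```

The point is that the driver, not nalu-wind, calls down into libnalu.so each step, which is why the crash only reproduces under the driver even though nalu-wind's own standalone tests pass.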
I've been studying the stack trace that @jrood-nrel provided, and the code looks solid to me. I'm starting to wonder if the issue is elsewhere, and it's just manifesting at this location.
I've been unable (with only a modest amount of effort) to get current versions of spack-manager or exawind-manager working for my builds. The documentation is not current, and I was unable to view the Slack channel with more information because my accounts have lapsed. So, I'm running a version of spack-manager from a couple of months ago, before the big modularization refactor, and I can't see how things are currently being built. Are you guys still running a version of Trilinos pinned to 2023-02-28?
I hot-wired my local Trilinos version to be new enough that I could attack it with the STK manual Field memory-poisoning tool (which @psakievich is likely familiar with from the Sierra stk-field-asan dashboard line), and I found a few minor issues among all of the regression and unit tests. PR #1256 fixes what I found, although those issues are unlikely to be the cause of this seg-fault.

I've identified four STK commits from after 2023-02-28 that fix various potential memory-corruption issues. These pre-existing issues were discovered while running all of the Sierra tests, using both the simple_fields changes (which you have) and the new variable-capacity Buckets changes (which you don't have). I strongly recommend moving your Trilinos version to something much newer, to get the benefit of these fixes as well as others that I undoubtedly missed. Something after 2023-07-12, Trilinos SHA 8be8abe06fec, is preferred.
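For anyone unfamiliar with the poisoning technique, here is a minimal sketch of the idea (the names are illustrative, not the STK tool's API): fill any allocation slack beyond a field's valid extent with a sentinel, run the code under test, then verify the sentinel survived.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sentinel value that legitimate field data is very unlikely to contain.
constexpr double kPoison = -6.02214076e23;

// Write the sentinel into every slot past the field's valid extent.
void poison_padding(std::vector<double>& data, std::size_t validCount) {
  for (std::size_t i = validCount; i < data.size(); ++i)
    data[i] = kPoison;
}

// After exercising the code under test, check whether anything wrote
// past the valid extent; a disturbed sentinel pinpoints the corruption.
bool padding_intact(const std::vector<double>& data, std::size_t validCount) {
  for (std::size_t i = validCount; i < data.size(); ++i)
    if (data[i] != kPoison) return false;
  return true;
}
```

Unlike valgrind or ASan, this catches writes that land inside the allocation but outside the logically valid field data, which is exactly where bucket-capacity padding bugs hide.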
If you need to stay pinned to 2023-02-28, then I'd recommend applying 4 new Trilinos patches that correspond to these Sierra commits (which @psakievich has access to):

- 296698c2b92e: Add a safety check to stk::mesh::field_copy(), which is used by nalu-wind.
- d8a5bddc652c: Fixed zero-initialization optimization in FieldBLAS::fill(). This can corrupt memory if you call the function field_fill_component() or field_fill() with a non-unity stride and an initial offset past the beginning of the Field. nalu-wind uses problematic calls in at least 6 places. There are hundreds of additional field_fill() calls that should be safe, although it's suspicious that this is where the seg-fault occurred.
- 71f879881ced: Fixed memory error when reading int Fields from disk. I don't think you read int Fields from an Exodus file, but if you do, this can write beyond the bounds of the Field.
- 3fe12f190568: Fixed memory error when skinning elements with multiple side topologies.
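The FieldBLAS::fill() hazard in the second commit can be sketched without STK. Here is a minimal, hypothetical reconstruction of the bug pattern (not the actual Trilinos code): a "fast path" that zero-fills the whole buffer in one pass is only valid when the stride is 1 and the offset is 0.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Fill one logical component of an interleaved field: touch only the
// slots at offset, offset + stride, offset + 2*stride, ...
void strided_fill(std::vector<double>& data, double value,
                  std::size_t offset, std::size_t stride) {
  for (std::size_t i = offset; i < data.size(); i += stride)
    data[i] = value;
}

// Buggy "optimization": treat value == 0 as "just clear everything".
// Correct only when stride == 1 and offset == 0; otherwise it clobbers
// the other interleaved components it was never asked to touch.
void strided_fill_bad_opt(std::vector<double>& data, double value,
                          std::size_t offset, std::size_t stride) {
  if (value == 0.0) {  // WRONG for stride != 1 or offset != 0
    std::fill(data.begin(), data.end(), 0.0);
    return;
  }
  strided_fill(data, value, offset, stride);
}
```

This is why a field_fill() call with a non-unity stride can corrupt neighboring field components silently; the corruption then surfaces far from the call site, consistent with the field_fill frame at the top of the stack trace.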
Moving my Trilinos version forward far enough to scoop up the memory debugging tool also included most of these fixes, so I never directly observed them being a problem in my local runs of all tests. Still, I think they are known issues that are good candidates for fixing John's seg-fault. Beyond this, I'm quickly running out of ideas.
@djglaze see #1255. I can help you get set up to build this week. I don't doubt that there was a struggle with the CEE resources at the moment, so sorry for the trouble on your end. The old version of spack-manager should be okay, though.
I'm still trying to test some large cases on Frontier, but the latest code runs for the most part for me on smaller problems and other machines, so regardless I agree to not actually reverting this.
Reverts Exawind/nalu-wind#1237
This appears to segfault in field operations when running on Frontier. I think we should revert it.