Exawind / nalu-wind

Solver for wind farm simulations targeting exascale computational platforms
https://nalu-wind.readthedocs.io

Revert "Converted to the new STK simple_fields workflow" #1253

Closed jrood-nrel closed 4 months ago

jrood-nrel commented 5 months ago

Reverts Exawind/nalu-wind#1237

This appears to segfault in field operations when running on Frontier. I think we should revert it.

psakievich commented 5 months ago

I do not want to revert this again. We need to get this in to move forward with trilinos. Let's debug it together.

alanw0 commented 5 months ago

I think fixing this without a revert is a pill we have to swallow. @djglaze let's loop you into the conversation.

Dave is out on vacation until the 10th, and then after that he has jury duty so his availability is going to be unknown...

psakievich commented 5 months ago

> Dave is out on vacation until the 10th, and then after that he has jury duty so his availability is going to be unknown...

Of course... Murphy and his ubiquitous law... Well, I am happy to try to track it down.

djglaze commented 5 months ago

Arrrgh! Yeah, I'm on the road now between destinations. Sorry about that. I ran all the GPU tests successfully, so I guess our coverage might not be great.

I think I'd attack this with a CPU build using the STK_USE_DEVICE_MESH define. Then, I'd see if the GPU tests reveal anything. We can also then attack some tests with valgrind to see if anything comes up.

Dave

jrood-nrel commented 5 months ago

It segfaults on the CPU.

djglaze commented 4 months ago

@jrood-nrel @psakievich I'm back in town and available to help out with debugging this issue. I don't have access to Frontier anymore, which hinders things a bit. Is there a stack trace from the seg-fault? The last one was figure-outable from just the stack trace, so hopefully that will be enough here.

I'm assuming this failure was on a huge problem. Is there any way you can post the input deck somewhere, so that I can see what models were running? The mesh would be great, too, although I'm guessing it's too big to easily share.

psakievich commented 4 months ago

@djglaze last I checked with @jrood-nrel, the failure was only observed so far when running the exawind driver. I was not able to reproduce locally. It had to do with field accessors, so I don't think it is related to the size of the problem.

djglaze commented 4 months ago

> @djglaze last I checked with @jrood-nrel, the failure was only observed so far when running the exawind driver. I was not able to reproduce locally. It had to do with field accessors, so I don't think it is related to the size of the problem.

Thanks, @psakievich. I'm really out of the loop on this project, apparently. I don't know what "exawind driver" means. Is this a particular simulation configuration?

That makes sense that it might not be directly due to the size of the problem. I'm betting it's something tricky with a Field that's registered with different sizes on different parts of the domain, maybe due to mixed element topologies. It's probably a combination of things that we don't have representation for in our unit/regression tests. I've got enough experience with these simple_fields changes over the last couple years that there's a slim chance I can think it through if I can get my hands on a stack trace.

jrood-nrel commented 4 months ago

#0  0x00007fffe7abc60f in void stk::mesh::field_fill<double>(double, stk::mesh::FieldBase const&) ()
   from /lustre/orion/cfd162/proj-shared/jrood/exawind-manager/opt/linux-sles15-zen3/2024-03-29/linux-sles15-zen3/clang-15.0.0/nalu-wind-master-r4x5mc4j5ecgltvibvpdt76leudodj3b/lib/libnalu.so
#1  0x00007fffe8210951 in sierra::nalu::NodalGradAlgDriver<stk::mesh::Field<double, void, void, void, void, void, void, void> >::pre_work() ()
   from /lustre/orion/cfd162/proj-shared/jrood/exawind-manager/opt/linux-sles15-zen3/2024-03-29/linux-sles15-zen3/clang-15.0.0/nalu-wind-master-r4x5mc4j5ecgltvibvpdt76leudodj3b/lib/libnalu.so
#2  0x00007fffe820890e in sierra::nalu::NgpAlgDriver::execute() ()
   from /lustre/orion/cfd162/proj-shared/jrood/exawind-manager/opt/linux-sles15-zen3/2024-03-29/linux-sles15-zen3/clang-15.0.0/nalu-wind-master-r4x5mc4j5ecgltvibvpdt76leudodj3b/lib/libnalu.so
#3  0x00007fffe7bd3527 in sierra::nalu::SpecificDissipationRateEquationSystem::assemble_nodal_gradient() ()
   from /lustre/orion/cfd162/proj-shared/jrood/exawind-manager/opt/linux-sles15-zen3/2024-03-29/linux-sles15-zen3/clang-15.0.0/nalu-wind-master-r4x5mc4j5ecgltvibvpdt76leudodj3b/lib/libnalu.so
#4  0x00007fffe7b99aa8 in sierra::nalu::ShearStressTransportEquationSystem::solve_and_update() ()
   from /lustre/orion/cfd162/proj-shared/jrood/exawind-manager/opt/linux-sles15-zen3/2024-03-29/linux-sles15-zen3/clang-15.0.0/nalu-wind-master-r4x5mc4j5ecgltvibvpdt76leudodj3b/lib/libnalu.so
#5  0x00007fffe7a623f4 in sierra::nalu::EquationSystems::solve_and_update() ()
   from /lustre/orion/cfd162/proj-shared/jrood/exawind-manager/opt/linux-sles15-zen3/2024-03-29/linux-sles15-zen3/clang-15.0.0/nalu-wind-master-r4x5mc4j5ecgltvibvpdt76leudodj3b/lib/libnalu.so
#6  0x00007fffe7b75f2f in sierra::nalu::Realm::nonlinear_iterations(int) ()
   from /lustre/orion/cfd162/proj-shared/jrood/exawind-manager/opt/linux-sles15-zen3/2024-03-29/linux-sles15-zen3/clang-15.0.0/nalu-wind-master-r4x5mc4j5ecgltvibvpdt76leudodj3b/lib/libnalu.so
#7  0x00007fffe7b7491b in sierra::nalu::Realm::advance_time_step() ()
   from /lustre/orion/cfd162/proj-shared/jrood/exawind-manager/opt/linux-sles15-zen3/2024-03-29/linux-sles15-zen3/clang-15.0.0/nalu-wind-master-r4x5mc4j5ecgltvibvpdt76leudodj3b/lib/libnalu.so
#8  0x0000000000219f6f in exawind::NaluWind::advance_timestep(unsigned long) ()
#9  0x0000000000224ac4 in exawind::ExawindSolver::call_advance_timestep(unsigned long, bool) ()
#10 0x0000000000222a14 in exawind::OversetSimulation::run_timesteps(int, int, int) ()
#11 0x000000000022c40e in main ()

psakievich commented 4 months ago

> Thanks, @psakievich. I'm really out of the loop on this project, apparently. I don't know what "exawind driver" means. Is this a particular simulation configuration?

@djglaze the exawind-driver is the top-level code that is used to couple nalu-wind and amr-wind. That code links against libnalu.so and reconstructs the guts of a Nalu::Simulation object's operations so they can be interwoven with amr-wind's solve process.

In the stack trace above you can see the transition at the point where the leading namespace switches from exawind to sierra.
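
For readers following along, here is a minimal sketch of how that namespace boundary could be located automatically. It assumes gdb-style frame lines like the ones posted above; the helper names (`top_namespace`, `namespace_transitions`) are hypothetical, and taking the first `name::` in a symbol is only a heuristic, not a real demangler. The sample frames are #6 through #9 copied from the trace, with the `from ...` lines omitted.

```python
import re

# gdb frame line: "#<n>  0x<address> in <demangled symbol>"
FRAME_RE = re.compile(r"^#(\d+)\s+0x[0-9a-fA-F]+ in (.+)$")


def top_namespace(symbol):
    """Heuristic: the first 'name::' in the symbol, or None (e.g. for main)."""
    match = re.search(r"([A-Za-z_]\w*)::", symbol)
    return match.group(1) if match else None


def namespace_transitions(trace):
    """Return pairs of adjacent frames whose leading namespace differs."""
    frames = []
    for line in trace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:  # lines that are not frames (such as "from ...") are skipped
            frames.append((int(match.group(1)), top_namespace(match.group(2))))
    return [(a, b) for a, b in zip(frames, frames[1:]) if a[1] != b[1]]


# Frames #6 to #9 from the trace above.
TRACE = """\
#6  0x00007fffe7b75f2f in sierra::nalu::Realm::nonlinear_iterations(int) ()
#7  0x00007fffe7b7491b in sierra::nalu::Realm::advance_time_step() ()
#8  0x0000000000219f6f in exawind::NaluWind::advance_timestep(unsigned long) ()
#9  0x0000000000224ac4 in exawind::ExawindSolver::call_advance_timestep(unsigned long, bool) ()
"""

for inner, outer in namespace_transitions(TRACE):
    print(f"frame #{inner[0]} ({inner[1]}) is called from frame #{outer[0]} ({outer[1]})")
```

On these four frames the only boundary reported is between frame #7 (sierra) and frame #8 (exawind), which is where the driver hands control to libnalu.so.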

djglaze commented 4 months ago

I've been studying the stack trace that @jrood-nrel provided, and the code looks solid to me. I'm starting to wonder if the issue is elsewhere, and it's just manifesting at this location.

I've been unable (with only a modest amount of effort) to get current versions of spack-manager or exawind-manager working for my builds. The documentation is not current, and I was unable to view the Slack channel with more information because my accounts have lapsed. So, I'm running a version of spack-manager from a couple months ago, before the big modularization refactor, and I can't see how things are currently being built. Are you guys still running a version of Trilinos pinned to 2023-02-28?

I hot-wired my local Trilinos version to be new enough that I could attack it with the STK manual Field memory poisoning tool (that @psakievich is likely familiar with from the Sierra stk-field-asan dashboard line), and I found a few minor issues among all of the regression and unit tests. PR #1256 fixes what I found, although those issues are unlikely to be the cause of this seg-fault.
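
As a rough illustration of the general idea behind memory poisoning (this is not the STK tool and not its API), the toy sketch below marks the unused tail of an allocation as off limits, so that an out-of-range access fails at the offending line instead of silently corrupting data. The class `PoisonedBuffer` and the function `fill` are hypothetical names invented for this example.

```python
class PoisonedBuffer:
    """Toy field storage: slots at or beyond `used` are treated as poisoned."""

    def __init__(self, capacity, used):
        self.capacity = capacity  # slots allocated
        self.used = used          # slots that hold valid field data
        self._data = [0.0] * capacity

    def _check(self, index):
        # Fail loudly on any access outside the valid range.
        if not 0 <= index < self.used:
            raise IndexError(
                f"poisoned access at slot {index}; only {self.used} slots are valid"
            )

    def __getitem__(self, index):
        self._check(index)
        return self._data[index]

    def __setitem__(self, index, value):
        self._check(index)
        self._data[index] = value


def fill(buf, value, count):
    """Write `value` into the first `count` slots of `buf`."""
    for i in range(count):
        buf[i] = value


buf = PoisonedBuffer(capacity=8, used=5)
fill(buf, 1.0, buf.used)          # correct: stays inside the valid range
try:
    fill(buf, 1.0, buf.capacity)  # bug: loops over the allocation size
except IndexError as err:
    print(err)
```

Without the check, the second call would complete quietly and overwrite slots the field does not own. That is the kind of latent bug that can show up later as a crash somewhere unrelated.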

I've identified four STK commits from after 2023-02-28 that fix various potential memory corruption issues. These pre-existing issues were discovered while running all of the Sierra tests, using both the simple_fields changes (that you have) and the new variable-capacity Buckets changes (that you don't have). I think I'd strongly recommend moving your Trilinos version to something much newer, to get the benefit of these fixes as well as others that I undoubtedly missed. Something after 2023-07-12, Trilinos SHA 8be8abe06fec is preferred.

If you need to stay pinned to 2023-02-28, then I'd recommend applying 4 new Trilinos patches that correspond to these Sierra commits (that @psakievich has access to):

Moving my Trilinos version forward far enough to scoop up the memory debugging tool also included most of these fixes, so I never directly observed them being a problem in my local runs of all tests. Still, I think they are known issues that are good candidates for fixing John's seg-fault. Beyond this, I'm quickly running out of ideas.

psakievich commented 4 months ago

@djglaze see #1255. I can help you get set up to build this week. I don't doubt that the cee resources are a struggle at the moment, so sorry for the trouble on your end. The old version of spack-manager should be okay, though.

jrood-nrel commented 4 months ago

I'm still trying to test some large cases on Frontier, but the latest code runs for the most part for me on smaller problems and other machines, so regardless, I agree that we should not actually revert this.