enzo-project / enzo-e

A version of Enzo designed for exascale and built on charm++.

PPM Crashing in SMP Mode #124

Open WillHicks96 opened 3 years ago

WillHicks96 commented 3 years ago

I am getting crashes in problems that use PPM whenever I run in SMP mode with >1 PE per logical node. The error happens specifically in flux_twoshock, where I get a negative energy message followed by a segfault.

With input/PPM/method_ppm-8.in, for example, running with the following command gives me this output:

charmrun +p4 bin/enzo-e +ppn 2 input/PPM/method_ppm-8.in

If I run with +ppn 1 instead, the test runs through without any errors.

jobordner commented 3 years ago

Have you tried with flux correction turned off? Also, Initial : value initialization is known to be flaky in SMP mode.

WillHicks96 commented 3 years ago

@jobordner Yes, I'm getting the same error even with flux correction turned off for the method_ppm-8.in test.

I'm getting this behavior with test_cosmo-dd-fc0.in as well, which doesn't use the value parameter for initialization.

mabruzzo commented 3 years ago

I might be stating the obvious here, but I'd be a little surprised if this is actually an issue with PPM. I suspect that SMP issues might be leaving the field values in an invalid state, which then causes the error.

Try out this file. This is basically the simplest possible test without using the "value" initializer (I'm fairly confident that the "inclined_wave" initializer is thread-safe).

WillHicks96 commented 3 years ago

@mabruzzo Still getting the same thing! Here's the output from your test: inclined_wave_smp_error.txt

mabruzzo commented 3 years ago

Thanks for running that. I stand corrected: the issue definitely seems to be related to the PPM Solver. I seem to recall that it worked at some point. I wonder if it broke with PR #24 or #48 or if it was something else.

I was able to reproduce the error locally. I tried running another version of the same simulation where I replaced the PPM Solver with the "hydro-only" version of VL+CT. I haven't verified that the results are what we would expect, but it definitely doesn't crash. I plan to come back and look at this more this afternoon.

Note: you might not be able to run the newer version of the test unless your branch has some recent commits from master. If you're interested, you can check whether the following command works: charmrun ++ppn 2 bin/enzo-e ./input/vlct/HD_linear_wave/method_vlct_soundN32.in

As an aside: It would be nice to automatically run some tests in the future to make sure this functionality doesn't break. It would probably be straightforward for me to adapt my Linear Wave Convergence tests to check this.

jobordner commented 3 years ago

I'm wondering if it's related to using Intel compilers; these problems run fine on my PC with GNU compilers but fail on Frontera with Intel. Also, others (@pgrete) have run into correctness issues on Pleiades when using Intel compilers there (after heroic efforts to get it to compile in the first place). For the record, I tried turning off optimization (-O0) but still had problems on Frontera with Intel.

mabruzzo commented 3 years ago

I was able to replicate the issue on my PC with the GNU compiler.

I'm away from that computer right now, but I'll post the compiler version later today.

EDIT: On my computer I have GCC version 9.3.0 (the build for Ubuntu 20.04)

jobordner commented 3 years ago

@mabruzzo What problem did you run and can you post the output?

mabruzzo commented 3 years ago

Sorry for the delay. It turns out that I can't exactly replicate Will's problem, but I do seem to be able to replicate problems with the PPM solver. Sorry there's so much here; I tried to be a little systematic about this.

method_ppm-8.in

First I tried using: charmrun +n1 ++ppn 3 bin/enzo-e input/PPM/method_ppm-8.in

I got a few different outputs depending on when I ran it (there's clearly a race condition):

  1. This first output demonstrates an issue with the "value" initializer. This is our "known problem", but I'm happy to have gotten a "smoking gun" example of this type of failure.
  2. The second output looks like this. This might just be a problem with how my laptop is set up.
  3. I think that this third output might be a variant of the second one (maybe it just happened on a different PE?)

If I change the command slightly, to read charmrun ++p 2 bin/enzo-e input/PPM/method_ppm-8.in, I now get a different error.

I also tried out the command: charmrun ++p 4 bin/enzo-e input/PPM/method_ppm-8.in which gives an error that says something about ppm in the traceback (although it could be unrelated). I get these 2 variants of the same error: variant 1 and variant 2.

unigrid_test.in

charmrun +n1 ++ppn 3 bin/enzo-e unigrid_test.in I ran this multiple times, and I only ever seem to get a single error.

charmrun ++p 2 bin/enzo-e unigrid_test.in Here are a few error variants: variant 1 & variant 2. The error seems to occur at variable times.

charmrun ++p 4 bin/enzo-e unigrid_test.in Here are 2 variants of the error: variant 1 & variant 2. Again, the error seems to occur at variable times.

unigrid_test_vl.in

All three variations of the following problem are able to run successfully to completion:

I still need to verify correctness of these last few runs. However, this seems pretty robust against segmentation faults.

jobordner commented 3 years ago

Thanks, Matt! I was able to reproduce the errors with unigrid_test.in. It looks like it's still due to accessing the flux arrays in Fortran, and it does indeed fail with GNU compilers. (To verify that it's from accessing flux arrays, try e.g. changing "do n=0, nsubgrids-1" in [xyz]euler_sweep.F to "do n=1, nsubgrids-1" to bypass the loops that copy fluxes.) I'm working on a fix for this.

mabruzzo commented 3 years ago

Great! I'm glad to hear it

jobordner commented 3 years ago

I think I've got Enzo-E working correctly in SMP mode, including the issues related to fluxes, to using expressions in parameter files, and to "PPM" (actually refresh). It also seems to run correctly on Pleiades [@pgrete @peeples] (only GNU compilers tested so far; the adapt-L5 regression test ran without grid effects popping up). The changes are in my new-output branch (#97), which is still way behind enzo-project/main. I'm starting on merging changes now, which may take a while.

pgrete commented 3 years ago

That's great news. It may be worth trying to get #97 in through a very swift review that focuses more on documentation/testing rather than code -- and then fix open items in separate smaller PRs.

bwoshea commented 3 years ago

Totally agreed with @pgrete on this one - we'll get PR #97 merged ASAP. In the future, I also agree that smaller PRs would be very helpful from a review standpoint. It's much easier to get people to agree to look at them!

WillHicks96 commented 3 years ago

I just tried out the newest changes in the new-output branch using Intel compilers on Frontera, and the original issue of crashes during PPM in SMP mode is still happening for me with that setup (I tried running the method_ppm-8.in test with and without flux correction).

I'm currently working on rebuilding with GNU compilers to see how that goes!

jobordner commented 3 years ago

@WillHicks96 That's annoying. Is that with smp = 1 in SConstruct?

WillHicks96 commented 3 years ago

@jobordner Yep! It was definitely running in SMP mode

jobordner commented 3 years ago

Can you post the output?

WillHicks96 commented 3 years ago

Here's the output: ppm_error.txt

This is running on two logical nodes with charmrun +p4 bin/enzo-p +ppn 2 input/method_ppm-8.in. If I use +ppn 1 instead, it runs without issue from what I can tell (just like before).

Running git show on my installation shows the last commit message as "Fixed all remaining known issues related to SMP mode", so everything should be up to date.

jobordner commented 3 years ago

@WillHicks96 I tweaked the parameters a bit; the Method : flux_correct : single_array parameter may not have been set to the default (true) unless flux correction was explicitly included.
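For reference, setting it explicitly would look something like the following parameter-file fragment (a sketch based only on the Method : flux_correct : single_array path named above; the exact nesting and syntax may differ in your version):

```
Method {
   flux_correct {
      single_array = true;
   }
}
```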

WillHicks96 commented 3 years ago

@jobordner That also didn't seem to help (at least with Intel compilers). I've been having trouble building Enzo-E with GNU compilers on Frontera, so I haven't been able to fully test it yet.

mabruzzo commented 2 years ago

@jobordner @WillHicks96 The problems with SMP mode and PPM have all been addressed, right? So can we close this?

WillHicks96 commented 2 years ago

@mabruzzo The workaround that @jobordner ultimately found for this was to compile with a combination of icc and gfortran (instead of ifort). I've been able to run simulations with PPM in SMP mode with this setup, so I think it's okay to close this issue for now.

mabruzzo commented 2 years ago

I wanted to post an update. Using a branch with the changes from PR #256:

All of these issues were documented using the GNU compilers.