WillHicks96 opened 3 years ago
I am getting crashes in problems that use PPM whenever I run in SMP mode with >1 PE per logical node. The error happens specifically in flux_twoshock, where I get a negative energy message followed by a segfault.
With input/PPM/method_ppm-8.in, for example, running with the following command gives me this output:
If I run with +ppn 1 instead, the test runs through without any errors.
Have you tried with flux correction turned off? Also, Initial : value initialization is known to be flaky in SMP mode.
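In case it's useful, here's a minimal sketch of what I mean, assuming flux correction is enabled by including the "flux_correct" method in the Method list (my reading of how the Method list works; not copied from an actual test file):

    # Sketch only: disable flux correction by omitting "flux_correct"
    # from the Method list.
    Method {
       list = ["ppm"];   # with flux correction: list = ["ppm", "flux_correct"]
    }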
@jobordner Yes, I'm getting the same error even with flux correction turned off for the method_ppm-8.in test. I'm getting this behavior with test_cosmo-dd-fc0.in as well, which doesn't use the value parameter for initialization.
I might be stating the obvious here, but I'd be a little surprised if this is actually an issue with PPM. I suspect that SMP issues might be leaving the field values in an invalid state, which causes the error.
Try out this file. This is basically the simplest possible test without using the "value" initializer (I'm fairly confident that the "inclined_wave" initializer is thread-safe).
@mabruzzo Still getting the same thing! Here's the output from your test: inclined_wave_smp_error.txt
Thanks for running that. I stand corrected: the issue definitely seems to be related to the PPM Solver. I seem to recall that it worked at some point. I wonder if it broke with PR #24 or #48 or if it was something else.
I was able to reproduce the error locally. I tried running another version of the same simulation where I replaced the PPM Solver with the "hydro-only" version of VL+CT. I haven't verified that the results are what we would expect, but it definitely doesn't crash. I plan to come back and look at this more this afternoon.
Note: you might not be able to run the newer version of the test unless your branch has some recent commits from master. If you're interested, you can check whether the following command works: charmrun ++ppn 2 bin/enzo-e ./input/vlct/HD_linear_wave/method_vlct_soundN32.in
As an aside: It would be nice to automatically run some tests in the future to make sure this functionality doesn't break. It would probably be straightforward for me to adapt my Linear Wave Convergence tests to check this.
I'm wondering if it's related to using Intel compilers: these problems run fine on my PC with GNU compilers, but fail on Frontera with Intel. Also, others (@pgrete) have run into correctness issues on Pleiades when using Intel compilers there (after heroic efforts to get it to compile in the first place). For the record, I tried turning off optimization (-O0) but still had problems on Frontera with Intel.
I was able to replicate the issue on my PC with the GNU compiler.
I'm away from that computer right now, but I'll post the compiler version later today.
EDIT: On my computer I have GCC version 9.3.0 (the build for Ubuntu 20.04).
@mabruzzo What problem did you run and can you post the output?
Sorry for the delay. It turns out that I can't exactly replicate Will's problem, but I do seem to be able to replicate problems with the PPM solver. Sorry there's so much here. I tried to be a little systematic about this.
First I tried using: charmrun +n1 ++ppn 3 bin/enzo-e input/PPM/method_ppm-8.in
I got a few different outputs depending on when I ran it (there's clearly a race condition). At least one of them is an error from the "value" initializer. This is our "known problem", but I'm happy to have gotten a "smoking gun" example of this type of failure. If I change the input command slightly, to read:
charmrun ++p 2 bin/enzo-e input/PPM/method_ppm-8.in
Now I get a different error.
I also tried out the command: charmrun ++p 4 bin/enzo-e input/PPM/method_ppm-8.in
which gives an error that says something about ppm in the traceback (although it could be unrelated). I get these 2 variants of the same error: variant 1 and variant 2.
Next, I tried running unigrid_test.in:
charmrun +n1 ++ppn 3 bin/enzo-e unigrid_test.in
I ran this multiple times, and I only seem to get a single error.
charmrun ++p 2 bin/enzo-e unigrid_test.in
Here are a few error variants: variant 1 & variant 2. It seems like the error can occur at variable times.
charmrun ++p 4 bin/enzo-e unigrid_test.in
Here are 2 variants of the error: variant 1 & variant 2. Again, it seems like this error can occur at variable times.
All three variations of the following problem are able to run to completion successfully:
charmrun +n1 ++ppn 3 bin/enzo-e unigrid_test_vl.in
charmrun ++p 2 bin/enzo-e unigrid_test_vl.in
charmrun ++p 4 bin/enzo-e unigrid_test_vl.in
I still need to verify correctness of these last few runs. However, this seems pretty robust against segmentation faults.
Thanks, Matt! I was able to reproduce the errors with unigrid_test.in. It looks like it's still due to accessing the flux arrays in Fortran, but it does indeed fail with GNU compilers. (To verify that it's from accessing flux arrays, try e.g. changing "do n=0, nsubgrids-1" in [xyz]euler_sweep.F to "do n=1, nsubgrids-1" to bypass the loops that copy fluxes.) I'm working on a fix for this.
Great! I'm glad to hear it
I think I've got Enzo-E working correctly in SMP mode, including the issues related to fluxes, to using expressions in parameter files, and to "PPM" (actually refresh), and it seems to run correctly on Pleiades [@pgrete @peeples] (only GNU compilers tested so far: the adapt-L5 regression test ran without grid effects popping up). Changes are in my new-output branch (#97), which is still way behind enzo-project/main. I'm starting on merging changes now, which may take a while.
That's great news. It may be worth trying to get #97 in through a very swift review that focuses more on documentation/testing rather than code -- and then fix open items in separate smaller PRs.
Totally agreed with @pgrete on this one - we'll get PR #97 merged ASAP. In the future, I also agree that smaller PRs would be very helpful from a review standpoint. It's much easier to get people to agree to look at them!
I just tried out the newest changes in the new-output branch using Intel compilers on Frontera, and the original issue of crashes during PPM in SMP mode is still happening for me with that setup (I tried running the method_ppm-8.in test with and without flux correction).
I'm currently working on rebuilding with GNU compilers to see how that goes!
@WillHicks96 That's annoying. That's with smp = 1 in SConstruct?
@jobordner Yep! It was definitely running in SMP mode
Can you post the output?
Here's the output: ppm_error.txt
This is running on two logical nodes with charmrun +p4 bin/enzo-p +ppn 2 input/method_ppm-8.in
If I use +ppn 1 instead, it runs without issue from what I can tell (just like before).
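For anyone skimming, my understanding of that launch geometry (the command is copied from above; the interpretation of the flags is mine):

    # +p4 -> 4 PEs total; +ppn 2 -> 2 worker PEs per process,
    # i.e. 2 logical nodes x 2 worker threads each (plus a comm
    # thread per process in SMP builds).
    charmrun +p4 bin/enzo-p +ppn 2 input/method_ppm-8.in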
Running git show on my installation shows the last commit message as "Fixed all remaining known issues related to SMP mode", which means everything should be up to date.
@WillHicks96 I tweaked the parameters a bit: the Method : flux_correct : single_array parameter may not have been set to the default (true) unless flux correction was explicitly included.
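Spelled out as a parameter-file fragment, the explicit setting I mean looks roughly like this (shape inferred from the Method : flux_correct : single_array path above; not copied from the repo):

    # Sketch only: explicitly set the parameter discussed above.
    Method {
       flux_correct {
          single_array = true;
       }
    }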
@jobordner That also didn't seem to help (at least with Intel compilers). I've been having trouble building Enzo-E with GNU compilers on Frontera, so I haven't been able to fully test it yet.
@jobordner @WillHicks96 The problems with SMP-mode and PPM have all been addressed, right? So we can close this?
@mabruzzo The workaround that @jobordner ultimately found for this was to compile with a combination of icc and gfortran (instead of ifort). I've been able to run simulations with PPM in SMP mode with this setup, so I think it's okay to close this issue for now.
I wanted to post an update. Using a branch with the changes from PR #256:
- The PPM-8 test failed after 1.67 sec. Unfortunately, I forgot to check the outputs, and I have found this test failure to be difficult to reproduce. For reference, the same test takes ~4.24 seconds to run when it completes. This suggests that a race-condition error may still remain.
- The AmrPpm test fails. This is very reproducible. I think this might be an issue with data dumps.
- All of these issues were documented using the GNU compilers.