Exawind / amr-wind

AMReX-based structured wind solver
https://exawind.github.io/amr-wind

Segfaults on initial MAC projection #1008

Closed gyalla closed 1 month ago

gyalla commented 3 months ago

I am encountering segfault issues with AMR-Wind on the initial MAC projection step, which aborts with the following message:

System                     Iters      Initial residual        Final residual
  ----------------------------------------------------------------------------

L-inf norm MAC vels: before MAC projection
..............................................................................
Max u:          10.71283853 |  Location (x,y,z):       3765,     1822.5,      312.5
Min u:         -10.03829912 |  Location (x,y,z):          0,          0,          0
Max v:          3.259859803 |  Location (x,y,z):          0,          0,          0
Min v:         -3.510260129 |  Location (x,y,z):     4332.5,        660,      552.5
Max w:          2.517395188 |  Location (x,y,z):    2253.75,     916.25,      102.5
Min w:         -1.063436419 |  Location (x,y,z):      912.5,     1112.5,         20
..............................................................................

  MAC_projection                 6           4.431624589       1.873963523e-06
[ec749:54683] *** Process received signal ***
[ec749:54683] Signal: Segmentation fault (11)
[ec749:54683] Signal code: Address not mapped (1)
[ec749:54683] Failing at address: 0x1d74b000
[ec749:54683] [ 0] /usr/lib64/libpthread.so.0(+0xf630)[0x2aaab2791630]
[ec749:54683] [ 1] amr_wind[0x505a9a]
[ec749:54683] [ 2] amr_wind[0x50631b]
[ec749:54683] [ 3] amr_wind[0x50fcf1]
[ec749:54683] [ 4] amr_wind[0x70a664]
[ec749:54683] [ 5] amr_wind[0x448f33]
[ec749:54683] [ 6] amr_wind[0x44ba9a]
[ec749:54683] [ 7] amr_wind[0x44e9c0]
[ec749:54683] [ 8] amr_wind[0x43e699]
[ec749:54683] [ 9] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaab29c0555]
[ec749:54683] [10] amr_wind[0x447cdb]
[ec749:54683] *** End of error message ***

The issue is dependent on the node/processor count. I have compiled the results of multiple tests that I and others (including @lawrenceccheung and @jfrederik-nrel) have run in the following table:

Machine  | Processor count | Result
---------|-----------------|-------
Frontier | 736 GPU         | Works
Frontier | 1792 CPU        | Failed
Eagle    | 512 CPU         | Failed
SNL      | 288 CPU         | Works
SNL      | 512 CPU         | Failed
SNL      | 736 CPU         | Works
SNL      | 1792 CPU        | Failed
SNL      | 2304 CPU        | Works

From these tests we have evidence that:

  1. The issue is independent of hardware/architecture: if a given processor count works (or fails) on one machine, it works (or fails) on the others as well.
  2. The issue is independent of whether the build targets GPUs or CPUs.

Additionally, for a specific case, I have tested all permutations of including/excluding refinement zones, actuators, and boundary I/O, and the results are summarized in the following table:

Refinement | Actuator | Boundary I/O | Result
-----------|----------|--------------|-------
off        | off      | on           | works
on         | off      | on           | fails
on         | on       | on           | fails
on         | on       | off          | works
off        | on       | off          | works
off        | off      | off          | works
on         | off      | off          | works
off        | on       | on           | works

It seems that in this case the issue requires both refinement zones and boundary I/O, but not necessarily the actuator. I have attached the input file and output file for the case that demonstrates this issue. Further, slight adjustments to the refinement zones can make the segfaults appear or disappear, which suggests the problem is indeed related to the MLMG solver.

MedWS_LowTI_farmrun1_inp.txt slurm-18767375_out.txt

The issue may be related to some of the other issues reported, e.g., https://github.com/Exawind/amr-wind/issues/941 and https://github.com/Exawind/amr-wind/issues/886

Gopal Yalla

jfrederik-nrel commented 3 months ago

I can add that on both Eagle and Kestrel, this case runs for me with 736 CPUs (24 nodes on Eagle, 16 nodes on Kestrel) but not with 512 CPUs. It therefore does look like the set of CPU counts that works for this job is the same regardless of the machine it runs on.

marchdf commented 3 months ago

@jfrederik-nrel does the 512 case work if you turn off the boundary plane IO? Or the refinement? As in does your 512 case also respond like that matrix @gyalla posted?

marchdf commented 3 months ago

Another question... what is the refinement like? Does it touch any BC? Can you post a pic of the farm (looks like this was made with amrwind-frontend)?

lawrenceccheung commented 3 months ago

This is actually using the same case as https://github.com/Exawind/amr-wind/issues/1009; the domain should resemble the attached image, but with only levels 0 and 1. If you turn off refinement (only level 0), then the MLMG algorithm usually works without issues, which is what I've observed in other cases as well.

Lawrence

marchdf commented 3 months ago

And in the z-direction? It touches the bottom BC?

gyalla commented 3 months ago

This case was indeed made with the amrwind-frontend. Here is an image of the refinement zone:

[image: refinement zone generated by amrwind-frontend]

The level 1 mesh does extend below the bottom z-boundary. It does seem there is no segfault issue in this particular case if I change the input file to:

tagging.Farm_level_0_zone.Farm_level_0_zone.origin = 1080.0 280.0 0.0

It's hard for me to say whether that's just because the MLMG solver is sensitive, or whether the issue is tied to refinement zones crossing the boundary.
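
For reference, the refinement zone is specified through AMR-Wind's geometry-refinement tagging inputs. Below is a minimal sketch of what that block might look like (the field names follow the box-shape tagging syntax; the axis vectors are illustrative placeholders, not the values from the attached input file):

tagging.labels = Farm_level_0_zone
tagging.Farm_level_0_zone.type = GeometryRefinement
tagging.Farm_level_0_zone.shapes = Farm_level_0_zone
tagging.Farm_level_0_zone.Farm_level_0_zone.type = box
# box origin; raising the z component to 0.0 keeps the zone from extending below the domain
tagging.Farm_level_0_zone.Farm_level_0_zone.origin = 1080.0 280.0 0.0
# box edge vectors (placeholder values)
tagging.Farm_level_0_zone.Farm_level_0_zone.xaxis = 3000.0 0.0 0.0
tagging.Farm_level_0_zone.Farm_level_0_zone.yaxis = 0.0 2000.0 0.0
tagging.Farm_level_0_zone.Farm_level_0_zone.zaxis = 0.0 0.0 500.0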

marchdf commented 3 months ago

That's interesting. Can you get anything to fail if you start the refinement at z=0 (for all those cases above)?

gyalla commented 3 months ago

Right now I can only confirm that the refinement on, boundary I/O on, and actuator off case seems to work with 512 CPUs if the refinement is adjusted to start at z=0. I will check the other cases and let you know.

@jfrederik-nrel, did any of your refinement zones happen to intersect the boundary when you were experiencing this segfault issue?

jfrederik-nrel commented 3 months ago

@gyalla, in the z-direction my refinement starts at the bottom boundary (z=0), but apart from that my refinements are similar to yours in the sense that they sit roughly in the middle of the domain in the x- and y-directions. I don't remember whether I ran it without a turbine, but with a turbine it definitely didn't run on 512 CPUs with these settings.

lawrenceccheung commented 3 months ago

I've often wondered about the effect of a refinement zone intersecting the bottom boundary. We have other cases that ran successfully, without hitting issues in the MLMG solver, even though the refinements went all the way to (or past) the ground: https://github.com/lawrenceccheung/AWAKEN_summit_setup/blob/main/NeutralABL_turbine1/AMR.5kmX5km_turbine1/farmrun1.ipynb However, we didn't try all possible node/CPU configurations, so it might just be a coincidence that it worked.

Lawrence

marchdf commented 3 months ago

I would be very surprised if telling it to refine past the bottom BC had an effect. Looking at the code for the geom/box refinement, it should just be ignored. But there might be a subtlety I am missing...

It would be fantastic if we could find a reproducer at small node counts.

gyalla commented 2 months ago

One issue I forgot to bring up during our call just now. I'm rerunning the uniform and precursor cases again to ensure the results do not change when going from OpenFAST 3.4.1 to OpenFAST 3.5.3. I'm encountering segfaults in the initial MAC projection step and am only able to find one (relatively small) node/core count that will run the precursor case.

I've tried to summarize my setups in the following two slides.

[screenshots: two slides summarizing the setups, 2024-05-06]

It seems like I'm encountering the segfaults more frequently than before, but perhaps I'm just getting unlucky with my node/core count selection. Several of the previous node/core counts no longer seem to work for me. Perhaps it's not too surprising that there are differences, given that the compilers, AMR-Wind versions, and OpenFAST versions are all different.

Did any of the previous debugging efforts from the last few months provide insight into this issue? @jfrederik-nrel, have you encountered this problem again or tried running with OpenFAST 3.5.3?

Best, Gopal

jfrederik-nrel commented 2 months ago

Hi Gopal,

I haven't tried running AMR with OF v3.5 yet, but I am also seeing a lot of segmentation faults when running AMR on Kestrel recently. Core/node counts that used to work (both on Eagle and on Kestrel, which got updated recently) are now getting seg faults at the first MAC step, or sometimes even earlier. I have not yet been able to get it running with any kind of core/node count, but I could try the 8x36 set-up that you are using to see if that works.

I'm working with a lot of people here at NREL to figure out where these segmentation faults are coming from, but so far without luck. I have not been able to run any kind of precursor simulation with OF using more than 2 nodes on Kestrel. I'm really not sure whether it's an issue with Kestrel or with AMR/OF, but it is definitely a big concern for me right now, so if anyone can commit time to helping me figure this out, I would very much appreciate it.

Joeri

gyalla commented 2 months ago

Thanks for the update, Joeri, and please let me know if I can help in any way. I'm seeing the issue across Sandia machines, including eclipse, ghost, and doom. There was also one case that ran successfully on eclipse but not on ghost with the same parameters, which suggests (part of) the bug is machine dependent, contrary to our initial findings.

jfrederik-nrel commented 2 months ago

Quick update: it looks like I got 8 nodes, 36 cores running on Kestrel, but not any other node/core combination that I tried. I'm running AMR v1.4.0 with OF v3.4.0.

gyalla commented 2 months ago

Interesting; that combination seems to work across machines. There was some talk today on our end about trying the Intel compilers to see if that would alleviate the issue. Have you explored different compiler options yet?

marchdf commented 2 months ago

It would be helpful to have a stack trace. Can anyone post a stack trace of a debug build?
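
A minimal sketch of one way to get such a trace, assuming a standard CMake build of AMR-Wind (exact modules, compilers, and job launch will differ per machine, and the input-file name below is just a placeholder):

# configure and build a debug executable (option names assume AMR-Wind's usual CMake options)
cmake -B build-debug -DCMAKE_BUILD_TYPE=Debug -DAMR_WIND_ENABLE_MPI=ON -DAMR_WIND_ENABLE_OPENFAST=ON .
cmake --build build-debug -j 16
# rerun the failing case with the debug binary; on a crash, AMReX's signal handler
# should then produce a much more readable backtrace (often written to Backtrace.* files)
srun -n 512 ./build-debug/amr_wind failing_case.inp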

jfrederik-nrel commented 2 months ago

@gyalla I've tried building with intel and gcc, with identical results.

arswalid commented 2 months ago

This is on Kestrel: [screenshot of the stack trace]

marchdf commented 2 months ago

Hmmm, I saw the same on Kestrel in early April. This particular one was annoying because it wouldn't show up with gcc. I am worried we have a compiler bug. In this issue, people are talking about segfaults without OpenFAST turbines in there as well, correct? So maybe we are conflating things.

[screenshot: stack trace seen on Kestrel, 2024-04-05]

lawrenceccheung commented 2 months ago

I'm working on getting a debug build on the Sandia side. Yeah, the stack trace that @arswalid and @marchdf see might be an OpenFAST-related compile problem, and it seems to be failing at a different point than what @gyalla saw.

To be clear, that's still a problem, but possibly a different problem.

Lawrence

WeiqunZhang commented 1 month ago

The issue might be that part of the fine level is outside the domain. We have never considered such cases in amrex.

lawrenceccheung commented 1 month ago

Hi @WeiqunZhang, we saw a similar issue with another case, detailed here: https://github.com/Exawind/amr-wind/issues/941. In that setup, the refinement zones did not go outside the domain, although they did coincide with the lower boundary.

Lawrence

mbkuhn commented 1 month ago

Hi all, let's keep this issue to a single topic. The Kestrel/OpenFAST problems should go in a separate GitHub issue. As for the initial problem, the "Segfaults on initial MAC projection" were not failures of the MAC projection at all, but of a routine that finds the max and min velocities, as well as their locations. This routine is not important, and it is only called when the verbosity is greater than 2. Copy-pasting what I put on issue #941:

UPDATE: Though the problem function has a bug that should be fixed, the immediate solution (requiring no code change) is simply to set incflo.verbose to a lower number. My recommendation for most users is incflo.verbose = 0, because many of the verbose outputs exist to help developers diagnose issues and otherwise just make the log files hard to read.
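
In the input file that is just:

# suppress the extra diagnostics (including the max/min velocity report printed at verbosity > 2)
incflo.verbose = 0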

The bug fix should be coming soon, but you don't need to wait for it if you change that input.