CliMA / Oceananigans.jl

Segmentation fault filling halo regions with `Partition(y=2)` #3878

Open · glwagner opened this issue 3 weeks ago

glwagner commented 3 weeks ago

Not sure how this is possible, but the following code throws a segfault:

using Oceananigans
using Oceananigans.BoundaryConditions: fill_halo_regions!

# Distribute the domain across two ranks along y
partition = Partition(y=2)
arch = Distributed(GPU(); partition)
x = y = z = (0, 1)
grid = RectilinearGrid(arch; size=(16, 16, 16), x, y, z, topology=(Periodic, Periodic, Bounded))
c = CenterField(grid)

# Filling halo regions triggers the MPI communication that segfaults
fill_halo_regions!(c)

I'm running with

$ mpiexecjl -n 2 julia --project test_interpolate.jl
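
(As an aside, if you don't have the mpiexecjl wrapper, MPI.jl can install it for you; this assumes the default destination, typically ~/.julia/bin:)

julia> using MPI

julia> MPI.install_mpiexecjl()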

(I originally found this error when trying to interpolate a field, but it seems to boil down to a halo-filling issue.)

This is the error I get:

[ Info: Oceananigans will use 32 threads
[ Info: MPI has not been initialized, so we are calling MPI.Init().
[ Info: Oceananigans will use 32 threads
[ Info: MPI has not been initialized, so we are calling MPI.Init().

[116989] signal (11.2): Segmentation fault
in expression starting at /orcd/data/raffaele/001/glwagner/OceananigansPaper/listings/test_interpolate.jl:10
__memcpy_ssse3 at /lib64/libc.so.6 (unknown line)
MPIDI_CH3_iSendv at /orcd/data/raffaele/001/glwagner/.julia/artifacts/e85c0a68e07fee0ee7b19c2abc210b1af2f4771a/lib/libmpi.so (unknown line)
MPIDI_CH3_EagerContigIsend at /orcd/data/raffaele/001/glwagner/.julia/artifacts/e85c0a68e07fee0ee7b19c2abc210b1af2f4771a/lib/libmpi.so (unknown line)
MPID_Isend at /orcd/data/raffaele/001/glwagner/.julia/artifacts/e85c0a68e07fee0ee7b19c2abc210b1af2f4771a/lib/libmpi.so (unknown line)
MPI_Isend at /orcd/data/raffaele/001/glwagner/.julia/artifacts/e85c0a68e07fee0ee7b19c2abc210b1af2f4771a/lib/libmpi.so (unknown line)
MPI_Isend at /orcd/data/raffaele/001/glwagner/.julia/packages/MPI/TKXAj/src/api/generated_api.jl:2151 [inlined]
Isend at /orcd/data/raffaele/001/glwagner/.julia/packages/MPI/TKXAj/src/pointtopoint.jl:66
Isend at /orcd/data/raffaele/001/glwagner/.julia/packages/MPI/TKXAj/src/pointtopoint.jl:70 [inlined]
Isend at /orcd/data/raffaele/001/glwagner/.julia/packages/MPI/TKXAj/src/pointtopoint.jl:70 [inlined]
send_south_halo at /orcd/data/raffaele/001/glwagner/Oceananigans.jl/src/DistributedComputations/halo_communication.jl:317
#fill_south_and_north_halo!#50 at /orcd/data/raffaele/001/glwagner/Oceananigans.jl/src/DistributedComputations/halo_communication.jl:263
fill_south_and_north_halo! at /orcd/data/raffaele/001/glwagner/Oceananigans.jl/src/DistributedComputations/halo_communication.jl:250
unknown function (ip: 0x2aaac8afa8b6)
_jl_invoke at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:2895 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:3077
#fill_halo_event!#40 at /orcd/data/raffaele/001/glwagner/Oceananigans.jl/src/DistributedComputations/halo_communication.jl:208
fill_halo_event! at /orcd/data/raffaele/001/glwagner/Oceananigans.jl/src/DistributedComputations/halo_communication.jl:193
unknown function (ip: 0x2aaac8aefb2e)
_jl_invoke at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:2895 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:3077
#fill_halo_regions!#38 at /orcd/data/raffaele/001/glwagner/Oceananigans.jl/src/DistributedComputations/halo_communication.jl:114
fill_halo_regions! at /orcd/data/raffaele/001/glwagner/Oceananigans.jl/src/DistributedComputations/halo_communication.jl:101 [inlined]
#fill_halo_regions!#37 at /orcd/data/raffaele/001/glwagner/Oceananigans.jl/src/DistributedComputations/halo_communication.jl:90 [inlined]
fill_halo_regions! at /orcd/data/raffaele/001/glwagner/Oceananigans.jl/src/DistributedComputations/halo_communication.jl:87
unknown function (ip: 0x2aaac8ad0ee5)
_jl_invoke at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:2895 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:3077
jl_apply at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
do_call at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/interpreter.c:126
eval_value at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/interpreter.c:223
eval_stmt_value at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/interpreter.c:174 [inlined]
eval_body at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/interpreter.c:617
jl_interpret_toplevel_thunk at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/interpreter.c:775
jl_toplevel_eval_flex at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/toplevel.c:934
jl_toplevel_eval_flex at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/toplevel.c:877
ijl_toplevel_eval_in at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/toplevel.c:985
eval at ./boot.jl:385 [inlined]
include_string at ./loading.jl:2076
_jl_invoke at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:2895 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:3077
_include at ./loading.jl:2136
include at ./Base.jl:495
jfptr_include_46447.1 at /orcd/data/raffaele/001/glwagner/Software/julia-1.10.5/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:2895 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:3077
exec_options at ./client.jl:318
_start at ./client.jl:552
jfptr__start_82798.1 at /orcd/data/raffaele/001/glwagner/Software/julia-1.10.5/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:2895 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:3077
jl_apply at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
true_main at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/jlapi.c:582
jl_repl_entrypoint at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/jlapi.c:731
main at julia (unknown line)
__libc_start_main at /lib64/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
Allocations: 26236174 (Pool: 26209699; Big: 26475); GC: 35

I'll test on the CPU, and then try to see whether this situation is covered by our tests.
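
(The only intended change for the CPU run is the architecture; everything else in the script above stays the same:)

arch = Distributed(CPU(); partition)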

glwagner commented 3 weeks ago

Why don't we test the distributed NonhydrostaticModel here?

https://github.com/CliMA/Oceananigans.jl/blob/9ffbee31bc5a2fa38dd93fa1594b94cddaebba8c/test/test_distributed_models.jl#L451-L456

or are there tests elsewhere?

glwagner commented 3 weeks ago

The test architectures are specified here:

https://github.com/CliMA/Oceananigans.jl/blob/9ffbee31bc5a2fa38dd93fa1594b94cddaebba8c/test/utils_for_runtests.jl#L6-L24

This was hard to find at first.

glwagner commented 3 weeks ago

Are the distributed GPU tests actually running?

I see this:

https://buildkite.com/clima/oceananigans-distributed/builds/4081#0192d4e4-191f-48e1-a943-d82377d8a125/189-1099

Subsequently, it looks like the architecture is Distributed{CPU}.

Do we need a better way to specify the test architectures?

glwagner commented 3 weeks ago

@simone-silvestri

simone-silvestri commented 3 weeks ago

Damn, it looks like the tests on the GPU are not working because CUDA is not loaded properly. I am trying to address this in #3880. A segmentation fault probably means the MPI is not CUDA-aware. Typically, the MPI that ships with MPI_jll is not CUDA-aware. A good way to check is:

julia> using MPI

julia> MPI.has_cuda()
true
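
(If MPI.has_cuda() returns false, one way to switch to a CUDA-aware MPI is via MPIPreferences; a sketch, assuming a CUDA-aware system MPI library is already installed and discoverable:)

julia> using MPIPreferences

julia> MPIPreferences.use_system_binary()  # writes LocalPreferences.toml; restart Julia to pick it up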
glwagner commented 3 weeks ago

Thanks @simone-silvestri, it turns out that I wasn't using CUDA-aware MPI.

#3883 addresses this by adding an error when CUDA-aware MPI is not available, so that users are not confronted with a mysterious segmentation fault (which could be caused by any number of issues, not just a missing CUDA-aware MPI).
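
A minimal sketch of the kind of guard I have in mind (the actual check in #3883 may differ, and validate_cuda_aware_mpi is a hypothetical helper name):

using MPI

function validate_cuda_aware_mpi()
    # MPI.has_cuda() reports whether the MPI library was built with CUDA support
    if !MPI.has_cuda()
        error("Distributed(GPU()) requires CUDA-aware MPI, but MPI.has_cuda() is false. " *
              "Consider MPIPreferences.use_system_binary() to select a CUDA-aware MPI library.")
    end
    return nothing
end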

Since we don't have GPU tests right now, I will also check to make sure that this runs with a proper CUDA-aware MPI.