GCLTest.rigid_scaling_edge with a Release/RelWithDebInfo GPU CUDA build

marchdf commented 2 hours ago

Works fine with Debug build but this is what I get with a RelWithDebInfo build:

[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from GCLTest
[ RUN      ] GCLTest.rigid_scaling_edge
[1]    1567032 segmentation fault (core dumped)  ./unittestX --gtest_filter="GCLTest.rigid_scaling_edge"

Valgrind output:

[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from GCLTest
[ RUN      ] GCLTest.rigid_scaling_edge
==1567161== Jump to the invalid address stated on the next line
==1567161==    at 0x0: ???
==1567161==    by 0x11840B82F: ???
==1567161==    by 0x11840B87F: ???
==1567161==    by 0x7D5A51BF: ???
==1567161==    by 0x1FFEFF1D3F: ???
==1567161==    by 0xE: ???
==1567161==    by 0x6C6569665F70676D: ???
==1567161==    by 0x79627078615F63: ???
==1567161==    by 0xBB499B7: ??? (in /mnt/vdb/home/mhenryde/exawind/exawind-manager/stage/spack-stage-nalu-wind-master-ypr3uezvfcpnu5zjnz37kntzpkmvn4d4/spack-build-ypr3uez/unittestX)
==1567161==    by 0x1183FBA4F: ???
==1567161==    by 0x200040EE7F: ???
==1567161==    by 0x100000012: ???
==1567161==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==1567161==
==1567161==
==1567161== Process terminating with default action of signal 11 (SIGSEGV)
==1567161==  Bad permissions for mapped region at address 0x0
==1567161==    at 0x0: ???
==1567161==    by 0x11840B82F: ???
==1567161==    by 0x11840B87F: ???
==1567161==    by 0x7D5A51BF: ???
==1567161==    by 0x1FFEFF1D3F: ???
==1567161==    by 0xE: ???
==1567161==    by 0x6C6569665F70676D: ???
==1567161==    by 0x79627078615F63: ???
==1567161==    by 0xBB499B7: ??? (in /mnt/vdb/home/mhenryde/exawind/exawind-manager/stage/spack-stage-nalu-wind-master-ypr3uezvfcpnu5zjnz37kntzpkmvn4d4/spack-build-ypr3uez/unittestX)
==1567161==    by 0x1183FBA4F: ???
==1567161==    by 0x200040EE7F: ???
==1567161==    by 0x100000012: ???
==1567161==
==1567161== HEAP SUMMARY:
==1567161==     in use at exit: 1,563,453,009 bytes in 539,439 blocks
==1567161==   total heap usage: 650,348 allocs, 110,909 frees, 3,356,698,534 bytes allocated
==1567161==
==1567161== LEAK SUMMARY:
==1567161==    definitely lost: 82,352 bytes in 553 blocks
==1567161==    indirectly lost: 317,584 bytes in 1,134 blocks
==1567161==      possibly lost: 250,940,627 bytes in 1,218 blocks
==1567161==    still reachable: 1,312,112,446 bytes in 536,534 blocks
==1567161==                       of which reachable via heuristic:
==1567161==                         stdstring          : 28,313 bytes in 640 blocks
==1567161==         suppressed: 0 bytes in 0 blocks
==1567161== Rerun with --leak-check=full to see details of leaked memory
==1567161==
==1567161== For lists of detected and suppressed errors, rerun with: -s
==1567161== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
[1]    1567161 segmentation fault (core dumped)  valgrind ./unittestX --gtest_filter="GCLTest.rigid_scaling_edge"

LLDB output:

[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from GCLTest
[ RUN      ] GCLTest.rigid_scaling_edge
Process 1568711 stopped
* thread #1, name = 'unittestX', stop reason = signal SIGSEGV: address not mapped to object (fault address: 0x0)
    frame #0: 0x0000000000000000
error: memory read failed for 0x0
(lldb) bt
* thread #1, name = 'unittestX', stop reason = signal SIGSEGV: address not mapped to object (fault address: 0x0)
  * frame #0: 0x0000000000000000
    frame #1: 0x000000000153142c unittestX`sierra::nalu::GeometryAlgDriver::mesh_motion_prework() at nvcc_internal_extended_lambda_implementation:624:85
    frame #2: 0x000000000152f591 unittestX`sierra::nalu::GeometryAlgDriver::pre_work() at GeometryAlgDriver.C:107:26
    frame #3: 0x0000000001506ec2 unittestX`sierra::nalu::NgpAlgDriver::execute(this=0x0000000047fd2790) at NgpAlgDriver.C:36:15
    frame #4: 0x0000000000c1ca4b unittestX`GCLTest_rigid_scaling_edge_Test::TestBody() at UnitTestGCL.h:281:25
    frame #5: 0x0000000004b0960d unittestX`void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 61
    frame #6: 0x0000000004afc6c4 unittestX`testing::Test::Run() (.part.0) + 724
    frame #7: 0x0000000004afccb2 unittestX`testing::TestInfo::Run() (.part.0) + 994
    frame #8: 0x0000000004afd016 unittestX`testing::TestSuite::Run() (.part.0) + 294
    frame #9: 0x0000000004afe9a6 unittestX`testing::internal::UnitTestImpl::RunAllTests() + 3542
    frame #10: 0x0000000004afb83f unittestX`testing::UnitTest::Run() + 95
    frame #11: 0x000000000066c89a unittestX`main at gtest.h:14808:47
    frame #12: 0x00007fff90ea37e5 libc.so.6`__libc_start_main + 229
    frame #13: 0x00000000007fc69e unittestX`_start + 46

In NgpFieldBLAS.h, commenting out the entire run_entity_algorithm call makes the segfault go away:

// nalu_ngp::run_entity_algorithm(
//   "ngp_field_axpby", ngpMesh, rank, sel, KOKKOS_LAMBDA(const MeshIndex& mi) {
//     for (unsigned d = 0; d < numComponents; ++d)
//       yField.get(mi, d) =
//         alpha * xField.get(mi, d) + beta * yField.get(mi, d);
//   });

Restoring the call but leaving the lambda body commented out makes the segfault come back:

nalu_ngp::run_entity_algorithm(
    "ngp_field_axpby", ngpMesh, rank, sel, KOKKOS_LAMBDA(const MeshIndex& mi) {
      // for (unsigned d = 0; d < numComponents; ++d)
      //   yField.get(mi, d) =
      //     alpha * xField.get(mi, d) + beta * yField.get(mi, d);
    });

So just calling run_entity_algorithm, even with an empty lambda, causes the segfault. Is sel bad? Does anyone have ideas for next steps? Tagging @alanw0 and @psakievich.

alanw0 commented 2 hours ago

@djglaze does this match the pattern of the CUDA compiler bug that we hit a few months ago?

Marc, we hit a bug where a function containing a lambda like this would segfault if it was included in multiple compilation units, but ran fine if it was included by only one .C file... Our solution was to use a functor instead of a lambda. A functor is a class object with an operator() method.

marchdf commented 2 hours ago

Interesting... FWIW I get the same result on two different GPU/CUDA version combinations: H100 with cuda@12.4.1 and A100 with cuda@12.5.1.

Do you have an example of the functor conversion that you made that fixed the issue?

alanw0 commented 48 minutes ago

I think you would just do something like this:

struct MyFunctor {
  KOKKOS_FUNCTION
  void operator()(const MeshIndex& mi) {
    // the code above that loops over numComponents and sets yField
  }
  // data members: xField, yField, alpha, beta, numComponents
};

MyFunctor f;
f.xField = ...; // etc.
nalu_ngp::run_entity_algorithm(..., f);

It's ugly but might be a worthy experiment...
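
For reference, here is a minimal host-only sketch of the lambda-to-functor conversion discussed in this thread. It is not the nalu-wind implementation. The names run_entity_algorithm_sketch, AxpbyFunctor, and axpby_sketch are hypothetical, raw pointers stand in for the NGP fields, a plain index stands in for MeshIndex, and the Kokkos annotations are omitted so that it compiles without Kokkos or CUDA.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for nalu_ngp::run_entity_algorithm: it calls the
// functor once per entity index on the host. The real function dispatches
// a Kokkos kernel over the selected mesh entities.
template <typename Functor>
void run_entity_algorithm_sketch(std::size_t numEntities, const Functor& f)
{
  for (std::size_t i = 0; i < numEntities; ++i)
    f(i);
}

// Functor that replaces the lambda body: y = alpha * x + beta * y.
// Assumption: raw pointers stand in for the NGP field objects. Like a
// device view, they let a const operator() write through to the data.
struct AxpbyFunctor
{
  double alpha;
  double beta;
  const double* x;
  double* y;
  unsigned numComponents;

  // In nalu-wind this would carry KOKKOS_FUNCTION and take a MeshIndex;
  // both are left out here so the sketch builds as plain C++.
  void operator()(std::size_t entity) const
  {
    for (unsigned d = 0; d < numComponents; ++d) {
      const std::size_t idx = entity * numComponents + d;
      y[idx] = alpha * x[idx] + beta * y[idx];
    }
  }
};

// Hypothetical driver: fills the functor's data members explicitly (the
// step a lambda capture would otherwise do) and runs it over all entities.
// y is taken by value so the caller's vector is left unchanged.
std::vector<double> axpby_sketch(
  double alpha,
  const std::vector<double>& x,
  double beta,
  std::vector<double> y,
  unsigned numComponents)
{
  assert(numComponents > 0);
  assert(x.size() == y.size());
  const std::size_t numEntities = y.size() / numComponents;
  const AxpbyFunctor f{alpha, beta, x.data(), y.data(), numComponents};
  run_entity_algorithm_sketch(numEntities, f);
  return y;
}
```

The point of the conversion is that the functor is an ordinary named type with explicit data members, so the kernel no longer depends on the compiler-generated extended-lambda machinery that shows up in the LLDB backtrace above. Whether that avoids this particular segfault is still an open experiment.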

Exawind / nalu-wind

GCLTest.rigid_scaling_edge with a Release/RelWithDebInfo GPU CUDA build #1306