GEOS-DEV / GEOS

GEOS Simulation Framework
GNU Lesser General Public License v2.1
bug in testLifoStorage #3355

Open rrsettgast opened 2 months ago

rrsettgast commented 2 months ago

Describe the bug

Occasionally the CI reports an error in testLifoStorage. It is not reproducible, and a rerun passes most of the time. Still, this isn't a good thing to have lying around. It can be seen here:

https://github.com/GEOS-DEV/GEOS/actions/runs/10854274728/job/30124368903?pr=3340

The output for the failed testLifoStorage is here:

 90/211 Test  #90: testLifoStorage ......................................Subprocess aborted***Exception:   0.71 sec
[==========] Running 7 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 7 tests from LifoStorageTest
[ RUN      ] LifoStorageTest.LifoStorageBufferOnCUDA
Allocated    40.0 B to the HOST  : LvArray::Array<float, 1, camp::int_seq<long, 0l>, int, LvArray::ChaiBuffer>  Free memory on device: 10.9 GB
Allocated    40.0 B to the DEVICE: LvArray::Array<float, 1, camp::int_seq<long, 0l>, int, LvArray::ChaiBuffer>  Free memory on device: 10.9 GB
Moved    40.0 B to the DEVICE: LvArray::Array<float, 1, camp::int_seq<long, 0l>, int, LvArray::ChaiBuffer> 
 LIFO : maximum size 10 buffers 
 LIFO : buffer size 3.8147e-05MB
 LIFO : allocating 3 buffers on host
 LIFO : allocating 2 buffers on device
Allocated   120.0 B to the HOST  : LvArray::Array<float, 2, camp::int_seq<long, 0l, 1l>, long, LvArray::ChaiBuffer>  Free memory on device: 10.9 GB
Allocated    80.0 B to the DEVICE: LvArray::Array<float, 2, camp::int_seq<long, 0l, 1l>, long, LvArray::ChaiBuffer>  Free memory on device: 10.9 GB
Freed    80.0 B to the DEVICE: LvArray::Array<float, 2, camp::int_seq<long, 0l, 1l>, long, LvArray::ChaiBuffer>  Free memory on device: 10.9 GB
Freed   120.0 B to the HOST  : LvArray::Array<float, 2, camp::int_seq<long, 0l, 1l>, long, LvArray::ChaiBuffer>  Free memory on device: 10.9 GB
Freed    40.0 B to the HOST  : LvArray::Array<float, 1, camp::int_seq<long, 0l>, int, LvArray::ChaiBuffer>  Free memory on device: 10.9 GB
Freed    40.0 B to the DEVICE: LvArray::Array<float, 1, camp::int_seq<long, 0l>, int, LvArray::ChaiBuffer>  Free memory on device: 10.9 GB
[       OK ] LifoStorageTest.LifoStorageBufferOnCUDA (99 ms)
[ RUN      ] LifoStorageTest.LifoStorageBufferOnCUDAlarge
Allocated    3.8 MB to the HOST  : LvArray::Array<float, 1, camp::int_seq<long, 0l>, int, LvArray::ChaiBuffer>  Free memory on device: 10.9 GB
Allocated    3.8 MB to the DEVICE: LvArray::Array<float, 1, camp::int_seq<long, 0l>, int, LvArray::ChaiBuffer>  Free memory on device: 10.9 GB
Moved    3.8 MB to the DEVICE: LvArray::Array<float, 1, camp::int_seq<long, 0l>, int, LvArray::ChaiBuffer> 
 LIFO : maximum size 10000 buffers 
 LIFO : buffer size 3.8147MB
 LIFO : allocating 3 buffers on host
 LIFO : allocating 2 buffers on device
Allocated   11.4 MB to the HOST  : LvArray::Array<float, 2, camp::int_seq<long, 0l, 1l>, long, LvArray::ChaiBuffer>  Free memory on device: 10.9 GB
***** Controlling expression (should be false): dataPointer[ i ] != (float)(totalNumberOfBuffers-j-1)*elemCnt+i
***** MSG: "" << "\n" << "Expected " << "dataPointer[ i ]" << " " << "==" << " " << "(float)(totalNumberOfBuffers-j-1)*elemCnt+i" << "\n" << "  " << "dataPointer[ i ]" << " = " << dataPointer[ i ] << "\n" << "  " << "(float)(totalNumberOfBuffers-j-1)*elemCnt+i" << " = " << (float)(totalNumberOfBuffers-j-1)*elemCnt+i << "\n"

***** ERROR
***** LOCATION: /tmp/geos/src/coreComponents/common/unitTests/testLifoStorage.cpp:110
***** Block: [0, 0, 0]
***** Thread: [5, 0, 0]
***** Controlling expression (should be false): dataPointer[ i ] != (float)(totalNumberOfBuffers-j-1)*elemCnt+i
***** MSG: "" << "\n" << "Expected " << "dataPointer[ i ]" << " " << "==" << " " << "(float)(totalNumberOfBuffers-j-1)*elemCnt+i" << "\n" << "  " << "dataPointer[ i ]" << " = " << dataPointer[ i ] << "\n" << "  " << "(float)(totalNumberOfBuffers-j-1)*elemCnt+i" << " = " << (float)(totalNumberOfBuffers-j-1)*elemCnt+i << "\n"

***** ERROR
***** LOCATION: /tmp/geos/src/coreComponents/common/unitTests/testLifoStorage.cpp:110
***** Block: [0, 0, 0]
***** Thread: [6, 0, 0]
***** Controlling expression (should be false): dataPointer[ i ] != (float)(totalNumberOfBuffers-j-1)*elemCnt+i
***** MSG: "" << "\n" << "Expected " << "dataPointer[ i ]" << " " << "==" << " " << "(float)(totalNumberOfBuffers-j-1)*elemCnt+i" << "\n" << "  " << "dataPointer[ i ]" << " = " << dataPointer[ i ] << "\n" << "  " << "(float)(totalNumberOfBuffers-j-1)*elemCnt+i" << " = " << (float)(totalNumberOfBuffers-j-1)*elemCnt+i << "\n"

***** ERROR
***** LOCATION: /tmp/geos/src/coreComponents/common/unitTests/testLifoStorage.cpp:110
***** Block: [0, 0, 0]
***** Thread: [7, 0, 0]
***** Controlling expression (should be false): dataPointer[ i ] != (float)(totalNumberOfBuffers-j-1)*elemCnt+i
***** MSG: "" << "\n" << "Expected " << "dataPointer[ i ]" << " " << "==" << " " << "(float)(totalNumberOfBuffers-j-1)*elemCnt+i" << "\n" << "  " << "dataPointer[ i ]" << " = " << dataPointer[ i ] << "\n" << "  " << "(float)(totalNumberOfBuffers-j-1)*elemCnt+i" << " = " << (float)(totalNumberOfBuffers-j-1)*elemCnt+i << "\n"

***** ERROR
***** LOCATION: /tmp/geos/src/coreComponents/common/unitTests/testLifoStorage.cpp:110
***** Block: [0, 0, 0]
***** Thread: [8, 0, 0]
***** Controlling expression (should be false): dataPointer[ i ] != (float)(totalNumberOfBuffers-j-1)*elemCnt+i
***** MSG: "" << "\n" << "Expected " << "dataPointer[ i ]" << " " << "==" << " " << "(float)(totalNumberOfBuffers-j-1)*elemCnt+i" << "\n" << "  " << "dataPointer[ i ]" << " = " << dataPointer[ i ] << "\n" << "  " << "(float)(totalNumberOfBuffers-j-1)*elemCnt+i" << " = " << (float)(totalNumberOfBuffers-j-1)*elemCnt+i << "\n"

***** ERROR
***** LOCATION: /tmp/geos/src/coreComponents/common/unitTests/testLifoStorage.cpp:110
***** Block: [0, 0, 0]
***** Thread: [9, 0, 0]
***** Controlling expression (should be false): dataPointer[ i ] != (float)(totalNumberOfBuffers-j-1)*elemCnt+i
***** MSG: "" << "\n" << "Expected " << "dataPointer[ i ]" << " " << "==" << " " << "(float)(totalNumberOfBuffers-j-1)*elemCnt+i" << "\n" << "  " << "dataPointer[ i ]" << " = " << dataPointer[ i ] << "\n" << "  " << "(float)(totalNumberOfBuffers-j-1)*elemCnt+i" << " = " << (float)(totalNumberOfBuffers-j-1)*elemCnt+i << "\n"

Freed   120.0 B to the HOST  : LvArray::Array<float, 2, camp::int_seq<long, 0l, 1l>, long, LvArray::ChaiBuffer>  Free memory on device: 0.0 B
Freed    40.0 B to the HOST  : LvArray::Array<float, 1, camp::int_seq<long, 0l>, int, LvArray::ChaiBuffer>  Free memory on device: 0.0 B
Freed    40.0 B to the DEVICE: LvArray::Array<float, 1, camp::int_seq<long, 0l>, int, LvArray::ChaiBuffer>  Free memory on device: 0.0 B
terminate called after throwing an instance of 'umpire::runtime_error'
  what():  ! Umpire runtime_error [/tmp/build/chai/src/chai/src/tpl/umpire/src/umpire/alloc/CudaMallocAllocator.hpp:62]: cudaFree( ptr = 0x7fd642600000 ) failed with error: unspecified launch failure
    Backtrace: 19 frames
    0 0x7fd678bfbe81 No dladdr: /tmp/geos-build/lib/libcommon.so(_ZN6umpire4util49_GLOBAL__N__70505b0d_16_ArrayManager_cpp_ab41d17d15build_backtraceEv+0x31) [0x7fd678bfbe81]
    1 0x7fd678c02e60 No dladdr: /tmp/geos-build/lib/libcommon.so(_ZNK6umpire13runtime_error7messageB5cxx11Ev+0x20) [0x7fd678c02e60]
    2 0x7fd678c02a4b No dladdr: /tmp/geos-build/lib/libcommon.so(_ZN6umpire13runtime_errorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES8_i+0x13b) [0x7fd678c02a4b]
    3 0x7fd678c5a3e4 No dladdr: /tmp/geos-build/lib/libcommon.so(_ZN6umpire5alloc19CudaMallocAllocator10deallocateEPv+0x2d4) [0x7fd678c5a3e4]
    4 0x7fd678c59936 No dladdr: /tmp/geos-build/lib/libcommon.so(_ZN6umpire8resource24CudaDeviceMemoryResource10deallocateEPvm+0x266) [0x7fd678c59936]
    5 0x7fd678c023f4 No dladdr: /tmp/geos-build/lib/libcommon.so(_ZN6umpire9Allocator13do_deallocateEPv+0x294) [0x7fd678c023f4]
    6 0x7fd678bfe36e No dladdr: /tmp/geos-build/lib/libcommon.so(_ZN4chai12ArrayManager4freeEPNS_13PointerRecordENS_14ExecutionSpaceE+0x24e) [0x7fd678bfe36e]
    7 0x441ffb No dladdr: /tmp/geos-build/tests/testLifoStorage(_ZN7LvArray10ChaiBufferIfE4freeEv+0x3b) [0x441ffb]
    8 0x43d49b No dladdr: /tmp/geos-build/tests/testLifoStorage(_ZN4geos15testLifoStorageIN4RAJA6policy4cuda18cuda_exec_explicitINS1_17iteration_mapping6DirectENS1_4cuda11IndexGlobalILNS1_9named_dimE0ELi32ELi0EEENS7_23MaxOccupancyConcretizerELm1ELb0EEEEEviiii+0xa5b) [0x43d49b]
    9 0x49f799 No dladdr: /tmp/geos-build/tests/testLifoStorage(_ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc+0x49) [0x49f799]
    10 0x483e08 No dladdr: /tmp/geos-build/tests/testLifoStorage(_ZN7testing4Test3RunEv+0xd8) [0x483e08]
    11 0x484e00 No dladdr: /tmp/geos-build/tests/testLifoStorage(_ZN7testing8TestInfo3RunEv+0x130) [0x484e00]
    12 0x485905 No dladdr: /tmp/geos-build/tests/testLifoStorage(_ZN7testing9TestSuite3RunEv+0x2d5) [0x485905]
    13 0x495cbd No dladdr: /tmp/geos-build/tests/testLifoStorage(_ZN7testing8internal12UnitTestImpl11RunAllTestsEv+0x41d) [0x495cbd]
    14 0x4a03e9 No dladdr: /tmp/geos-build/tests/testLifoStorage(_ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS0_12UnitTestImplEbEET0_PT_MS4_FS3_vEPKc+0x49) [0x4a03e9]
    15 0x49586a No dladdr: /tmp/geos-build/tests/testLifoStorage(_ZN7testing8UnitTest3RunEv+0x5a) [0x49586a]
    16 0x43c1dc No dladdr: /tmp/geos-build/tests/testLifoStorage(main+0x1c) [0x43c1dc]
    17 0x7fd6764637e5 No dladdr: /usr/lib64/libc.so.6(__libc_start_main+0xe5) [0x7fd6764637e5]
    18 0x43b99e No dladdr: /tmp/geos-build/tests/testLifoStorage(_start+0x2e) [0x43b99e]
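
The controlling expression in the log compares each popped element with `(float)(totalNumberOfBuffers-j-1)*elemCnt+i`, which suggests that the buffer popped at step `j` is expected to be the one pushed at step `totalNumberOfBuffers-j-1`. A minimal host-only sketch of that invariant is below. It assumes each buffer `j` was filled with `(float)j*elemCnt+i` before being pushed. `ToyLifo` and `countMismatches` are hypothetical names, not part of GEOS, and the real test goes through LvArray and RAJA on the device.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the LIFO storage: a plain host-side stack of buffers.
struct ToyLifo
{
  std::vector< std::vector< float > > stack;

  void push( std::vector< float > const & buffer )
  { stack.push_back( buffer ); }

  std::vector< float > pop()
  {
    std::vector< float > top = stack.back();
    stack.pop_back();
    return top;
  }
};

// Pushes totalNumberOfBuffers buffers, pops them all, and returns how many
// elements violate the condition quoted in the log.
int countMismatches( int elemCnt, int totalNumberOfBuffers )
{
  ToyLifo lifo;
  std::vector< float > buffer( static_cast< std::size_t >( elemCnt ) );

  // Assumed fill pattern: element i of buffer j holds j*elemCnt + i.
  for( int j = 0; j < totalNumberOfBuffers; ++j )
  {
    for( int i = 0; i < elemCnt; ++i )
    {
      buffer[ i ] = (float)j * elemCnt + i;
    }
    lifo.push( buffer );
  }

  int mismatches = 0;
  // Pop in LIFO order: step j should return the buffer pushed at
  // step totalNumberOfBuffers-j-1.
  for( int j = 0; j < totalNumberOfBuffers; ++j )
  {
    std::vector< float > const data = lifo.pop();
    float const * dataPointer = data.data();
    for( int i = 0; i < elemCnt; ++i )
    {
      if( dataPointer[ i ] != (float)( totalNumberOfBuffers - j - 1 ) * elemCnt + i )
      {
        ++mismatches;
      }
    }
  }
  return mismatches;
}
```

With 10 buffers of 10 floats, `countMismatches( 10, 10 )` returns 0. The buffer size mirrors the 40 B buffers of the first test in the log; the sizes are otherwise arbitrary. A nonzero count would correspond to the errors reported at testLifoStorage.cpp:110.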

rrsettgast commented 2 months ago

@sframba @acitrain @jiemeng-total Is someone available to look into this? It is becoming an issue because of the high frequency of failed tests.

sframba commented 2 months ago

We're on it. I hope we can fix this quickly.

sframba commented 2 months ago

If I'm not mistaken, only the Clang build fails, not the GCC one (same CUDA version).

sframba commented 2 months ago

It seems hard to reproduce. So far I have never gotten the test to fail on Rocky Linux: https://github.com/GEOS-DEV/GEOS/actions/runs/10962440334/job/30441863550

sframba commented 1 month ago

@rrsettgast did you notice an improvement after #3362? If so, maybe we can close the issue.

rrsettgast commented 1 month ago

Hello. The problem still occurs. I am not at a computer now, but if you look at the recent actions you should see failures.

CusiniM commented 1 month ago

It failed again on this develop run.

https://github.com/GEOS-DEV/GEOS/actions/runs/11033321924

sframba commented 1 month ago

OK, it seems that it's always the LifoStorageBufferOnCUDANoDeviceBuffer test that's failing. Maybe we can disable it for now.