LLNL / CHAI

Copy-hiding array abstraction to automatically migrate data between memory spaces
BSD 3-Clause "New" or "Revised" License

Table reproducer #281

Open liu15 opened 6 days ago

liu15 commented 6 days ago

To reproduce crash on rzadams:

mkdir build_rocm
cd build_rocm
cmake -DCHAI_ENABLE_REPRODUCERS=1 -C ../configs/lc/toss_4_x86_64_ib_cray/amdclang.cmake ..
flux alloc -N 1 -n 1 -g1 make -j 40
flux alloc -N 1 -n 1 -g1 ./bin/managed_ptr_multiple_inheritance_reproducer.exe

The rzansel case does not crash and has consistent pointer addresses in YofXfromRTTable1D. This can be reproduced with:

mkdir build_cuda
cd build_cuda
cmake -DCHAI_ENABLE_REPRODUCERS=1 -C ../configs/lc/blueos_3_ppc64le_ib_p9/nvcc_clang.cmake ..
lalloc 1 make -j 40
lalloc 1 ./bin/managed_ptr_multiple_inheritance_reproducer.exe

dtaller commented 14 hours ago

To reproduce crash on rzadams:

mkdir build_rocm
cd build_rocm
cmake -DCHAI_ENABLE_REPRODUCERS=1 -C ../configs/lc/toss_4_x86_64_ib_cray/amdclang.cmake ..
flux alloc -N 1 -n 1 -g1 make -j 40
flux alloc -N 1 -n 1 -g1 ./bin/managed_ptr_multiple_inheritance_reproducer.exe

* Note that the "this" pointer in YofXfromRTTable1D::RootFromBaseX differs from the "this" pointer in other YofXfromRTTable1D methods.

* Interestingly, the crash goes away if GetNumStrings() is removed.

The rzansel case does not crash and has consistent pointer addresses in YofXfromRTTable1D. This can be reproduced with:

mkdir build_cuda
cd build_cuda
cmake -DCHAI_ENABLE_REPRODUCERS=1 -C ../configs/lc/blueos_3_ppc64le_ib_p9/nvcc_clang.cmake ..
lalloc 1 make -j 40
lalloc 1 ./bin/managed_ptr_multiple_inheritance_reproducer.exe

@liu15, I would put something like this (an explanation of the reproducer, which systems it fails on, and how) as a comment at the top of the reproducer or in a README file somewhere. Those details would be useful for figuring out what is wrong, so it would be nice to have them documented there.