StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

[HTR] device-side assert triggered #1687

Closed by cmelone 2 months ago

cmelone commented 2 months ago

Running a problem on 1 GPU in debug mode.

error:

[0 - 7f164a90fc80]  886.257289 {6}{gpu}: CUDA error reported on GPU 0: device-side assert triggered (CUDA_ERROR_ASSERT)
prometeo_ConstPropMix.exec: /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/realm/cuda/cuda_module.cc:401: bool Realm::Cuda::GPUStream::reap_events(Realm::TimeLimit): Assertion `0' failed.

backtraces:

#0  0x00007f167d92b9fd in nanosleep () from /lib64/libc.so.6
#1  0x00007f167d92b894 in sleep () from /lib64/libc.so.6
#2  0x00007f168105c086 in Realm::realm_freeze (signal=6) at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/realm/runtime_impl.cc:206
#3  <signal handler called>
#4  0x00007f167d89c387 in raise () from /lib64/libc.so.6
#5  0x00007f167d89da78 in abort () from /lib64/libc.so.6
#6  0x00007f167d8951a6 in __assert_fail_base () from /lib64/libc.so.6
#7  0x00007f167d895252 in __assert_fail () from /lib64/libc.so.6
#8  0x00007f168147ebd3 in Realm::Cuda::GPUStream::reap_events (this=0x37ddc60, work_until=...) at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/realm/cuda/cuda_module.cc:401
#9  0x00007f1681488bcb in Realm::Cuda::GPUWorker::do_work (this=0x32f77b0, work_until=...) at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/realm/cuda/cuda_module.cc:2424
#10 0x00007f16810977c1 in Realm::BackgroundWorkManager::Worker::do_work (this=0x4000538, max_time_in_ns=100, interrupt_flag=0x4000598)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/realm/bgwork.cc:599
#11 0x00007f16811cf9a3 in Realm::ThreadedTaskScheduler::wait_for_work (this=0x4000320, old_work_counter=6641) at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/realm/tasks.cc:1291
#12 0x00007f16811d0877 in Realm::KernelThreadTaskScheduler::wait_for_work (this=0x4000320, old_work_counter=6641) at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/realm/tasks.cc:1528
#13 0x00007f16811cf814 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x4000320) at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/realm/tasks.cc:1260
#14 0x00007f16811cf8d2 in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x4000320) at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/realm/tasks.cc:1272
#15 0x00007f16811d6b4a in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x4000320)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/realm/threads.inl:97
#16 0x00007f16811a6e73 in Realm::KernelThread::pthread_entry (data=0x7f1600051c40) at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/realm/threads.cc:831
#17 0x00007f167d449ea5 in start_thread () from /lib64/libpthread.so.0
#18 0x00007f167d964b0d in clone () from /lib64/libc.so.6

#0  0x00007f167d92b9fd in nanosleep () from /lib64/libc.so.6
#1  0x00007f167d92b894 in sleep () from /lib64/libc.so.6
#2  0x00007f168105c086 in Realm::realm_freeze (signal=6) at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/realm/runtime_impl.cc:206
#3  <signal handler called>
#4  0x00007f167d89c387 in raise () from /lib64/libc.so.6
#5  0x00007f167d89da78 in abort () from /lib64/libc.so.6
#6  0x00007f167d8951a6 in __assert_fail_base () from /lib64/libc.so.6
#7  0x00007f167d895252 in __assert_fail () from /lib64/libc.so.6
#8  0x00007f168147ebd3 in Realm::Cuda::GPUStream::reap_events (this=0x37dd0e0, work_until=...) at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/realm/cuda/cuda_module.cc:401
#9  0x00007f1681488bcb in Realm::Cuda::GPUWorker::do_work (this=0x32f77b0, work_until=...) at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/realm/cuda/cuda_module.cc:2424
#10 0x00007f16810977c1 in Realm::BackgroundWorkManager::Worker::do_work (this=0x7f16642de0f0, max_time_in_ns=-1, interrupt_flag=0x0)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/realm/bgwork.cc:599
#11 0x00007f1681095563 in Realm::BackgroundWorkThread::main_loop (this=0x4103660) at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/realm/bgwork.cc:103
#12 0x00007f1681098cd0 in Realm::Thread::thread_entry_wrapper<Realm::BackgroundWorkThread, &Realm::BackgroundWorkThread::main_loop> (obj=0x4103660)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/realm/threads.inl:97
#13 0x00007f16811a6e73 in Realm::KernelThread::pthread_entry (data=0x3fff600) at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/realm/threads.cc:831
#14 0x00007f167d449ea5 in start_thread () from /lib64/libpthread.so.0
#15 0x00007f167d964b0d in clone () from /lib64/libc.so.6

note to self: 3d periodic 256x128x128, crashes ~13th timestep

lightsighter commented 2 months ago

I'm not sure what you want us to do with that. Device-side asserts always come from application code.
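
For readers wondering why both backtraces end inside Realm when the fault is in the application: a device-side assert fires inside a kernel, but the host only learns of it later, when something polls the stream. The following is a minimal host-only C++ analogy of that pattern, not Realm's or CUDA's actual code. `FakeStream`, `launch`, `reap`, and `app_kernel` are hypothetical names, and the positivity check is an invented stand-in for whatever the real kernel asserts.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical stand-in for a GPU stream: work is queued at launch time and
// only examined later, when the runtime polls for completion.
struct FakeStream {
  std::queue<std::function<bool()>> pending;  // queued "kernels"
  bool sticky_error = false;                  // once set, stays set

  // Launching only enqueues the kernel; no application check runs here.
  void launch(std::function<bool()> kernel) {
    assert(kernel && "kernel must be callable");
    pending.push(std::move(kernel));
  }

  // Rough analogue of a runtime polling routine: it retires queued work and
  // is the first place a kernel failure becomes visible to the host.
  // Returns false once any kernel's check has failed.
  bool reap() {
    while (!pending.empty()) {
      if (!pending.front()()) sticky_error = true;
      pending.pop();
    }
    return !sticky_error;
  }
};

// Hypothetical application kernel. The check stands in for a device-side
// assert: it rejects non-positive (or NaN) values in its input.
bool app_kernel(const std::vector<double>& values) {
  for (double v : values) {
    if (!(v > 0.0)) return false;
  }
  return true;
}
```

Queuing `app_kernel` on bad input and then calling `reap()` surfaces the failure from `reap()`, which is roughly the position `GPUStream::reap_events` occupies in the traces above. The failing check itself, and therefore the fix, still lives in the application or its input.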

elliottslaughter commented 2 months ago

@cmelone Is this something you can bisect to a specific commit?

cmelone commented 2 months ago

I mistakenly used an input that was infeasible to run, so this is not a Legion issue.