bandokihiro opened this issue 1 year ago
Which wrapper.sh? You have 387 of them in your scratch directory on sapling.
Please do
cd /scratch2/bandokihiro/Issue1344/run
sbatch runslurm.sh
Pull and try again.
This error is not triggered anymore. Without the safe control replication checks, the app runs to completion. With level 1, I run into the following, which does not happen on the control_replication branch:
[0 - 7ff8e1cd6000] 3.732217 {5}{runtime}: Detected control replication violation when invoking destroy_index_space in task top_level_task (UID 12) on shard 0 [Provenance: unknown]. The hash summary for the function does not align with the hash summaries from other call sites. We'll run the hash algorithm again to try to recognize what value differs between the shards, hang tight...
LEGION ERROR: Specific control replication violation occurred from member handle (from file /home/bandokihiro/scratch/Issue1344/legion/runtime/legion/legion_context.cc:13342)
That seems like a real error. You need to delete index spaces in the same order across your shards.
I am unable to locate the offending call. I commented out all destroy_index_space calls in the app and it still triggers, though after completion of most of the program. Does this happen when the runtime tears things down? I moved the run folder to /home/bandokihiro/scratch/Issue1344/run/1Node.
Pull and try again.
Same error with the following backtrace:
#0 0x00007f9d9cccb23f in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x7f9d6b0a5fe0, rem=rem@entry=0x7f9d6b0a5fe0)
at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
#1 0x00007f9d9ccd0ec7 in __GI___nanosleep (requested_time=requested_time@entry=0x7f9d6b0a5fe0, remaining=remaining@entry=0x7f9d6b0a5fe0) at nanosleep.c:27
#2 0x00007f9d9ccd0dfe in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#3 0x000055d5a7fec6b5 in Realm::realm_freeze (signal=6) at /home/bandokihiro/scratch/Issue1344/legion/runtime/realm/runtime_impl.cc:179
#4 <signal handler called>
#5 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#6 0x00007f9d9cc10859 in __GI_abort () at abort.c:79
#7 0x000055d5a6ff78f1 in Legion::Internal::Runtime::report_error_message (id=607,
file_name=0x55d5a8ebbfc0 "/home/bandokihiro/scratch/Issue1344/legion/runtime/legion/legion_context.cc", line=13340,
message=0x7f9d6b0a6be0 "Specific control replication violation occurred from member handle") at /home/bandokihiro/scratch/Issue1344/legion/runtime/legion/runtime.cc:31733
#8 0x000055d5a79d3e9c in Legion::Internal::ReplicateContext::verify_hash (this=0x7f8f3c073c00, hash=0x7f9d6b0a7c50, description=0x55d5a8ec5f46 "handle", provenance=0x0,
verify_every_call=true) at /home/bandokihiro/scratch/Issue1344/legion/runtime/legion/legion_context.cc:13340
#9 0x000055d5a6ffb421 in Legion::Internal::Murmur3Hasher::verify (this=0x7f9d6b0a7d70, description=0x55d5a8ec5f46 "handle", every_call=true)
at /home/bandokihiro/scratch/Issue1344/legion/runtime/legion/legion_utilities.h:2006
#10 0x000055d5a7a25a49 in Legion::Internal::Murmur3Hasher::hash<Legion::IndexSpace> (this=0x7f9d6b0a7d70, value=..., description=0x55d5a8ec5f46 "handle")
at /home/bandokihiro/scratch/Issue1344/legion/runtime/legion/legion_utilities.h:1863
#11 0x000055d5a79d92ee in Legion::Internal::ReplicateContext::destroy_index_space (this=0x7f8f3c073c00, handle=..., unordered=false, recurse=true, provenance=0x0)
at /home/bandokihiro/scratch/Issue1344/legion/runtime/legion/legion_context.cc:14162
#12 0x000055d5a79ca8ef in Legion::Internal::InnerContext::end_task (this=0x7f8f3c073c00, res=0x0, res_size=0, owned=false, deferred_result_instance=..., callback_functor=0x0,
resource=0x0, freefunc=0x0, metadataptr=0x0, metadatasize=0) at /home/bandokihiro/scratch/Issue1344/legion/runtime/legion/legion_context.cc:11423
#13 0x000055d5a79faa8f in Legion::Internal::ReplicateContext::end_task (this=0x7f8f3c073c00, res=0x0, res_size=0, owned=false, deferred_result_instance=..., callback_functor=0x0,
resource=0x0, freefunc=0x0, metadataptr=0x0, metadatasize=0) at /home/bandokihiro/scratch/Issue1344/legion/runtime/legion/legion_context.cc:19677
#14 0x000055d5a6b25d3b in Legion::Runtime::legion_task_postamble (ctx=0x7f8f3c073c00, retvalptr=0x0, retvalsize=0, owned=false, inst=..., metadataptr=0x0, metadatasize=0)
at /home/bandokihiro/scratch/Issue1344/legion/runtime/legion/legion.cc:8042
#15 0x000055d5a66b6da5 in Legion::LegionTaskWrapper::legion_task_wrapper<&(top_level_task(Legion::Task const*, std::vector<Legion::PhysicalRegion, std::allocator<Legion::PhysicalRegion> > const&, Legion::Internal::TaskContext*, Legion::Runtime*))> (args=0x7f9d4c01b670, arglen=8, userdata=0x0, userlen=0, p=...)
at /scratch2/bandokihiro/Issue1344/legion/install/include/legion/legion.inl:20346
#16 0x000055d5a7fd14ba in Realm::LocalTaskProcessor::execute_task (this=0x55da8ddc0e30, func_id=12, task_args=...)
at /home/bandokihiro/scratch/Issue1344/legion/runtime/realm/proc_impl.cc:1135
#17 0x000055d5a80386af in Realm::Task::execute_on_processor (this=0x7f9d4c01b4f0, p=...) at /home/bandokihiro/scratch/Issue1344/legion/runtime/realm/tasks.cc:302
#18 0x000055d5a803c90a in Realm::KernelThreadTaskScheduler::execute_task (this=0x55da8ddc1140, task=0x7f9d4c01b4f0)
at /home/bandokihiro/scratch/Issue1344/legion/runtime/realm/tasks.cc:1366
#19 0x000055d5a803b64e in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x55da8ddc1140) at /home/bandokihiro/scratch/Issue1344/legion/runtime/realm/tasks.cc:1105
#20 0x000055d5a803bc9f in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x55da8ddc1140) at /home/bandokihiro/scratch/Issue1344/legion/runtime/realm/tasks.cc:1217
#21 0x000055d5a8043e34 in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x55da8ddc1140)
at /home/bandokihiro/scratch/Issue1344/legion/runtime/realm/threads.inl:97
#22 0x000055d5a805110e in Realm::KernelThread::pthread_entry (data=0x55da8ddee280) at /home/bandokihiro/scratch/Issue1344/legion/runtime/realm/threads.cc:774
#23 0x00007f9da7da3609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#24 0x00007f9d9cd0d133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Pull and try again.
It's fixed, thanks. I know it might be too early to try this, but I have been running a couple of jobs on Summit with tracing enabled. The 1-, 2-, and 4-node cases worked. The 8-node case failed, and I reran it in debug mode. The following assertion was triggered:
#8 0x0000000012346240 in Legion::Internal::Runtime::find_messenger (this=0x209b60f0, sid=865271264) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/runtime.cc:20770
20770 assert(sid < LEGION_MAX_NUM_NODES);
(gdb) p sid
$1 = 865271264
I know there is still unimplemented functionality for tracing in the collective branch, and I'm sure there are bugs because I haven't tested it much yet. Still, please get a backtrace for the assertion, since it will need to be fixed eventually.
I have found two failure modes. I think the app has not yet reached the explicitly traced section.
Failure 1:
Thread 12 (Thread 0x200052acf890 (LWP 864596)):
#0 0x000020000a7ca114 in nanosleep () from /lib64/power9/libc.so.6
#1 0x000020000a7c9f44 in sleep () from /lib64/power9/libc.so.6
#2 0x000000001396336c in Realm::realm_freeze (signal=6) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/runtime_impl.cc:179
#3 <signal handler called>
#4 0x000020000a723618 in raise () from /lib64/power9/libc.so.6
#5 0x000020000a703a2c in abort () from /lib64/power9/libc.so.6
#6 0x000020000a716f70 in __assert_fail_base () from /lib64/power9/libc.so.6
#7 0x000020000a717014 in __assert_fail () from /lib64/power9/libc.so.6
#8 0x0000000012f69214 in Legion::Internal::EquivalenceSet::process_replication_response (this=0x200079023460, owner=19) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_analysis.cc:12090
#9 0x0000000012f8d1e8 in Legion::Internal::EquivalenceSet::handle_replication_response (derez=..., runtime=0x293d34d0) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_analysis.cc:17383
#10 0x00000000123566a4 in Legion::Internal::Runtime::handle_equivalence_set_replication_response (this=0x293d34d0, derez=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/runtime.cc:24951
#11 0x000000001231a850 in Legion::Internal::VirtualChannel::handle_messages (this=0x200078427a90, num_messages=1, runtime=0x293d34d0, remote_address_space=0, args=0x20005c18d500 "", arglen=28) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/runtime.cc:12794
#12 0x00000000123189bc in Legion::Internal::VirtualChannel::process_message (this=0x200078427a90, args=0x20005c18d4e4, arglen=48, runtime=0x293d34d0, remote_address_space=0) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/runtime.cc:11712
#13 0x000000001231b7dc in Legion::Internal::MessageManager::receive_message (this=0x20007821d6a0, args=0x20005c18d4e0, arglen=56) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/runtime.cc:13344
#14 0x000000001235b1dc in Legion::Internal::Runtime::process_message_task (this=0x293d34d0, args=0x20005c18d4dc, arglen=60) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/runtime.cc:26035
#15 0x00000000123790c0 in Legion::Internal::Runtime::legion_runtime_task (args=0x20005c18d4d0, arglen=64, userdata=0x293d0190, userlen=8, p=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/runtime.cc:31932
#16 0x000000001393af20 in Realm::LocalTaskProcessor::execute_task (this=0x292de7c0, func_id=4, task_args=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/proc_impl.cc:1135
#17 0x00000000139d4e68 in Realm::Task::execute_on_processor (this=0x20005c9a2b90, p=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/tasks.cc:302
#18 0x00000000139db2a4 in Realm::UserThreadTaskScheduler::execute_task (this=0x2881dbc0, task=0x20005c9a2b90) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/tasks.cc:1632
#19 0x00000000139d8950 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x2881dbc0) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/tasks.cc:1105
#20 0x00000000139e5cd0 in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop> (obj=0x2881dbc0) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/threads.inl:97
#21 0x00000000139fc7d0 in Realm::UserThread::uthread_entry () at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/threads.cc:1337
#22 0x000020000a737ffc in makecontext () from /lib64/power9/libc.so.6
#23 0x0000200c08000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
Failure 2:
Thread 11 (Thread 0x200052a6f890 (LWP 428008)):
#0 0x000020000a7ca114 in nanosleep () from /lib64/power9/libc.so.6
#1 0x000020000a7c9f44 in sleep () from /lib64/power9/libc.so.6
#2 0x000000001396336c in Realm::realm_freeze (signal=6) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/runtime_impl.cc:179
#3 <signal handler called>
#4 0x000020000a723618 in raise () from /lib64/power9/libc.so.6
#5 0x000020000a703a2c in abort () from /lib64/power9/libc.so.6
#6 0x000020000a716f70 in __assert_fail_base () from /lib64/power9/libc.so.6
#7 0x000020000a717014 in __assert_fail () from /lib64/power9/libc.so.6
#8 0x0000000012346278 in Legion::Internal::Runtime::find_messenger (this=0x29de6100, sid=0) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/runtime.cc:20771
#9 0x000000001234e298 in Legion::Internal::Runtime::send_equivalence_set_replication_response (this=0x29de6100, target=0, rez=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/runtime.cc:22692
#10 0x0000000012f69330 in Legion::Internal::EquivalenceSet::process_replication_response (this=0x2022341d9e90, owner=33) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_analysis.cc:12103
#11 0x0000000012f8d1e8 in Legion::Internal::EquivalenceSet::handle_replication_response (derez=..., runtime=0x29de6100) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_analysis.cc:17383
#12 0x00000000123566a4 in Legion::Internal::Runtime::handle_equivalence_set_replication_response (this=0x29de6100, derez=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/runtime.cc:24951
#13 0x000000001231a850 in Legion::Internal::VirtualChannel::handle_messages (this=0x20007c245950, num_messages=1, runtime=0x29de6100, remote_address_space=33, args=0x200074eca6a0 "", arglen=28) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/runtime.cc:12794
#14 0x00000000123189bc in Legion::Internal::VirtualChannel::process_message (this=0x20007c245950, args=0x200074eca684, arglen=48, runtime=0x29de6100, remote_address_space=33) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/runtime.cc:11712
#15 0x000000001231b7dc in Legion::Internal::MessageManager::receive_message (this=0x20007c22de90, args=0x200074eca680, arglen=56) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/runtime.cc:13344
#16 0x000000001235b1dc in Legion::Internal::Runtime::process_message_task (this=0x29de6100, args=0x200074eca67c, arglen=60) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/runtime.cc:26035
#17 0x00000000123790c0 in Legion::Internal::Runtime::legion_runtime_task (args=0x200074eca670, arglen=64, userdata=0x2998ba30, userlen=8, p=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/runtime.cc:31932
#18 0x000000001393af20 in Realm::LocalTaskProcessor::execute_task (this=0x29d6c740, func_id=4, task_args=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/proc_impl.cc:1135
#19 0x00000000139d4e68 in Realm::Task::execute_on_processor (this=0x200074e672d0, p=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/tasks.cc:302
#20 0x00000000139db2a4 in Realm::UserThreadTaskScheduler::execute_task (this=0x293aa840, task=0x200074e672d0) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/tasks.cc:1632
#21 0x00000000139d8950 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x293aa840) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/tasks.cc:1105
#22 0x00000000139e5cd0 in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop> (obj=0x293aa840) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/threads.inl:97
#23 0x00000000139fc7d0 in Realm::UserThread::uthread_entry () at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/threads.cc:1337
#24 0x000020000a737ffc in makecontext () from /lib64/power9/libc.so.6
#25 0x0000000000000000 in ?? ()
gdb2_1.txt
Another process on the same node was stuck at the previous line, assert(sid < LEGION_MAX_NUM_NODES);, which is what I described yesterday.
Pull and try again to see if the particular crashes above are gone. I think I pushed a fix for them, but I still have not done the general work for tracing that needs to be done to fully support it.
The above are fixed and I was able to run on 8 and 16 nodes. The 32-node case failed in release mode with behavior similar to the issue I described in #1235. I ran in debug mode and here is the backtrace:
Thread 12 (Thread 0x20005793f890 (LWP 266729)):
#0 0x000020000a7ca114 in nanosleep () from /lib64/power9/libc.so.6
#1 0x000020000a7c9f44 in sleep () from /lib64/power9/libc.so.6
#2 0x0000000011839bc0 in gasneti_freezeForDebuggerNow () at /autofs/nccs-svm1_sw/summit/gcc/9.3.0-2/include/c++/9.3.0/ext/new_allocator.h:89
#3 0x000000001484ba0c in gasneti_freezeForDebuggerErr ()
#4 0x0000000011839fdc in gasneti_defaultSignalHandler () at /autofs/nccs-svm1_sw/summit/gcc/9.3.0-2/include/c++/9.3.0/ext/new_allocator.h:89
#5 <signal handler called>
#6 0x000020000a723618 in raise () from /lib64/power9/libc.so.6
#7 0x000020000a703a2c in abort () from /lib64/power9/libc.so.6
#8 0x000020000a716f70 in __assert_fail_base () from /lib64/power9/libc.so.6
#9 0x000020000a717014 in __assert_fail () from /lib64/power9/libc.so.6
#10 0x00000000131b8b38 in Legion::Internal::LayoutDescription::compute_copy_offsets (this=0x200f88d7b3c0, copy_mask=..., instance=..., fields=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_instances.cc:363
#11 0x00000000131bcdbc in Legion::Internal::PhysicalManager::compute_copy_offsets (this=0x200f88d93da0, copy_mask=..., fields=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_instances.cc:1141
#12 0x00000000120f9310 in Legion::Internal::IndividualView::copy_from (this=0x200f88d94240, src_view=0x201c5a8f6c20, precondition=..., predicate_guard=..., reduction_op_id=0, copy_expression=0x200f8bbefdc0, op=0x200f88d929c0, index=0, copy_mask=..., src_point=0x201c5a8f67f0, trace_info=..., recorded_events=..., applied_events=..., across_helper=0x0, manage_dst_events=true, copy_restricted=false, need_valid_return=false) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_views.cc:2391
#13 0x0000000012f3fb2c in Legion::Internal::CopyFillAggregator::issue_copies (this=0x201c5a926710, target=0x200f88d94240, copies=..., recorded_events=..., precondition=..., copy_mask=..., trace_info=..., manage_dst_events=true, restricted_output=false, dst_events=0x0) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_analysis.cc:5485
#14 0x0000000012f3e900 in Legion::Internal::CopyFillAggregator::perform_updates (this=0x201c5a926710, updates=..., trace_info=..., precondition=..., recorded_events=..., redop_index=-1, manage_dst_events=true, restricted_output=false, dst_events=0x0) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_analysis.cc:5311
#15 0x0000000012f3defc in Legion::Internal::CopyFillAggregator::issue_updates (this=0x201c5a926710, trace_info=..., precondition=..., restricted_output=false, manage_dst_events=true, dst_events=0x0) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_analysis.cc:5153
#16 0x0000000012f4c3c0 in Legion::Internal::UpdateAnalysis::perform_updates (this=0x200f88d94d90, perform_precondition=..., applied_events=..., already_deferred=true) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_analysis.cc:7432
#17 0x0000000012f43c24 in Legion::Internal::PhysicalAnalysis::handle_deferred_update (args=0x200f88d95900) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_analysis.cc:6163
#18 0x0000000012379b70 in Legion::Internal::Runtime::legion_runtime_task (args=0x200f88d95900, arglen=20, userdata=0x4c2306d0, userlen=8, p=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/runtime.cc:32377
#19 0x000000001393b1a0 in Realm::LocalTaskProcessor::execute_task (this=0x4c170df0, func_id=4, task_args=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/proc_impl.cc:1135
#20 0x00000000139d50e8 in Realm::Task::execute_on_processor (this=0x200f88d95780, p=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/tasks.cc:302
#21 0x00000000139db524 in Realm::UserThreadTaskScheduler::execute_task (this=0x4ad6e8d0, task=0x200f88d95780) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/tasks.cc:1632
#22 0x00000000139d8bd0 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x4ad6e8d0) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/tasks.cc:1105
#23 0x00000000139e5f50 in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop> (obj=0x4ad6e8d0) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/threads.inl:97
#24 0x00000000139fca50 in Realm::UserThread::uthread_entry () at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/threads.cc:1337
#25 0x000020000a737ffc in makecontext () from /lib64/power9/libc.so.6
#26 0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
Attach a debugger and print out the following values in frame 10:
p/x copy_mask
p/x compressed
p/x allocated_fields
p found_in_cache
#8 0x00000000131b8b38 in Legion::Internal::LayoutDescription::compute_copy_offsets (this=0x201c4482bc80, copy_mask=..., instance=..., fields=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_instances.cc:363
363 assert(pop_count == FieldMask::pop_count(copy_mask));
(gdb) p/x copy_mask
$2 = (const Legion::Internal::FieldMask &) @0x200076238140: {<BitMaskHelp::Heapify<PPCTLBitMask<512> >> = {<No data fields>}, static ELEMENT_SIZE = 0x40, static BIT_ELMTS = 0x8, static PPC_ELMTS = 0x4, static MAXSIZE = <optimized out>, bits = {static ELEMENT_SIZE = 0x40, bit_vector = {0x80, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, static SHIFT = <optimized out>, static MASK = <optimized out>}, sum_mask = 0x80}
(gdb) p/x compressed
$3 = {<BitMaskHelp::Heapify<PPCTLBitMask<512> >> = {<No data fields>}, static ELEMENT_SIZE = 0x40, static BIT_ELMTS = 0x8, static PPC_ELMTS = 0x4, static MAXSIZE = <optimized out>, bits = {static ELEMENT_SIZE = 0x40, bit_vector = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, static SHIFT = <optimized out>, static MASK = <optimized out>}, sum_mask = 0x0}
(gdb) p/x allocated_fields
$4 = {<BitMaskHelp::Heapify<PPCTLBitMask<512> >> = {<No data fields>}, static ELEMENT_SIZE = 0x40, static BIT_ELMTS = 0x8, static PPC_ELMTS = 0x4, static MAXSIZE = <optimized out>, bits = {static ELEMENT_SIZE = 0x40, bit_vector = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, static SHIFT = <optimized out>, static MASK = <optimized out>}, sum_mask = 0x0}
(gdb) p found_in_cache
$5 = false
This also looks like the mapper is doing something weird and creating an instance without any fields in it. I can make Legion support that case, but in general instances without any fields seem kind of useless. Do you have a reason for making an instance without any fields?
There are maybe a few tasks during start-up for which I do that. This was an easy way to get a handle to the logical sub-regions. I remember in the past that I sometimes got errors when I didn't provide any fields. But I don't think I have such tasks when I actually start to iterate which is when it seems to fail. Is there a way for me to understand which region requirement of which task is triggering this?
If you go to frame 13 of the backtrace above, you should be able to tell which task/operation and region requirement caused this particular copy. p this->op will give you a pointer to the operation, and p this->op->get_operation_kind() will tell you its kind. Presumably it is a task, so you can then do something like p ((TaskOp*)this->op)->get_task_name() to get the name of the task. If you set the provenance, then p this->op->provenance->human should show you the provenance string. You can get the index of the region requirement using p this->index.
I'm still a little bit unsure how we even managed to be using a copy from an instance without any fields. Really that should never be happening. Maybe you'll get some semantic insight into how we ended up with a source instance that has empty fields. This only happens with 32 nodes and not at a smaller scale?
This only happens with 32 nodes and not at a smaller scale?
Yes.
I have a way to dump the re-ordered mesh so that I can skip this step on the next run. When I did that, the run succeeded.
I am trying to follow your instructions, but I am having a hard time making it freeze. I'll let you know when I have more insight.
I went to frame 13 and it said that there is no member or method named op.
Did you try p this->op? It is definitely there:
Were you able to find the operation using the debugger? If it doesn't work, try loading my gdb init file that can be found under my home directory on sapling. Run this in your debugger:
source .gdbinit
It only freezes once out of 20 times and I haven't been able to make it freeze again since last time. I'll try your gdb init file next time it freezes.
I tried your init file and was still unable to find this->op.
What does p this show as the type with my gdb plugin?
(gdb) p this
$2 = (Legion::Internal::CopyFillAggregator * const) 0x201c52499620
Ok, I forgot I refactored this. Try:
p this->analysis->op
I got this
#20 0x00000000131b8b38 in Legion::Internal::LayoutDescription::compute_copy_offsets (this=0x201c45171280, copy_mask=..., instance=..., fields=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_instances.cc:363
(gdb) p/x copy_mask
$5 = (const Legion::Internal::FieldMask &) @0x201c4c4e1210: {
<BitMaskHelp::Heapify<PPCTLBitMask<512> >> = {<No data fields>},
members of PPCTLBitMask<512>:
bits = {
bit_vector = {0x80, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
},
sum_mask = 0x80
}
(gdb) p/x compressed
$6 = {
<BitMaskHelp::Heapify<PPCTLBitMask<512> >> = {<No data fields>},
members of PPCTLBitMask<512>:
bits = {
bit_vector = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
},
sum_mask = 0x0
}
(gdb) p/x allocated_fields
$7 = {
<BitMaskHelp::Heapify<PPCTLBitMask<512> >> = {<No data fields>},
members of PPCTLBitMask<512>:
bits = {
bit_vector = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
},
sum_mask = 0x0
}
(gdb) p found_in_cache
$8 = false
#23 0x0000000012f3fb2c in Legion::Internal::CopyFillAggregator::issue_copies (this=0x200cdbc43930, target=0x201c5442f6b0, copies=..., recorded_events=..., precondition=..., copy_mask=..., trace_info=..., manage_dst_events=true, restricted_output=false, dst_events=0x0) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_analysis.cc:5485
(gdb) p this->analysis->op
$2 = (Legion::Internal::RemoteTaskOp * const) 0x201c54c733b0
(gdb) p ( (RemoteTaskOp*) this->analysis->op )->get_task_name()
$3 = 0x20007c21e2f0 "pack_for_explicit_ghosts_task"
(gdb) p this->analysis->index
$4 = 0
Region requirement 0 of pack_for_explicit_ghosts_task asks for a second-level sub-region (with some fields) in a 2D index task launch with a custom projection functor: point task (i,j) gets the j-th sub-region of the i-th first-level sub-region. The associated sub-domain can be empty, but the fact that it works when I start from a reordered mesh file makes me think this is not the problem. Point task (i,j) gets mapped to rank j.
What is the result of this (try to use my gdb plugin if you can)?
p this->analysis->analysis_expr->is_empty()
p this->analysis->analysis_expr->realm_index_space
p this->analysis->usage
It's pretty surprising to me that we can make it here with an empty copy_mask. The only way I think that can happen is if we also have an empty index space expression for the analysis (e.g. the index space for the region for this task is empty). The ordering of parallel analyses might also matter. Is this a read-only region requirement?
I got this
#11 0x0000000012f3fb2c in Legion::Internal::CopyFillAggregator::issue_copies (this=0x201c545e7db0, target=0x201c49790cd0, copies=..., recorded_events=..., precondition=...,
copy_mask=..., trace_info=..., manage_dst_events=true, restricted_output=false, dst_events=0x0)
at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_analysis.cc:5485
warning: Source file is more recent than executable.
5485 //--------------------------------------------------------------------------
(gdb) p this->analysis->analysis_expr->is_empty()
$1 = false
(gdb) p this->analysis->analysis_expr->realm_index_space
$2 = {
bounds = {
lo = {
x = 858842
},
hi = {
x = 858945
}
},
sparsity = {
id = 0
}
}
(gdb) p this->analysis->usage
$3 = {
privilege = LEGION_READ_PRIV,
prop = LEGION_EXCLUSIVE,
redop = 0
}
Yes, it's a read-only region requirement.
Have you tried running this with the safe mapper checks (-lg:safe_mapper)? I'm trying to understand how we were able to record an instance without any fields as a valid copy of the data for this non-empty region (we'd allow it if the region itself were empty).
I added the following flags: -lg:warn -lg:partcheck -lg:safe_ctrlrepl 1 -lg:safe_mapper. Nothing was caught.
Can you capture the detailed Legion Spy logs from a failing run? Also get the following from the crashed thread in the same run:
p this->analysis->op->unique_op_id (frame 13)
p this->analysis->index (frame 13)
p/x this->instance.id (frame 11)
p copy_mask (frame 12)
(gdb) p Realm::Network::my_node_id
$1 = 60
(gdb) up 9
#9 0x000000001319dfc8 in Legion::Internal::PhysicalManager::compute_copy_offsets (this=0x201c3ae2fc50, copy_mask=..., fields=...)
at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_instances.cc:1132
1132 layout->compute_copy_offsets(copy_mask, instance, fields);
(gdb) p/x this->instance.id
$2 = 0x4010001000800022
(gdb) up
#10 0x00000000120d81dc in Legion::Internal::IndividualView::copy_from (this=0x201c3ae300d0, src_view=0x201c47825be0, precondition=..., predicate_guard=..., reduction_op_id=0,
copy_expression=0x201c47811910, op=0x201c3ae2f090, index=0, copy_mask=..., src_point=0x201c478257a0, trace_info=..., recorded_events=..., applied_events=...,
across_helper=0x0, manage_dst_events=true, copy_restricted=false, need_valid_return=false)
at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_views.cc:2403
2403 manager->compute_copy_offsets(copy_mask, dst_fields);
(gdb) p copy_mask
$3 = (const Legion::Internal::FieldMask &) @0x201c52eb7cc0: {
<BitMaskHelp::Heapify<PPCTLBitMask<512> >> = {<No data fields>},
members of PPCTLBitMask<512>:
bits = {
bit_vector = {128, 0, 0, 0, 0, 0, 0, 0}
},
sum_mask = 128
}
(gdb) up
#11 0x0000000012f2a250 in Legion::Internal::CopyFillAggregator::issue_copies (this=0x201c3af4db20, target=0x201c3ae300d0, copies=..., recorded_events=..., precondition=...,
copy_mask=..., trace_info=..., manage_dst_events=true, restricted_output=false, dst_events=0x0)
at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/legion_analysis.cc:5416
5416 track_events);
(gdb) p this->analysis->index
$4 = 0
(gdb) p this->analysis->op->unique_op_id
$5 = 128128
The associated logs can be found at /scratch2/bandokihiro/Issue1344/logs_20221202.
Ok, it looks like this instance is not actually empty. Somehow the layout description for the instance on this remote node is not picking up the fact that there are valid fields in the instance, and that is causing the problem. In frame 10 of the above stack trace, let's see what the outputs of the following show:
p this->allocated_fields
p *(this->constraints)
If you can use my gdb plugin for STL data structures, then let's also see what the output of this is:
pvector this->constraints->field_constraint.field_set
If possible do this with Legion Spy logging and get new logs as well as the above data when it crashes. You can overwrite the old log files since they are a lot of data.
Please find the logs at /scratch2/bandokihiro/Issue1344/logs_20221214. Since gdb's output was rather long, I also left gdb.txt there with the outputs of the commands you wanted me to try.
Can you dump Legion Spy logs of just the part of the code that does the initialization when you're NOT restarting from a previous checkpoint? Do not run any of the simulation looping; we just want the initialization part. Somehow an instance with no fields is being recorded as a valid copy of the data for a field. It seems like it has to happen during initialization, since you can't reproduce the error when you restart from an existing checkpoint.
I put the logs of the initialization phase at /scratch2/bandokihiro/Issue1344/logs_20221217.
I pushed a fix for a related bug to the collective branch. Can you pull and see if it fixes your issue as well?
I pulled and it did not fix the issue.
Is there gdb output for the files above?
The last set of logs do not contain the iterative part of the solve, so it exited cleanly.
Ok, not particularly helpful then, since it's hard to track down these instances. Let's try a different tack: apply this patch and see where we're making these layouts with no fields, as that should never be happening. Check whether you hit either of these assertions at smaller node counts before going up to larger node counts:
diff --git a/runtime/legion/legion_instances.cc b/runtime/legion/legion_instances.cc
index 8189ef92d..a096a10d5 100644
--- a/runtime/legion/legion_instances.cc
+++ b/runtime/legion/legion_instances.cc
@@ -241,6 +241,7 @@ namespace Legion {
: allocated_fields(mask), constraints(con), owner(own), total_dims(dims)
//--------------------------------------------------------------------------
{
+ assert(!!allocated_fields);
constraints->add_base_gc_ref(LAYOUT_DESC_REF);
field_infos.resize(field_sizes.size());
// Switch data structures from layout by field order to order
@@ -271,6 +272,7 @@ namespace Legion {
: allocated_fields(mask), constraints(con), owner(NULL), total_dims(0)
//--------------------------------------------------------------------------
{
+ assert(!!allocated_fields);
constraints->add_base_gc_ref(LAYOUT_DESC_REF);
}
I have been trying for a while and I haven't been able to make it hang. Similarly, with GASNET_BACKTRACE=1, the job gets killed before the program has time to invoke gdb.
As in it crashes and then doesn't freeze? If REALM_FREEZE_ON_ERROR=1 isn't working, then you can also try GASNET_FREEZE_ON_ERROR=1. They are fundamentally doing the same thing though, so I wouldn't expect very different behavior. Usually the reason it would die like that with freeze-on-error is that one of the processes actually exits cleanly, and then the runner (e.g. mpirun) comes through and kills the other processes. Is there any evidence of that happening (mpirun will usually tell you when it does)? Other than that, the only other thing that might be happening is an OOM error where the OS is killing a process, but I would be surprised if that were the case here.
Is the original failure mode still reproducible?
The original failure mode was already difficult to make hang; the assertion we added has been even harder (I wasn't able to make it hang once). I see the "going to freeze" message for a subset of processes, then the job gets killed a few seconds later. I tried both the Realm and the GASNet environment variables (in my experience, GASNet's is a bit more reliable). I don't think one of the processes can exit cleanly, since data needs to move around at each iteration and I set up the problem for 20 iterations.
Are you still observing this bug?
Yes. I ran in release mode and observed the same failure mode as documented in #1235.
/gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/transfer/transfer.cc:4015: void Realm::TransferDesc::perform_analysis(): Assertion `srcs.size() == dsts.size()' failed.
My debugging attempt stalled when I wasn't able to make the program hang with the patch you gave me above.
I played with the collective branch on sapling. Since the mode of failure has changed since the last time I tried (it was a mapper error then), and for future debugging if needed, I'll report the errors I am triggering here.
The backtrace associated with the current mode of failure is the following:
The command line was the following