StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

[HTR] Segmentation faults at 16 nodes #1420

Closed cmelone closed 7 months ago

cmelone commented 1 year ago

I am able to run one of our applications (16 ranks, 1 rank per node) on Legion commit cba415a857c2586b2ad2f4848d6d1cd75de7df00.

However, on 9c6c90b9e3857196da2659a29140f2d7686832bb, I get segmentation faults and non-deterministic errors such as:

prometeo_ConstPropMix.exec: prometeo_variables.cc:75: static void UpdatePropertiesFromPrimitiveTask::cpu_base_impl(const UpdatePropertiesFromPrimitiveTask::Args&, const std::vector<Legion::PhysicalRegion>&, const std::vector<Legion::Future>&, Legion::Context, Legion::Runtime*): Assertion `args.mix.CheckMixture(acc_MolarFracs[p])' failed.
[5 - 7fbc93ba8840] 1193.644387 {6}{realm}: invalid event handle: id=7fbcab057570
prometeo_ConstPropMix.exec: /home/hpcc/gitlabci/multi/codes/legion-cpu-release/runtime/realm/runtime_impl.cc:2509: Realm::EventImpl* Realm::RuntimeImpl::get_event_impl(Realm::Event): Assertion `0 && "invalid event handle"' failed.

This program does run successfully with DEBUG=1. I am actively running this test case with smaller configurations to see if I can reproduce outside of this specific config.

Edit:

16 ranks, 4 ranks per node works
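Since the regression is bracketed by the two commits above, one way to narrow it down is to bisect the Legion checkout (an illustrative sketch only; the rebuild-and-run step is specific to this application and machine):

cd legion
git bisect start
git bisect bad  9c6c90b9e3857196da2659a29140f2d7686832bb   # crashes
git bisect good cba415a857c2586b2ad2f4848d6d1cd75de7df00   # runs cleanly
# At each step, rebuild Legion and the application, rerun the 16-rank case,
# then mark the result with "git bisect good" or "git bisect bad" until git
# reports the first bad commit.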

lightsighter commented 1 year ago

I will take any reproducer if you can get it to reproduce on sapling.

cmelone commented 1 year ago

I don't have an account on Sapling. Is it possible to get access? Otherwise, I'll need to wait for Mario. Edit: I'll submit a request.

cmelone commented 1 year ago

Process:

Process 143173 on node c0001.stanford.edu is frozen!

Build:

module load slurm
srun -p cpu --pty bash
cd /home/cmelone/0507
REBUILD=1 ./compile.sh

Run:

# from login node
cd /home/cmelone/0507
REBUILD=0 ./compile.sh

Let me know if you need more info/permissions to the directories.

elliottslaughter commented 1 year ago

For @lightsighter, this is on the new sapling2 cluster.

lightsighter commented 1 year ago

Pull and try again.

cmelone commented 1 year ago

Note that Legion is now located in /home/cmelone/lg, had to reinstall a dependency.

prometeo_ConstPropMix.exec: /home/cmelone/lg/runtime/legion/runtime.cc:27094: Legion::Internal::DistributedCollectable* Legion::Internal::Runtime::find_or_request_distributed_collectable(Legion::DistributedID, Legion::Internal::RtEvent&) [with T = Legion::Internal::EquivalenceSet; Legion::Internal::MessageKind MK = Legion::Internal::SEND_EQUIVALENCE_SET_REQUEST; Legion::DistributedID = long long unsigned int]: Assertion `target != address_space' failed.
...
Process 69822 on node c0004.stanford.edu is frozen!
Process 72485 on node c0003.stanford.edu is frozen!
#0  0x00007f215bb7623f in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x7f1fe4045920, rem=rem@entry=0x7f1fe4045920)
    at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
#1  0x00007f215bb7bec7 in __GI___nanosleep (requested_time=requested_time@entry=0x7f1fe4045920, remaining=remaining@entry=0x7f1fe4045920) at nanosleep.c:27
#2  0x00007f215bb7bdfe in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#3  0x00007f215eea34ef in Realm::realm_freeze (signal=6) at /home/cmelone/lg/runtime/realm/runtime_impl.cc:187
#4  <signal handler called>
#5  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#6  0x00007f215babb859 in __GI_abort () at abort.c:79
#7  0x00007f215babb729 in __assert_fail_base (fmt=0x7f215bc51588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x7f215fa402e6 "target != address_space",
    file=0x7f215fa29240 "/home/cmelone/lg/runtime/legion/runtime.cc", line=27094, function=<optimized out>) at assert.c:92
#8  0x00007f215baccfd6 in __GI___assert_fail (assertion=0x7f215fa402e6 "target != address_space", file=0x7f215fa29240 "/home/cmelone/lg/runtime/legion/runtime.cc", line=27094,
    function=0x7f215fa40aa0 "Legion::Internal::DistributedCollectable* Legion::Internal::Runtime::find_or_request_distributed_collectable(Legion::DistributedID, Legion::Internal::RtEvent&) [with T = Legion::Internal::EquivalenceS"...) at assert.c:101
#9  0x00007f215e882601 in Legion::Internal::Runtime::find_or_request_distributed_collectable<Legion::Internal::EquivalenceSet, (Legion::Internal::MessageKind)175> (
    this=0x5605d70e12a0, to_find=1008806316531255313, ready=...) at /home/cmelone/lg/runtime/legion/runtime.cc:27094
#10 0x00007f215e840d78 in Legion::Internal::Runtime::find_or_request_equivalence_set (this=0x5605d70e12a0, did=1008806316531255313, ready=...)
    at /home/cmelone/lg/runtime/legion/runtime.cc:27017
#11 0x00007f215e5b2d51 in Legion::Internal::EquivalenceSet::handle_clone_response (derez=..., runtime=0x5605d70e12a0) at /home/cmelone/lg/runtime/legion/legion_analysis.cc:18982
#12 0x00007f215e83a627 in Legion::Internal::Runtime::handle_equivalence_set_clone_response (this=0x5605d70e12a0, derez=...) at /home/cmelone/lg/runtime/legion/runtime.cc:25281
#13 0x00007f215e807953 in Legion::Internal::VirtualChannel::handle_messages (this=0x7f2034000bd0, num_messages=1, runtime=0x5605d70e12a0, remote_address_space=0,
    args=0x7f206ae1dc60 "", arglen=120) at /home/cmelone/lg/runtime/legion/runtime.cc:13063
#14 0x00007f215e805fbb in Legion::Internal::VirtualChannel::process_message (this=0x7f2034000bd0, args=0x7f206ae1dc44, arglen=140, runtime=0x5605d70e12a0, remote_address_space=0)
    at /home/cmelone/lg/runtime/legion/runtime.cc:11909
#15 0x00007f215e80857f in Legion::Internal::MessageManager::receive_message (this=0x7f2034000ba0, args=0x7f206ae1dc40, arglen=148) at /home/cmelone/lg/runtime/legion/runtime.cc:13587
#16 0x00007f215e83dec2 in Legion::Internal::Runtime::process_message_task (this=0x5605d70e12a0, args=0x7f206ae1dc3c, arglen=152) at /home/cmelone/lg/runtime/legion/runtime.cc:26326
#17 0x00007f215e85572c in Legion::Internal::Runtime::legion_runtime_task (args=0x7f206ae1dc30, arglen=156, userdata=0x5605d70df250, userlen=8, p=...)
    at /home/cmelone/lg/runtime/legion/runtime.cc:32056
#18 0x00007f215f20fb6a in Realm::LocalTaskProcessor::execute_task (this=0x5605d608c260, func_id=4, task_args=...) at /home/cmelone/lg/runtime/realm/proc_impl.cc:1129
#19 0x00007f215f017681 in Realm::Task::execute_on_processor (this=0x7f207363c780, p=...) at /home/cmelone/lg/runtime/realm/tasks.cc:302
#20 0x00007f215f01b8ca in Realm::KernelThreadTaskScheduler::execute_task (this=0x5605d5bae330, task=0x7f207363c780) at /home/cmelone/lg/runtime/realm/tasks.cc:1366
#21 0x00007f215f01a614 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x5605d5bae330) at /home/cmelone/lg/runtime/realm/tasks.cc:1105
#22 0x00007f215f01ac65 in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x5605d5bae330) at /home/cmelone/lg/runtime/realm/tasks.cc:1217
#23 0x00007f215f023350 in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x5605d5bae330)
    at /home/cmelone/lg/runtime/realm/threads.inl:97
#24 0x00007f215efefb0c in Realm::KernelThread::pthread_entry (data=0x7f1fe40051a0) at /home/cmelone/lg/runtime/realm/threads.cc:781
#25 0x00007f215ba72609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#26 0x00007f215bbb8133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

lightsighter commented 1 year ago

Rebuild with -DLEGION_GC and run with -level legion_gc=2 to produce log files and a hanging process.
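For reference, a sketch of how that maps onto the Makefile-based build and the compile.sh wrapper used here (whether compile.sh forwards CC_FLAGS is an assumption):

# Illustrative only: add the GC instrumentation at compile time ...
export CC_FLAGS="-DLEGION_GC"
REBUILD=1 ./compile.sh
# ... and at run time raise the legion_gc logger level and write per-rank log files:
#   prometeo_ConstPropMix.exec <usual arguments> -level legion_gc=2 -logfile out/<run>/%.log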

cmelone commented 1 year ago

Log files located at

/home/cmelone/0507/out/8x8x2Run-3d854ac9-8946-4e6e-94f9-acd65553c50e/

PIDs:

Process 197057 on node c0001.stanford.edu is frozen!
Process 3327 on node c0002.stanford.edu is frozen!
Process 90000 on node c0003.stanford.edu is frozen!
Process 86603 on node c0004.stanford.edu is frozen!

lightsighter commented 1 year ago

You can kill this job. What task(s) in the task tree are you control replicating? What levels are they at?

mariodirenzo commented 1 year ago

What task(s) in the task tree are you control replicating? What levels are they at?

The control replicated task is called workSingle, and it is launched by the top-level task called main.

lightsighter commented 1 year ago

What region arguments are there to workSingle? Are any of them virtually mapped?

mariodirenzo commented 1 year ago

There aren't any region arguments. It is an inner task and defines all of the regions inside it.
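For context, these two properties map onto the C++ runtime API roughly as in the sketch below (illustrative only; HTR registers its tasks through its own tooling, so the task ID, name, and body here are made up):

// Illustrative sketch of registering an inner, control-replicable task variant
// with the C++ Legion API.
#include "legion.h"
using namespace Legion;

enum { WORK_SINGLE_TASK_ID = 1 };

void workSingle_task(const Task *task, const std::vector<PhysicalRegion> &regions,
                     Context ctx, Runtime *runtime)
{
  // An inner task takes no region data directly; it creates regions and
  // launches sub-tasks from inside its body.
}

static void register_work_single()
{
  TaskVariantRegistrar registrar(WORK_SINGLE_TASK_ID, "workSingle");
  registrar.add_constraint(ProcessorConstraint(Processor::LOC_PROC));
  registrar.set_inner(true);        // only launches sub-operations
  registrar.set_replicable(true);   // eligible for control replication
  Runtime::preregister_task_variant<workSingle_task>(registrar, "workSingle");
}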

cmelone commented 1 year ago

New error:

legion/region_tree.cc:10852: static void Legion::Internal::IndexPartNode::handle_node_request(Legion::Internal::RegionTreeForest*, Legion::Deserializer&): Assertion `!target->collective_mapping->contains(source)' failed.
#0  0x00007efd05e5e9fd in nanosleep () from /lib64/libc.so.6
#1  0x00007efd05e5e894 in sleep () from /lib64/libc.so.6
#2  0x00007efd09225bc6 in Realm::realm_freeze (signal=6) at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/runtime_impl.cc:187
#3  <signal handler called>
#4  0x00007efd05dcf387 in raise () from /lib64/libc.so.6
#5  0x00007efd05dd0a78 in abort () from /lib64/libc.so.6
#6  0x00007efd05dc81a6 in __assert_fail_base () from /lib64/libc.so.6
#7  0x00007efd05dc8252 in __assert_fail () from /lib64/libc.so.6
#8  0x00007efd08b00b1f in Legion::Internal::IndexPartNode::handle_node_request (forest=0x31031d0, derez=...)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/legion/region_tree.cc:10852
#9  0x00007efd08be5c19 in Legion::Internal::Runtime::handle_index_partition_request (this=0x30fb800, derez=...)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/legion/runtime.cc:23954
#10 0x00007efd08bb7c0e in Legion::Internal::VirtualChannel::handle_messages (this=0x7efc4c07c500, num_messages=1, runtime=0x30fb800, remote_address_space=14,
    args=0x7efc581fd910 "\035", arglen=24) at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/legion/runtime.cc:12161
#11 0x00007efd08bb7276 in Legion::Internal::VirtualChannel::process_message (this=0x7efc4c07c500, args=0x7efc581fd8f4, arglen=44, runtime=0x30fb800, remote_address_space=14)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/legion/runtime.cc:11891
#12 0x00007efd08bb97f3 in Legion::Internal::MessageManager::receive_message (this=0x7efc4c001dd0, args=0x7efc581fd8f0, arglen=52)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/legion/runtime.cc:13572
#13 0x00007efd08beaea0 in Legion::Internal::Runtime::process_message_task (this=0x30fb800, args=0x7efc581fd8ec, arglen=56)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/legion/runtime.cc:26322
#14 0x00007efd08c003b8 in Legion::Internal::Runtime::legion_runtime_task (args=0x7efc581fd8e0, arglen=60, userdata=0x30fae70, userlen=8, p=...)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/legion/runtime.cc:32052
#15 0x00007efd095374a0 in Realm::LocalTaskProcessor::execute_task (this=0x20b0d00, func_id=4, task_args=...)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/proc_impl.cc:1129
#16 0x00007efd093747e4 in Realm::Task::execute_on_processor (this=0x7efc58214870, p=...) at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/tasks.cc:302
#17 0x00007efd09378568 in Realm::KernelThreadTaskScheduler::execute_task (this=0x20b0ff0, task=0x7efc58214870)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/tasks.cc:1366
#18 0x00007efd093773df in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x20b0ff0) at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/tasks.cc:1105
#19 0x00007efd09377a02 in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x20b0ff0) at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/tasks.cc:1217
#20 0x00007efd0937ed3a in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x20b0ff0)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/threads.inl:97
#21 0x00007efd09350ad7 in Realm::KernelThread::pthread_entry (data=0x7ef9f0002080) at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/threads.cc:781
#22 0x00007efd0597cea5 in start_thread () from /lib64/libpthread.so.0
#23 0x00007efd05e97b0d in clone () from /lib64/libc.so.6

lightsighter commented 1 year ago

Please make a reproducer for me to play with.

cmelone commented 1 year ago

Log files at /home/cmelone/0507/out/8x8x2Run-9eb0d3a9-63f7-47a6-b9e2-4bfdf6e03e4b

PIDs

Process 681748 on node g0001.stanford.edu is frozen!
Process 306030 on node g0002.stanford.edu is frozen!
Process 197637 on node g0003.stanford.edu is frozen!
Process 129862 on node g0004.stanford.edu is frozen!

Note: I am using the GPU nodes but this is a CPU execution

lightsighter commented 1 year ago

How do I run it for myself?

cmelone commented 1 year ago

Cancel current job:

scancel -u cmelone

Build:

module load slurm
srun -p cpu --pty bash
cd /home/cmelone/0507
# pulls latest version of legion and rebuilds legion and htr
REBUILD=1 ./compile.sh
exit

Run:

# from login node
cd /home/cmelone/0507
REBUILD=0 ./compile.sh

A Slurm log will appear in the 0507 directory and will point to an output directory.

lightsighter commented 1 year ago

Do you have to use the GPU nodes for execution or can it run on the CPU nodes? Also, can we run all four processes on the same node? If not, I'll need to wait for other people to get off the nodes.

cmelone commented 1 year ago

I've actually found a way to trigger an error (target != address_space failed) on 1 node, 4 ranks per node. I was using 4 GPU nodes (ignoring CUDA) with the previous configuration because there are only 3 CPU nodes available due to Legion CI cycles.

The same instructions apply to reproduce.

To get today's error (!target->collective_mapping->contains(source) failed), 4 nodes are needed so we'd need to wait for other people to get off the nodes.

Edit: Elliott has updated the partitions to allow for greater capacity (there are plenty of nodes available now), so here are the options depending on the error/config you'd like to target:

lightsighter commented 1 year ago

!target->collective_mapping->contains(source)

Pull, rebuild, and try again.

target != address_space failed

Try with this diff and see what happens:

diff --git a/runtime/legion/legion.cc b/runtime/legion/legion.cc
index 38f6fae0e..b7389c38a 100644
--- a/runtime/legion/legion.cc
+++ b/runtime/legion/legion.cc
@@ -5875,6 +5875,7 @@ namespace Legion {
     //--------------------------------------------------------------------------
     {
       Internal::AutoProvenance provenance(prov);
+      task_local = true;
       return ctx->create_logical_region(index, fields, task_local, provenance);
     }
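If it helps, one way to apply and test the diff (a sketch; the patch file name is made up):

# Illustrative only: apply the patch to the Legion checkout, then rebuild Legion and HTR
cd legion
git apply task_local.patch    # the diff above, saved to a file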

cmelone commented 1 year ago

Getting the same error for both after applying changes.

lightsighter commented 1 year ago

Can you try making your dump task not be an inner variant?

cmelone commented 1 year ago

After making the dump task not inner, I get the same output.

lightsighter commented 1 year ago

What do you mean by the same output?

cmelone commented 1 year ago

The same error is output by Legion for both executions.

lightsighter commented 1 year ago

Whatever you changed didn't actually reclassify the dumpTask as a non-inner variant:

(gdb) p task_id
$2 = 10161
(gdb) p get_task_name()
$3 = 0x5618377fef10 "dumpTile"
(gdb) p runtime->find_task_impl(10161)
$4 = (Legion::Internal::TaskImpl *) 0x7fe448051710
(gdb) p $4->find_variant_impl(76, false)
$5 = (Legion::Internal::VariantImpl *) 0x7fe4480522b0
(gdb) p $5->inner_variant
$6 = true

cmelone commented 1 year ago

I had the change on another machine; could you try again, please?

lightsighter commented 1 year ago

I don't see the same failure mode anymore:

cmelone@sapling2:~/0507$ tail -f slurm-384.out 
Sending output to out/8x8x2Run_842-d21caed6-96ca-4bc3-8f87-5796d2f329d8
Invoking Legion on 4 rank(s), 1 node(s) (4 rank(s) per node), as follows:
/home/cmelone/htr/src/prometeo_ConstPropMix.exec -i 8x8x2Run_842.json -o out/8x8x2Run_842-d21caed6-96ca-4bc3-8f87-5796d2f329d8 -ll:force_kthreads -logfile out/8x8x2Run_842-d21caed6-96ca-4bc3-8f87-5796d2f329d8/%.log -lg:safe_ctrlrepl 2 -level legion_gc=2 -ll:cpu 1 -ll:ocpu 2 -ll:onuma 1 -ll:othr 1 -ll:ostack 8 -ll:util 4 -ll:io 4 -ll:bgwork 2 -ll:cpu_bgwork 100 -ll:util_bgwork 100 -ll:csize 55000 -lg:eager_alloc_percentage 30 -ll:rsize 512 -ll:ib_rsize 512 -ll:gsize 0 -ll:stacksize 8 -lg:sched -1 -lg:hysteresis 0
prometeo_ConstPropMix.exec: prometeo_variables.cc:75: static void UpdatePropertiesFromPrimitiveTask::cpu_base_impl(const UpdatePropertiesFromPrimitiveTask::Args&, const std::vector<Legion::PhysicalRegion>&, const std::vector<Legion::Future>&, Legion::Context, Legion::Runtime*): Assertion `args.mix.CheckMixture(acc_MolarFracs[p])' failed.
Legion process received signal 6: Aborted
Process 502733 on node c0001.stanford.edu is frozen!
prometeo_ConstPropMix.exec: prometeo_variables.cc:75: static void UpdatePropertiesFromPrimitiveTask::cpu_base_impl(const UpdatePropertiesFromPrimitiveTask::Args&, const std::vector<Legion::PhysicalRegion>&, const std::vector<Legion::Future>&, Legion::Context, Legion::Runtime*): Assertion `args.mix.CheckMixture(acc_MolarFracs[p])' failed.

Pull and try again.

cmelone commented 1 year ago

I can confirm that I no longer see the error on the 1-node config, but the error on 4 nodes still comes up.

lightsighter commented 1 year ago

Which error shows up on 4 nodes?

cmelone commented 1 year ago

Assertion `!target->collective_mapping->contains(source)' failed.

lightsighter commented 1 year ago

Please make a reproducer. I can't reproduce it on sapling anymore.

cmelone commented 1 year ago

I am able to reproduce by doing

cd /home/cmelone/0507
CON=8x8x2Run_222 REBUILD=0 ./compile.sh

If you'd like to see the list of frozen processes (I just submitted the job and it's still running), see 0507/slurm-403.out.

lightsighter commented 1 year ago

Pull and try again. I'm on vacation so my ability to debug this is extremely limited.

cmelone commented 1 year ago

Using the latest control_replication commit, I am still seeing

Assertion `!target->collective_mapping->contains(source)' failed.

I rebuilt HTR and Legion on Sapling, and there is a job (623) currently in the queue waiting to execute as a reproducer.

Edit: the job is now running (see 0507/slurm-623.out)

lightsighter commented 1 year ago

Pull and try again.

cmelone commented 1 year ago

The job made it about 20 minutes in and started to hang. Backtraces are in /scratch/cmelone/bt_05312023. I can run with Legion Spy if that would be helpful.

lightsighter commented 1 year ago

That doesn't look like a hang. Can you check whether the backtraces change over time?

cmelone commented 1 year ago

The program stops progressing at the 18th time step. In /scratch/cmelone/bt_05312023, mid has the backtraces from the 8th time step and end has the backtraces from the 18th time step. I wasn't sure how to interpret the differences, but I dumped the output of diff mid/c0003.log end/c0003.log (for one randomly picked node) into diff_c0003.log.
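For reference, snapshots like these can be taken and compared roughly as follows (PIDs, node names, and the sleep interval are illustrative):

# Illustrative only: dump all thread backtraces of a frozen rank twice, some
# time apart, and diff the results to see which threads are still progressing.
ssh c0003 'gdb -p <PID> -batch -ex "thread apply all bt"' > mid/c0003.log
sleep 600
ssh c0003 'gdb -p <PID> -batch -ex "thread apply all bt"' > end/c0003.log
diff mid/c0003.log end/c0003.log > diff_c0003.log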

lightsighter commented 1 year ago

This is not a freeze; it is a crash in Realm that only looks like a freeze because you set REALM_FREEZE_ON_ERROR=1:

Thread 27 (Thread 0x7fb35bffe840 (LWP 277537)):
#0  0x00007fb44651023f in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x7fb20433c920, rem=rem@entry=0x7fb20433c920) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
#1  0x00007fb446515ec7 in __GI___nanosleep (requested_time=requested_time@entry=0x7fb20433c920, remaining=remaining@entry=0x7fb20433c920) at nanosleep.c:27
#2  0x00007fb446515dfe in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#3  0x00007fb449a298db in Realm::realm_freeze (signal=6) at /home/cmelone/lg/runtime/realm/runtime_impl.cc:187
#4  <signal handler called>
#5  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#6  0x00007fb446455859 in __GI_abort () at abort.c:79
#7  0x00007fb446455729 in __assert_fail_base (fmt=0x7fb4465eb588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x7fb44a768e1a "amt == len", file=0x7fb44a768df0 "/home/cmelone/lg/runtime/realm/logging.cc", line=129, function=<optimized out>) at assert.c:92
#8  0x00007fb446466fd6 in __GI___assert_fail (assertion=0x7fb44a768e1a "amt == len", file=0x7fb44a768df0 "/home/cmelone/lg/runtime/realm/logging.cc", line=129, function=0x7fb44a768da8 "virtual void Realm::LoggerFileStream::write(const char*, size_t)") at assert.c:101
#9  0x00007fb449ec5693 in Realm::LoggerFileStream::write (this=0x55be800c2940, buffer=0x7fb35bffb2b0 "[5 - 7fb35bffe840] 1459.136160 {2}{legion_gc}: GC Add Base Ref 0 241605 5 25 1\n", len=79) at /home/cmelone/lg/runtime/realm/logging.cc:129
#10 0x00007fb449ec555f in Realm::LoggerFileStream::log_msg (this=0x55be800c2940, level=Realm::Logger::LEVEL_INFO, name=0x7fb44b235630 <Legion::Internal::log_garbage+16> "legion_gc", msgdata=0x7fb35bffc410 "GC Add Base Ref 0 241605 5 25 1", msglen=31) at /home/cmelone/lg/runtime/realm/logging.cc:112
#11 0x00007fb449ec4b0b in Realm::Logger::log_msg (this=0x7fb44b235620 <Legion::Internal::log_garbage>, level=Realm::Logger::LEVEL_INFO, msgdata=0x7fb35bffc410 "GC Add Base Ref 0 241605 5 25 1", msglen=31) at /home/cmelone/lg/runtime/realm/logging.cc:584
#12 0x000055be7fa7db59 in Realm::LoggerMessage::~LoggerMessage (this=0x7fb35bffc3b0, __in_chrg=<optimized out>) at /home/cmelone/legion/runtime/realm/logging.inl:590
#13 0x00007fb448b16eeb in Realm::Logger::info (this=0x7fb44b235620 <Legion::Internal::log_garbage>, fmt=0x7fb44a5ad0b0 "GC Add Base Ref %d %lld %d %d %d") at /home/cmelone/lg/runtime/realm/logging.inl:425
#14 0x00007fb449247c54 in Legion::Internal::log_base_ref<true> (kind=Legion::Internal::GC_REF_KIND, did=241605, local_space=5, src=Legion::Internal::LIVE_EXPR_REF, cnt=1) at /home/cmelone/lg/runtime/legion/garbage_collection.h:456
#15 0x00007fb449248646 in Legion::Internal::DistributedCollectable::check_global_and_increment (this=0x7fb2a804e610, source=Legion::Internal::LIVE_EXPR_REF, cnt=1) at /home/cmelone/lg/runtime/legion/garbage_collection.h:769
#16 0x00007fb44920568c in Legion::Internal::IndexSpaceOperation::try_add_live_reference (this=0x7fb2a804e5a0) at /home/cmelone/lg/runtime/legion/region_tree.cc:7005
#17 0x00007fb4492048ba in Legion::Internal::IndexSpaceExpression::get_canonical_expression (this=0x7fb25802b200, forest=0x55be813208a0) at /home/cmelone/lg/runtime/legion/region_tree.cc:6746
#18 0x00007fb4491ff6f2 in Legion::Internal::RegionTreeForest::intersect_index_spaces (this=0x55be813208a0, lhs=0x7fb2a4117520, rhs=0x7fb25802b200) at /home/cmelone/lg/runtime/legion/region_tree.cc:5708
#19 0x00007fb4490571fb in Legion::Internal::EquivalenceSet::find_valid_instances (this=0x7fb34e879620, analysis=..., expr=0x7fb2a4117520, expr_covers=false, user_mask=..., deferral_events=std::set with 0 elements, applied_events=std::set with 0 elements, already_deferred=false) at /home/cmelone/lg/runtime/legion/legion_analysis.cc:10735
#20 0x00007fb44903faf4 in Legion::Internal::ValidInstAnalysis::perform_analysis (this=0x7fb35bffcec0, set=0x7fb34e879620, expr=0x7fb2a4117520, expr_covers=false, mask=..., deferral_events=std::set with 0 elements, applied_events=std::set with 0 elements, already_deferred=false) at /home/cmelone/lg/runtime/legion/legion_analysis.cc:6639
#21 0x00007fb4490569df in Legion::Internal::EquivalenceSet::analyze (this=0x7fb34e879620, analysis=..., expr=0x7fb2a4117520, expr_covers=false, traversal_mask=..., deferral_events=std::set with 0 elements, applied_events=std::set with 0 elements, already_deferred=false) at /home/cmelone/lg/runtime/legion/legion_analysis.cc:10642
#22 0x00007fb44903ae7c in Legion::Internal::PhysicalAnalysis::analyze (this=0x7fb35bffcec0, set=0x7fb34e879620, mask=..., deferral_events=std::set with 0 elements, applied_events=std::set with 0 elements, precondition=..., already_deferred=false) at /home/cmelone/lg/runtime/legion/legion_analysis.cc:5686
#23 0x00007fb44903b7d7 in Legion::Internal::PhysicalAnalysis::perform_traversal (this=0x7fb35bffcec0, precondition=..., info=..., applied_events=std::set with 0 elements) at /home/cmelone/lg/runtime/legion/legion_analysis.cc:5841
#24 0x00007fb4491ece31 in Legion::Internal::RegionTreeForest::physical_premap_region (this=0x55be813208a0, op=0x7fb294106670, index=0, req=..., version_info=..., targets=..., collectives=..., map_applied_events=std::set with 0 elements) at /home/cmelone/lg/runtime/legion/region_tree.cc:1848
#25 0x00007fb448cff7a5 in Legion::Internal::SingleTask::initialize_map_task_input (this=0x7fb2941064c0, input=..., output=..., must_epoch_owner=0x0) at /home/cmelone/lg/runtime/legion/legion_tasks.cc:2692
#26 0x00007fb448d04459 in Legion::Internal::SingleTask::invoke_mapper (this=0x7fb2941064c0, must_epoch_owner=0x0) at /home/cmelone/lg/runtime/legion/legion_tasks.cc:3567
#27 0x00007fb448d06b11 in Legion::Internal::SingleTask::map_all_regions (this=0x7fb2941064c0, must_epoch_op=0x0, defer_args=0x7fb2e40972c0) at /home/cmelone/lg/runtime/legion/legion_tasks.cc:3965
#28 0x00007fb448d14c85 in Legion::Internal::PointTask::perform_mapping (this=0x7fb2941064c0, must_epoch_owner=0x0, args=0x7fb2e40972c0) at /home/cmelone/lg/runtime/legion/legion_tasks.cc:7142
#29 0x00007fb44932de10 in Legion::Internal::Runtime::legion_runtime_task (args=0x7fb2e40972c0, arglen=52, userdata=0x55be81317510, userlen=8, p=...) at /home/cmelone/lg/runtime/legion/runtime.cc:32327
#30 0x00007fb449d9fe08 in Realm::LocalTaskProcessor::execute_task (this=0x55be7fcc5170, func_id=4, task_args=...) at /home/cmelone/lg/runtime/realm/proc_impl.cc:1147
#31 0x00007fb449ba5ee9 in Realm::Task::execute_on_processor (this=0x7fb2e4097140, p=...) at /home/cmelone/lg/runtime/realm/tasks.cc:303
#32 0x00007fb449baa132 in Realm::KernelThreadTaskScheduler::execute_task (this=0x55be7fdeb540, task=0x7fb2e4097140) at /home/cmelone/lg/runtime/realm/tasks.cc:1367
#33 0x00007fb449ba8e7c in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x55be7fdeb540) at /home/cmelone/lg/runtime/realm/tasks.cc:1106
#34 0x00007fb449ba94cd in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x55be7fdeb540) at /home/cmelone/lg/runtime/realm/tasks.cc:1218
#35 0x00007fb449bb1bb6 in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x55be7fdeb540) at /home/cmelone/lg/runtime/realm/threads.inl:97
#36 0x00007fb449b7d8bc in Realm::KernelThread::pthread_entry (data=0x7fb20419a860) at /home/cmelone/lg/runtime/realm/threads.cc:781
#37 0x00007fb44640c609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#38 0x00007fb446552133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

@streichler: Thread 27 in PID 199595 on c0004

streichler commented 1 year ago

A logging write failed, so the disk is probably full or there was some network filesystem hiccup.

lightsighter commented 1 year ago

I'd turn off the legion gc logging as it is unnecessary at this point.
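Concretely, that means undoing the instrumentation added earlier, roughly as follows (whether compile.sh forwards CC_FLAGS is still an assumption):

# Illustrative only: rebuild without the GC instrumentation ...
export CC_FLAGS="<previous flags without -DLEGION_GC>"
REBUILD=1 ./compile.sh
# ... and drop "-level legion_gc=2" from the launch line, or at least point
# -logfile at a filesystem with free space, e.g. -logfile /scratch/<user>/%.log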

cmelone commented 1 year ago

Sounds good. I redirected the output to the scratch filesystem and the execution succeeded. I'm going to re-run the configuration in release mode a few times to verify this; will update.

cmelone commented 1 year ago

I launched this config in ~20 different jobs to check in debug mode, and it both succeeds and throws errors non-deterministically.

I can try reproducing on Sapling but I don't want to fill up the queue with a bunch of 4 node jobs at the moment. Let me know if the reproducer would be helpful though.

In release mode, it throws dozens of warnings (Legion warning 1114) like this before the program crashes:

LEGION WARNING: Failed to find a refinement for KD tree with 3 dimensions and 64 rectangles.

Most of the runs crash on an application-level assert that indicates a task received bad data.

This error I encountered in debug mode:

*** double free or corruption (!prev): 0x00007fe3900056c0 ***

bt:

#0  0x00007fe448a259fd in nanosleep () from /lib64/libc.so.6
#1  0x00007fe448a25894 in sleep () from /lib64/libc.so.6
#2  0x00007fe44bdf9f82 in Realm::realm_freeze (signal=6) at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/runtime_impl.cc:187
#3  <signal handler called>
#4  0x00007fe448996387 in raise () from /lib64/libc.so.6
#5  0x00007fe448997a78 in abort () from /lib64/libc.so.6
#6  0x00007fe4489d8f67 in __libc_message () from /lib64/libc.so.6
#7  0x00007fe4489e1329 in _int_free () from /lib64/libc.so.6
#8  0x00007fe44be2927a in __gnu_cxx::new_allocator<Realm::DynamicTableFreeList<Realm::DynamicTableAllocator<Realm::GenEventImpl, 11ul, 16ul> >*>::deallocate (
    this=0x7fe44d73c4c0 <Realm::DynamicTableAllocator<Realm::GenEventImpl, 11ul, 16ul>::get_registered_freelists()::registered_freelists>, __p=0x7fe3900056c0)
    at /opt/ohpc/pub/compiler/gcc/8.3.0/include/c++/8.3.0/ext/new_allocator.h:125
#9  0x00007fe44be27c68 in std::allocator_traits<std::allocator<Realm::DynamicTableFreeList<Realm::DynamicTableAllocator<Realm::GenEventImpl, 11ul, 16ul> >*> >::deallocate (__a=...,
    __p=0x7fe3900056c0, __n=256) at /opt/ohpc/pub/compiler/gcc/8.3.0/include/c++/8.3.0/bits/alloc_traits.h:462
#10 0x00007fe44be25422 in std::_Vector_base<Realm::DynamicTableFreeList<Realm::DynamicTableAllocator<Realm::GenEventImpl, 11ul, 16ul> >*, std::allocator<Realm::DynamicTableFreeList<Realm::DynamicTableAllocator<Realm::GenEventImpl, 11ul, 16ul> >*> >::_M_deallocate (
    this=0x7fe44d73c4c0 <Realm::DynamicTableAllocator<Realm::GenEventImpl, 11ul, 16ul>::get_registered_freelists()::registered_freelists>, __p=0x7fe3900056c0, __n=256)
    at /opt/ohpc/pub/compiler/gcc/8.3.0/include/c++/8.3.0/bits/stl_vector.h:304
#11 0x00007fe44be21aa3 in std::vector<Realm::DynamicTableFreeList<Realm::DynamicTableAllocator<Realm::GenEventImpl, 11ul, 16ul> >*, std::allocator<Realm::DynamicTableFreeList<Realm::DynamicTableAllocator<Realm::GenEventImpl, 11ul, 16ul> >*> >::_M_realloc_insert<Realm::DynamicTableFreeList<Realm::DynamicTableAllocator<Realm::GenEventImpl, 11ul, 16ul> >* const&> (
    this=0x7fe44d73c4c0 <Realm::DynamicTableAllocator<Realm::GenEventImpl, 11ul, 16ul>::get_registered_freelists()::registered_freelists>, __position=...,
    __args#0=@0x7fe15f8a56a8: 0x7fe38a69faf8) at /opt/ohpc/pub/compiler/gcc/8.3.0/include/c++/8.3.0/bits/vector.tcc:469
#12 0x00007fe44be1cb8e in std::vector<Realm::DynamicTableFreeList<Realm::DynamicTableAllocator<Realm::GenEventImpl, 11ul, 16ul> >*, std::allocator<Realm::DynamicTableFreeList<Realm::DynamicTableAllocator<Realm::GenEventImpl, 11ul, 16ul> >*> >::push_back (
    this=0x7fe44d73c4c0 <Realm::DynamicTableAllocator<Realm::GenEventImpl, 11ul, 16ul>::get_registered_freelists()::registered_freelists>, __x=@0x7fe15f8a56a8: 0x7fe38a69faf8)
    at /opt/ohpc/pub/compiler/gcc/8.3.0/include/c++/8.3.0/bits/stl_vector.h:1085
#13 0x00007fe44be163e1 in Realm::DynamicTableAllocator<Realm::GenEventImpl, 11ul, 16ul>::register_freelist (free_list=0x7fe38a69faf8)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/runtime_impl.h:90
#14 0x00007fe44be0d52d in Realm::DynamicTableFreeList<Realm::DynamicTableAllocator<Realm::GenEventImpl, 11ul, 16ul> >::DynamicTableFreeList (this=0x7fe38a69faf8, _table=..., _owner=12,
    _parent_list=0x1de45f0) at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/dynamic_table.inl:333
#15 0x00007fe44c110a13 in Realm::ProcessorImpl::ProcessorImpl (this=0x7fe38a69faf0, _me=..., _kind=Realm::Processor::PROC_GROUP, _num_cores=1)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/proc_impl.cc:474
#16 0x00007fe44c111141 in Realm::ProcessorGroupImpl::ProcessorGroupImpl (this=0x7fe38a69faf0) at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/proc_impl.cc:633
#17 0x00007fe44be24340 in Realm::DynamicTableNode<Realm::ProcessorGroupImpl, 16ul, Realm::UnfairMutex, unsigned long long>::DynamicTableNode (this=0x7fe38a69faa0, _level=0, _first_index=0,
    _last_index=15) at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/dynamic_table.inl:46
#18 0x00007fe44be1fb38 in Realm::DynamicTableAllocator<Realm::ProcessorGroupImpl, 10ul, 4ul>::new_leaf_node (first_index=0, last_index=15, owner=1, free_list_head=0x0, free_list_tail=0x0)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/runtime_impl.h:109
#19 0x00007fe44be1a17e in Realm::DynamicTable<Realm::DynamicTableAllocator<Realm::ProcessorGroupImpl, 10ul, 4ul> >::new_tree_node (this=0x7fe388000930, level=0, first_index=0, last_index=15,
    owner=1, free_list_head=0x0, free_list_tail=0x0) at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/dynamic_table.inl:124
#20 0x00007fe44be12241 in Realm::DynamicTable<Realm::DynamicTableAllocator<Realm::ProcessorGroupImpl, 10ul, 4ul> >::lookup_entry (this=0x7fe388000930, index=1, owner=1, free_list_head=0x0,
    free_list_tail=0x0) at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/dynamic_table.inl:238
#21 0x00007fe44be05572 in Realm::RuntimeImpl::get_procgroup_impl (this=0x1a8f540, id=...) at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/runtime_impl.cc:2727
#22 0x00007fe44be05253 in Realm::RuntimeImpl::get_processor_impl (this=0x1a8f540, id=...) at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/runtime_impl.cc:2698
#23 0x00007fe44c10ec56 in Realm::Processor::spawn (this=0x7fe15f8a6020, func_id=7, args=0x7fe388000d00, arglen=32, reqs=..., wait_on=..., priority=0)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/proc_impl.cc:98
#24 0x00007fe44b7b0426 in Legion::Internal::Runtime::handle_endpoint_creation (this=0x2f0da00, derez=...)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/legion/runtime.cc:21073
#25 0x00007fe44b7d301a in Legion::Internal::Runtime::endpoint_runtime_task (args=0x7fe3940a24c0, arglen=28, userdata=0x2f08e40, userlen=8, p=...)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/legion/runtime.cc:32657
#26 0x00007fe44c113976 in Realm::LocalTaskProcessor::execute_task (this=0x1e62650, func_id=7, task_args=...)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/proc_impl.cc:1147
#27 0x00007fe44bf4f864 in Realm::Task::execute_on_processor (this=0x7fe3940a2340, p=...) at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/tasks.cc:303
#28 0x00007fe44bf535e8 in Realm::KernelThreadTaskScheduler::execute_task (this=0x1e629b0, task=0x7fe3940a2340)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/tasks.cc:1367
#29 0x00007fe44bf5245f in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x1e629b0) at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/tasks.cc:1106
#30 0x00007fe44bf52a82 in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x1e629b0) at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/tasks.cc:1218
#31 0x00007fe44bf59dba in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x1e629b0)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/threads.inl:97
#32 0x00007fe44bf2b10d in Realm::KernelThread::pthread_entry (data=0x7fe37c073290) at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/threads.cc:781
#33 0x00007fe448543ea5 in start_thread () from /lib64/libpthread.so.0
#34 0x00007fe448a5eb0d in clone () from /lib64/libc.so.6

cmelone commented 1 year ago

Discussed with @mariodirenzo and found that the issue is back to this condition (i.e. the reduction is not being done as expected)

lightsighter commented 1 year ago

Discussed with @mariodirenzo and found that the issue is back to this https://github.com/StanfordLegion/legion/issues/1420#issuecomment-1502103318 (i.e. the reduction is not being done as expected)

I'm going to ask for the same thing I asked @mariodirenzo for already: I need detailed Legion Spy logs of a failing run with the smallest number of nodes possible and I need to know exactly which points in which field of which region are invalid. At that point I will need to hack Legion Spy to try to analyze just the failing points and attempt to validate the execution so we can determine whether the issue is in Legion or Realm.
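For reference, Legion Spy logging is enabled with the runtime's -lg:spy flag plus per-rank log files; a rough sketch (the launch line and paths below are stand-ins, and detailed logging may additionally require a build with -DLEGION_SPY):

# Illustrative only: capture Legion Spy logs from a failing run, one file per rank
prometeo_ConstPropMix.exec <usual arguments> -lg:spy -logfile /scratch/<user>/spy/%.log
# The resulting logs are then analyzed offline with legion/tools/legion_spy.py.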

double free or corruption (!prev): 0x00007fe3900056c0

@muraj please take a look at this error when you get back from vacation (same error and backtrace as NVBug 4141082).

lightsighter commented 1 year ago

LEGION WARNING: Failed to find a refinement for KD tree with 3 dimensions and 64 rectangles.

You can probably ignore these for now. It just means your partition cannot be spatially decomposed very well, which could cause performance issues, but is not a correctness issue.

mariodirenzo commented 1 year ago

I'm going to ask for the same thing I asked @mariodirenzo for already

The logs are at /home/mariodr/htr/solverTests/RecycleBoundary. The task that has received the unreduced data is called AddRecycleAverage and is half-logged. The field that has not been reduced is called avg_rho

lightsighter commented 1 year ago

What is the name of the region (index space, field space, tree id), the field id of avg_rho, and at least one of the bad points?

mariodirenzo commented 1 year ago

What is the name of the region?

I do not know. It is the second region requirement of the task AddRecycleAverage.
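For what it's worth, if the task body (or a wrapper around it) has access to the Legion::Task object, those IDs can be printed directly; a sketch under that assumption (HTR's generated tasks may not expose this conveniently):

// Illustrative only: print the identifying IDs of a task's second region requirement.
#include <cstdio>
#include "legion.h"

void report_region_ids(const Legion::Task *task)
{
  const Legion::RegionRequirement &req = task->regions[1];  // second region requirement
  std::printf("index space %u, field space %u, tree id %u\n",
              req.region.get_index_space().get_id(),
              req.region.get_field_space().get_id(),
              req.region.get_tree_id());
}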

the field id of avg_rho

103

at least one of the bad points?

I would need to rerun to give this info. Is it essential?