StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

[HTR] Segmentation faults at 16 nodes #1420

Closed — cmelone closed this issue 7 months ago

cmelone commented 1 year ago

I am able to run one of our applications (16 ranks, 1 rank per node) on Legion commit cba415a857c2586b2ad2f4848d6d1cd75de7df00.

However, on 9c6c90b9e3857196da2659a29140f2d7686832bb, I get segmentation faults and non-deterministic errors such as:

prometeo_ConstPropMix.exec: prometeo_variables.cc:75: static void UpdatePropertiesFromPrimitiveTask::cpu_base_impl(const UpdatePropertiesFromPrimitiveTask::Args&, const std::vector<Legion::PhysicalRegion>&, const std::vector<Legion::Future>&, Legion::Context, Legion::Runtime*): Assertion `args.mix.CheckMixture(acc_MolarFracs[p])' failed.
[5 - 7fbc93ba8840] 1193.644387 {6}{realm}: invalid event handle: id=7fbcab057570
prometeo_ConstPropMix.exec: /home/hpcc/gitlabci/multi/codes/legion-cpu-release/runtime/realm/runtime_impl.cc:2509: Realm::EventImpl* Realm::RuntimeImpl::get_event_impl(Realm::Event): Assertion `0 && "invalid event handle"' failed.

This program does run successfully with DEBUG=1. I am actively running this test case with smaller configurations to see if I can reproduce outside of this specific config.

Edit:

16 ranks, 4 ranks per node works

lightsighter commented 1 year ago

You have to send backtraces if you want us to debug things.

cmelone commented 1 year ago

Hi,

Running in debug mode does not produce any errors. Are there any alternative strategies I should try to get helpful info about the error?

lightsighter commented 1 year ago

You don't need debug mode to produce a backtrace. Adding REALM_BACKTRACE=1 will produce a backtrace without line numbers. A better approach is to build with CXXFLAGS="-g -O2" and run with REALM_FREEZE_ON_ERROR=1, which freezes the process on an error so you can attach a debugger and get a backtrace with line numbers.
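
For example, something along these lines (a sketch only; the make invocation and executable name are placeholders for whatever your build and job scripts actually use):

# rebuild Legion and the application with optimizations plus debug symbols
CXXFLAGS="-g -O2" make -j
# option 1: print a backtrace (without line numbers) when Realm hits a fatal error
REALM_BACKTRACE=1 ./prometeo_ConstPropMix.exec ...
# option 2: freeze the failing process instead, then attach a debugger from another shell
REALM_FREEZE_ON_ERROR=1 ./prometeo_ConstPropMix.exec ...
gdb -p <pid of the frozen process>   # then: thread apply all bt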

cmelone commented 1 year ago

Thank you, I will try that.

cmelone commented 1 year ago

error:

prometeo_ConstPropMix.exec: prometeo_variables.cc:75: static void UpdatePropertiesFromPrimitiveTask::cpu_base_impl(const UpdatePropertiesFromPrimitiveTask::Args&, const std::vector<Legion::PhysicalRegion>&, const std::vector<Legion::Future>&, Legion::Context, Legion::Runtime*): Assertion `args.mix.CheckMixture(acc_MolarFracs[p])' failed.

backtrace:

Thread 3 (Thread 0x7f94fd935840 (LWP 25822)):
#0  0x00007f951262680d in nanosleep () from /lib64/libc.so.6
#1  0x00007f95126266a4 in sleep () from /lib64/libc.so.6
#2  0x00007f9513ea3042 in Realm::realm_freeze(int) () from /home/hpcc/gitlabci/multi/codes/legion-cpu-release/bindings/regent/libregent.so
#3  <signal handler called>
#4  0x00007f9512597337 in raise () from /lib64/libc.so.6
#5  0x00007f9512598a28 in abort () from /lib64/libc.so.6
#6  0x00007f9512590156 in __assert_fail_base () from /lib64/libc.so.6
#7  0x00007f9512590202 in __assert_fail () from /lib64/libc.so.6
#8  0x00000000009a9c54 in UpdatePropertiesFromPrimitiveTask::_ZN33UpdatePropertiesFromPrimitiveTask13cpu_base_implERKNS_4ArgsERKSt6vectorIN6Legion14PhysicalRegionESaIS5_EERKS3_INS4_6FutureESaISA_EEPNS4_8Internal11TaskContextEPNS4_7RuntimeE._omp_fn.0(void) () at /home/hpcc/gitlabci/multi/codes/htr-cpu-release/src/Mixtures/ConstPropMix.inl:127
#9  0x00007f951411e036 in GOMP_parallel () from /home/hpcc/gitlabci/multi/codes/legion-cpu-release/bindings/regent/libregent.so
#10 0x00000000009ab191 in UpdatePropertiesFromPrimitiveTask::cpu_base_impl (args=..., regions=..., futures=..., ctx=0x7f94b67a0160, runtime=0x1c50890) at prometeo_variables.cc:68
#11 0x00000000009ad981 in base_cpu_wrapper<UpdatePropertiesFromPrimitiveTask> (runtime=<optimized out>, ctx=<optimized out>, regions=..., task=<optimized out>)
    at /home/hpcc/gitlabci/multi/codes/legion-cpu-release/runtime/legion/legion.inl:20385
#12 Legion::LegionTaskWrapper::legion_task_wrapper<&TaskHelper::base_cpu_wrapper<UpdatePropertiesFromPrimitiveTask> > (args=<optimized out>, arglen=<optimized out>, 
    userdata=<optimized out>, userlen=<optimized out>, p=...) at /home/hpcc/gitlabci/multi/codes/legion-cpu-release/runtime/legion/legion.inl:20396
#13 0x00007f951409c171 in Realm::LocalTaskProcessor::execute_task(unsigned int, Realm::ByteArrayRef const&) ()
   from /home/hpcc/gitlabci/multi/codes/legion-cpu-release/bindings/regent/libregent.so
#14 0x00007f9513f8f153 in Realm::Task::execute_on_processor(Realm::Processor) () from /home/hpcc/gitlabci/multi/codes/legion-cpu-release/bindings/regent/libregent.so
#15 0x00007f9513f8f1e6 in Realm::UserThreadTaskScheduler::execute_task(Realm::Task*) () from /home/hpcc/gitlabci/multi/codes/legion-cpu-release/bindings/regent/libregent.so
#16 0x00007f9513f91b6c in Realm::ThreadedTaskScheduler::scheduler_loop() () from /home/hpcc/gitlabci/multi/codes/legion-cpu-release/bindings/regent/libregent.so
#17 0x00007f9513f776af in Realm::UserThread::uthread_entry() () from /home/hpcc/gitlabci/multi/codes/legion-cpu-release/bindings/regent/libregent.so
#18 0x00007f95125a9140 in ?? () from /lib64/libc.so.6
#19 0x0000000000000000 in ?? ()

I've also attached the full dump from gdb for this process.

8x8x2.log

elliottslaughter commented 1 year ago

This looks like an application-level assert? The backtrace shows it going into UpdatePropertiesFromPrimitiveTask::cpu_base_impl and then into an OpenMP loop. The badly mangled name at frame #8 is the code inside an OpenMP parallel for loop.

cmelone commented 1 year ago

I will let @mariodirenzo comment on that aspect. Running the same application with a different input gives the second error I mentioned in the original post. Not sure if this is also considered application level:

Invoking Legion on 16 rank(s), 16 node(s) (1 rank(s) per node), as follows:
/home/hpcc/gitlabci/multi/codes/htr-cpu-release/src/prometeo_ConstPropMix.exec -i 8x2x8Run.json -o 8x2x8 -ll:cpu 1 -ll:ocpu 2 -ll:onuma 1 -ll:othr 7 -ll:ostack 8 -ll:util 2 -ll:io 1 -ll:bgwork 2 -ll:cpu_bgwork 100 -ll:util_bgwork 100 -ll:csize 25000 -ll:rsize 512 -ll:ib_rsize 512 -ll:gsize 0 -ll:stacksize 8 -lg:sched -1 -lg:hysteresis 0
prometeo_ConstPropMix.exec: /home/hpcc/gitlabci/multi/codes/legion-cpu-release/runtime/realm/runtime_impl.cc:2509: Realm::EventImpl* Realm::RuntimeImpl::get_event_impl(Realm::Event): Assertion `0 && "invalid event handle"' failed.
[11 - 7f38e7bf3840] 1523.549360 {6}{realm}: invalid event handle: id=7f38ff0a1570

backtrace:

Thread 7 (Thread 0x7f38e7bf3840 (LWP 25974)):
#0  0x00007f38fc8e780d in nanosleep () from /lib64/libc.so.6
#1  0x00007f38fc8e76a4 in sleep () from /lib64/libc.so.6
#2  0x00007f38fe164042 in Realm::realm_freeze(int) () from /home/hpcc/gitlabci/multi/codes/legion-cpu-release/bindings/regent/libregent.so
#3  <signal handler called>
#4  0x00007f38fc858337 in raise () from /lib64/libc.so.6
#5  0x00007f38fc859a28 in abort () from /lib64/libc.so.6
#6  0x00007f38fc851156 in __assert_fail_base () from /lib64/libc.so.6
#7  0x00007f38fc851202 in __assert_fail () from /lib64/libc.so.6
#8  0x00007f38fe168dd6 in Realm::RuntimeImpl::get_event_impl(Realm::Event) () from /home/hpcc/gitlabci/multi/codes/legion-cpu-release/bindings/regent/libregent.so
#9  0x00007f38fe341d3e in Realm::Event::has_triggered_faultaware(bool&) const () from /home/hpcc/gitlabci/multi/codes/legion-cpu-release/bindings/regent/libregent.so
#10 0x00007f38fe34b5b3 in Realm::GenEventImpl::merge_events(Realm::span<Realm::Event const, 18446744073709551615ul>, bool) () from /home/hpcc/gitlabci/multi/codes/legion-cpu-release/bindings/regent/libregent.so
#11 0x00007f38fdc3fbf8 in Legion::Internal::Runtime::merge_events(Legion::Internal::TraceInfo const*, std::vector<Legion::Internal::ApEvent, std::allocator<Legion::Internal::ApEvent> > const&) () from /home/hpcc/gitlabci/multi/codes/legion-cpu-release/bindings/regent/libregent.so
#12 0x00007f38fdc6b3ce in Legion::Internal::CopyFillAggregator::issue_updates(Legion::Internal::PhysicalTraceInfo const&, Legion::Internal::ApEvent, bool, bool, std::map<Legion::Internal::InstanceView*, std::vector<Legion::Internal::ApEvent, std::allocator<Legion::Internal::ApEvent> >, std::less<Legion::Internal::InstanceView*>, std::allocator<std::pair<Legion::Internal::InstanceView* const, std::vector<Legion::Internal::ApEvent, std::allocator<Legion::Internal::ApEvent> > > > >*) () from /home/hpcc/gitlabci/multi/codes/legion-cpu-release/bindings/regent/libregent.so
#13 0x00007f38fdc6b822 in Legion::Internal::CopyFillAggregator::handle_aggregation(void const*) () from /home/hpcc/gitlabci/multi/codes/legion-cpu-release/bindings/regent/libregent.so
#14 0x00007f38fde06188 in Legion::Internal::Runtime::legion_runtime_task(void const*, unsigned long, void const*, unsigned long, Realm::Processor) () from /home/hpcc/gitlabci/multi/codes/legion-cpu-release/bindings/regent/libregent.so
#15 0x00007f38fe35d171 in Realm::LocalTaskProcessor::execute_task(unsigned int, Realm::ByteArrayRef const&) () from /home/hpcc/gitlabci/multi/codes/legion-cpu-release/bindings/regent/libregent.so
#16 0x00007f38fe250153 in Realm::Task::execute_on_processor(Realm::Processor) () from /home/hpcc/gitlabci/multi/codes/legion-cpu-release/bindings/regent/libregent.so
#17 0x00007f38fe2501e6 in Realm::UserThreadTaskScheduler::execute_task(Realm::Task*) () from /home/hpcc/gitlabci/multi/codes/legion-cpu-release/bindings/regent/libregent.so
#18 0x00007f38fe252b6c in Realm::ThreadedTaskScheduler::scheduler_loop() () from /home/hpcc/gitlabci/multi/codes/legion-cpu-release/bindings/regent/libregent.so
#19 0x00007f38fe2386af in Realm::UserThread::uthread_entry() () from /home/hpcc/gitlabci/multi/codes/legion-cpu-release/bindings/regent/libregent.so
#20 0x00007f38fc86a140 in ?? () from /lib64/libc.so.6
#21 0x0000000000000000 in ?? ()

8x2x8.log

lightsighter commented 1 year ago

You clearly didn't build the way I told you to because the second backtrace doesn't have line numbers. You need -g in CXXFLAGS to get the line numbers. The second backtrace is almost certainly going to require a reproducer for me to debug.

mariodirenzo commented 1 year ago

This looks like an application-level assert? The backtrace shows it going into UpdatePropertiesFromPrimitiveTask::cpu_base_impl and then into an OpenMP loop. The badly mangled name at frame #8 is the code inside an OpenMP parallel for loop.

Yes, that is an application assert that tells us that something has gone terribly wrong in the execution. In the past month, after the merge of the collective branch, I've seen a few of these assertions triggered by bugs in the runtime.

The assertion that concerns me is

prometeo_ConstPropMix.exec: /home/hpcc/gitlabci/multi/codes/legion-cpu-release/runtime/realm/runtime_impl.cc:2509: Realm::EventImpl* Realm::RuntimeImpl::get_event_impl(Realm::Event): Assertion `0 && "invalid event handle"' failed.
[11 - 7f38e7bf3840] 1523.549360 {6}{realm}: invalid event handle: id=7f38ff0a1570

The error, for now, seems to show up only with more than 16 ranks and only with one rank per node. As @cmelone mentioned in his initial post, if we place 4 ranks per node, the code is executed correctly. If @cmelone and I find a way to reproduce this issue on less than 4 nodes, I'll be happy to provide the instructions to trigger the problem on Sapling. If we need 16 nodes to reproduce the issue, we need to find a way for @lightsighter to access the Stanford HPCC machine that @cmelone is currently using for the CI.

elliottslaughter commented 1 year ago

@mariodirenzo If you disable PSHM does that allow you to replicate on a 4-ranks-per-node configuration?

https://github.com/StanfordLegion/gasnet/blob/master/configs/config.ibv.release#L10
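
Concretely, the idea is to add GASNet's --disable-pshm switch to the configure options used by that config and rebuild (a sketch; the surrounding options are whatever config.ibv.release already contains):

# configs/config.ibv.release in StanfordLegion/gasnet: append to the existing configure options
<existing configure options> --disable-pshm
# then rebuild GASNet and relink Legion against it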

cmelone commented 1 year ago

You clearly didn't build the way I told you to because the second backtrace doesn't have line numbers. You need -g in CXXFLAGS to get the line numbers. The second backtrace is almost certainly going to require a reproducer for me to debug.

Apologies, I had compiled the application with the flag, not Legion. This backtrace should be complete for the UpdatePropertiesFromPrimitiveTask error:

Thread 3 (Thread 0x7f1f2074d840 (LWP 4623)):
#0  0x00007f1f3543d80d in nanosleep () from /lib64/libc.so.6
#1  0x00007f1f3543d6a4 in sleep () from /lib64/libc.so.6
#2  0x00007f1f36cbb1f2 in Realm::realm_freeze (signal=<optimized out>) at /home/hpcc/gitlabci/multi/codes/legion-cpu-release/runtime/realm/runtime_impl.cc:183
#3  <signal handler called>
#4  0x00007f1f353ae337 in raise () from /lib64/libc.so.6
#5  0x00007f1f353afa28 in abort () from /lib64/libc.so.6
#6  0x00007f1f353a7156 in __assert_fail_base () from /lib64/libc.so.6
#7  0x00007f1f353a7202 in __assert_fail () from /lib64/libc.so.6
#8  0x00000000009a9c34 in UpdatePropertiesFromPrimitiveTask::_ZN33UpdatePropertiesFromPrimitiveTask13cpu_base_implERKNS_4ArgsERKSt6vectorIN6Legion14PhysicalRegionESaIS5_EERKS3_INS4_6FutureESaISA_EEPNS4_8Internal11TaskContextEPNS4_7RuntimeE._omp_fn.0(void) () at /home/hpcc/gitlabci/multi/codes/htr-cpu-release/src/Mixtures/ConstPropMix.inl:127
#9  0x00007f1f36f361e6 in GOMP_parallel (
    fnptr=fnptr@entry=0x9a95e0 <UpdatePropertiesFromPrimitiveTask::_ZN33UpdatePropertiesFromPrimitiveTask13cpu_base_implERKNS_4ArgsERKSt6vectorIN6Legion14PhysicalRegionESaIS5_EERKS3_INS4_6FutureESaISA_EEPNS4_8Internal11TaskContextEPNS4_7RuntimeE._omp_fn.0(void)>, data=data@entry=0x7f1efd800c10, nthreads=nthreads@entry=0, flags=flags@entry=0)
    at /home/hpcc/gitlabci/multi/codes/legion-cpu-release/runtime/realm/openmp/openmp_api.cc:186
#10 0x00000000009ab171 in UpdatePropertiesFromPrimitiveTask::cpu_base_impl (args=..., regions=..., futures=..., ctx=0x7f1f058ae410, runtime=0x22a14b0) at prometeo_variables.cc:68
#11 0x00000000009ad961 in base_cpu_wrapper<UpdatePropertiesFromPrimitiveTask> (runtime=<optimized out>, ctx=<optimized out>, regions=..., task=<optimized out>)
    at /home/hpcc/gitlabci/multi/codes/legion-cpu-release/runtime/legion/legion.inl:20385
#12 Legion::LegionTaskWrapper::legion_task_wrapper<&TaskHelper::base_cpu_wrapper<UpdatePropertiesFromPrimitiveTask> > (args=<optimized out>, arglen=<optimized out>, userdata=<optimized out>, 
    userlen=<optimized out>, p=...) at /home/hpcc/gitlabci/multi/codes/legion-cpu-release/runtime/legion/legion.inl:20396
#13 0x00007f1f36eb4321 in Realm::LocalTaskProcessor::execute_task(unsigned int, Realm::ByteArrayRef const&) () at /home/hpcc/gitlabci/multi/codes/legion-cpu-release/runtime/realm/bytearray.inl:150
#14 0x00007f1f36da7303 in Realm::Task::execute_on_processor(Realm::Processor) () at /home/hpcc/gitlabci/multi/codes/legion-cpu-release/runtime/realm/runtime_impl.h:398
#15 0x00007f1f36da7396 in Realm::UserThreadTaskScheduler::execute_task (this=<optimized out>, task=<optimized out>) at /home/hpcc/gitlabci/multi/codes/legion-cpu-release/runtime/realm/tasks.cc:1632
#16 0x00007f1f36da9d1c in Realm::ThreadedTaskScheduler::scheduler_loop() () at /home/hpcc/gitlabci/multi/codes/legion-cpu-release/runtime/realm/tasks.cc:1103
#17 0x00007f1f36d8f85f in Realm::UserThread::uthread_entry() () at /home/hpcc/gitlabci/multi/codes/legion-cpu-release/runtime/realm/threads.cc:1355
#18 0x00007f1f353c0140 in ?? () from /lib64/libc.so.6
#19 0x0000000000000000 in ?? ()

If you disable PSHM does that allow you to replicate on a 4-ranks-per-node configuration?

Currently testing this, will get back to you.

lightsighter commented 1 year ago

Please start by getting a backtrace of the runtime crash with line numbers.

cmelone commented 1 year ago

If you disable PSHM does that allow you to replicate on a 4-ranks-per-node configuration?

Running the same application with this config yields this error:

srun: error: compute-10-5: task 0: Killed
slurmstepd: error: compute-10-5 [0] pmixp_client_v2.c:210 [_errhandler] mpi/pmix: ERROR: Error handler invoked: status = -25: Success (0)

elliottslaughter commented 1 year ago

Ok. That is a strange error, but I'm guessing it's not worth your time to debug unless you're dead set on shrinking the reproducer to fit in Sapling.

If you want to debug it, we'd probably need to talk to the GASNet guys to get an opinion on what's going on here.

cmelone commented 1 year ago

I may just focus on getting the backtrace for the runtime crash on 1 rank per node for now. The tricky thing is that the error alternates between the application and runtime crashes, so it's difficult to reproduce deterministically. I will get the backtrace as soon as it becomes available.

lightsighter commented 1 year ago

A backtrace with line numbers might go a long way to telling me what is actually going wrong.

cmelone commented 1 year ago

Legion process received signal 11: Segmentation fault

I believe this should be it:

Thread 7 (Thread 0x7f1ea7070840 (LWP 5910)):
#0  0x00007f1ebbd6680d in nanosleep () from /lib64/libc.so.6
#1  0x00007f1ebbd666a4 in sleep () from /lib64/libc.so.6
#2  0x00007f1ebd5e41f2 in Realm::realm_freeze (signal=<optimized out>) at /home/hpcc/gitlabci/multi/codes/legion-cpu-release/runtime/realm/runtime_impl.cc:183
#3  <signal handler called>
#4  Realm::Event::has_triggered_faultaware (this=0x0, poisoned=@0x7f1e8a07e70f: false) at /home/hpcc/gitlabci/multi/codes/legion-cpu-release/runtime/realm/event_impl.cc:69
#5  0x00007f1ebd7cb763 in Realm::GenEventImpl::merge_events(Realm::span<Realm::Event const, 18446744073709551615ul>, bool) ()
    at /home/hpcc/gitlabci/multi/codes/legion-cpu-release/runtime/realm/event_impl.cc:1075
#6  0x00007f1ebd7cbbe8 in Realm::Event::merge_events (wait_for=...) at /home/hpcc/gitlabci/multi/codes/legion-cpu-release/runtime/realm/event_impl.cc:98
#7  0x00007f1ebd0be2f8 in Legion::Internal::Runtime::merge_events (info=0x7f1e62f6c100, events=...) at /home/hpcc/gitlabci/multi/codes/legion-cpu-release/runtime/legion/runtime.h:5171
#8  0x00007f1ebd0e9ace in Legion::Internal::CopyFillAggregator::issue_updates(Legion::Internal::PhysicalTraceInfo const&, Legion::Internal::ApEvent, bool, bool, std::map<Legion::Internal::InstanceView*, std::vector<Legion::Internal::ApEvent, std::allocator<Legion::Internal::ApEvent> >, std::less<Legion::Internal::InstanceView*>, std::allocator<std::pair<Legion::Internal::InstanceView* const, std::vector<Legion::Internal::ApEvent, std::allocator<Legion::Internal::ApEvent> > > > >*) () at /home/hpcc/gitlabci/multi/codes/legion-cpu-release/runtime/legion/legion_analysis.cc:5113
#9  0x00007f1ebd0e9f22 in Legion::Internal::CopyFillAggregator::handle_aggregation (args=0x7f1e62f6c0f0) at /home/hpcc/gitlabci/multi/codes/legion-cpu-release/runtime/legion/legion_analysis.cc:5569
#10 0x00007f1ebd2848b8 in Legion::Internal::Runtime::legion_runtime_task (args=0x7f1e62f6c0f0, arglen=100, userdata=<optimized out>, userlen=<optimized out>, p=...)
    at /home/hpcc/gitlabci/multi/codes/legion-cpu-release/runtime/legion/runtime.cc:32038
#11 0x00007f1ebd7dd321 in Realm::LocalTaskProcessor::execute_task(unsigned int, Realm::ByteArrayRef const&) () at /home/hpcc/gitlabci/multi/codes/legion-cpu-release/runtime/realm/bytearray.inl:150
#12 0x00007f1ebd6d0303 in Realm::Task::execute_on_processor(Realm::Processor) () at /home/hpcc/gitlabci/multi/codes/legion-cpu-release/runtime/realm/runtime_impl.h:398
#13 0x00007f1ebd6d0396 in Realm::UserThreadTaskScheduler::execute_task (this=<optimized out>, task=<optimized out>) at /home/hpcc/gitlabci/multi/codes/legion-cpu-release/runtime/realm/tasks.cc:1632
#14 0x00007f1ebd6d2d1c in Realm::ThreadedTaskScheduler::scheduler_loop() () at /home/hpcc/gitlabci/multi/codes/legion-cpu-release/runtime/realm/tasks.cc:1103
#15 0x00007f1ebd6b885f in Realm::UserThread::uthread_entry() () at /home/hpcc/gitlabci/multi/codes/legion-cpu-release/runtime/realm/threads.cc:1355
#16 0x00007f1ebbce9140 in ?? () from /lib64/libc.so.6
#17 0x0000000000000000 in ?? ()

lightsighter commented 1 year ago

Pull and try again.

cmelone commented 1 year ago

After re-compiling, I am getting the original application error. Would it be useful to see the backtrace for this crash or do you want me to keep trying for a runtime crash?

lightsighter commented 1 year ago

I think we fixed the runtime error, and now you have an application bug to address.

mariodirenzo commented 1 year ago

Is there anything specific that you would advise looking at, considering that the same setup with the same version of HTR was deterministically successful on older versions of Legion (before the merge of the collective branch), and that the test is deterministically successful even with the current version of Legion if we increase the number of ranks per node?

lightsighter commented 1 year ago

Be very careful with your reductions, especially if you're getting raw pointers from your reduction accessors. Make sure you always perform actual reductions and don't overwrite the reduction buffers (the cuNumeric folks had a very nasty privilege-violation bug like this that resulted in non-deterministic failures).

If you think something is wrong with Legion, then Legion Spy is always the sanity check and will validate multi-node runs with control replication now.
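
In practice that looks something like the following (a sketch; the paths are illustrative and the legion_spy.py verification flags should be double-checked against its --help output):

# run with detailed Legion Spy logging, writing one log file per rank
./prometeo_ConstPropMix.exec <usual args> -lg:spy -logfile spy_%.log
# then verify the captured logs (PyPy speeds this up considerably)
pypy legion/tools/legion_spy.py -lpa spy_*.log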

mariodirenzo commented 1 year ago

I've been able to reproduce the problem on Sapling and I have a particular point of the code where the run non-deterministically fails.

In the setup that @cmelone has identified, there is an index-space task launch over 8 points which has reduction privileges on a few data fields with a sum operator. Each of these task instances produces a valid reduction instance in which every point looks sound (I'm checking this with a loop at the end of the task).

A single-point task consumes the reduced data. However, this task randomly receives a few points of one of the reduced fields that are zero. These randomly zeroed points generate a NaN in the calculation that leads to the assertion

prometeo_ConstPropMix.exec: prometeo_variables.cc:75: static void UpdatePropertiesFromPrimitiveTask::cpu_base_impl(const UpdatePropertiesFromPrimitiveTask::Args&, const std::vector<Legion::PhysicalRegion>&, const std::vector<Legion::Future>&, Legion::Context, Legion::Runtime*): Assertion `args.mix.CheckMixture(acc_MolarFracs[p])' failed.

reported by @cmelone.

I haven't been able to reproduce the issue with detailed Legion Spy logging, but even if I were able to, I would not be able to run the logical and physical analyses because the problem happens well into the calculation and the memory requirements of these analyses would not fit on any computer I have access to.

If I had to guess, the consuming task is racing with the runtime's reduction of the reduction instances. Most of the time the consuming task executes after the reduction and everything goes right. Sometimes the consuming task executes too early and sees a field of zeroes.

Let me know if you have any advice on how to proceed with the debugging of this issue

lightsighter commented 1 year ago

If I had to guess, the consuming task is racing with the runtime's reduction of the reduction instances.

What makes you say this? This theory conflicts with this piece of evidence:

A single-point task consumes the reduced data. However, this task randomly receives a few points of one of the reduced fields that are zero.

If your theory is true then there would be more than just a few points of the reduced fields that are zero. You would see entire ranges of the consumed instance that were zero because reductions hadn't been applied to them yet. The one and only exception would be if the "few points" of zeros were all contiguous and at the very "end" of the field in the consuming instance.

More questions:

This program runs correctly with bounds checks enabled (and no raw pointers)?
You're using Legion's sum reduction or your own? For which data type?
What are the locations of the source reduction instances and the consumed normal instance?
Are the zeros contiguous or spread out from each other?
Are the zeros always in the same field? If so, where is that field laid out relative to the other fields in the instance?
Are you using tracing? If so does the bad data occur during the trace capture phase or the trace replay phase?
You've validated a program with a smaller number of iterations with Legion Spy?
How many nodes/GPUs are required to reproduce the issue?
C++ or Regent (or both)?

mariodirenzo commented 1 year ago

The one and only exception would be if the "few points" of zeros were all contiguous and at the very "end" of the field in the consuming instance.

The zero points are all contiguous and I've only observed them at the end of the 1D reduction instance.

This program runs correctly with bounds checks enabled (and no raw pointers)?

I am not using raw pointers and the runs that I am using to debug are executed with bounds checks enabled

You're using Legion's sum reduction or your own? For which data type?

The data that is zero uses Legion's sum reduction for a double field in the task implementation. I guess that Regent is also using it during the task launch but I need @elliottslaughter to confirm it.

What are the locations of the source reduction instances and the consumed normal instance?

These are all in the system memory

Are the zeros contiguous or spread out from each other?

I've only seen them contiguous at the end of the 1D instance

Are the zeros always in the same field? If so, where is that field laid out relative to the other fields in the instance?

I am only checking one field that I know cannot be zero. There are other reductions, but for those I can't tell whether a zero is correct or wrong.

Are you using tracing? If so does the bad data occur during the trace capture phase or the trace replay phase?

Yes, I am using tracing and the problem happens during trace replays

You've validated a program with a smaller number of iterations with Legion Spy?

I've validated the program on one node/ one rank. However, that configuration is deterministically successful. I can try to get the legion spy logs for a run that potentially fails.

How many nodes/GPUs are required to reproduce the issue?

Four nodes with four ranks each for a total of 16 ranks. I am not using GPUs.

C++ or Regent (or both)?

This is mixed Regent and C++

elliottslaughter commented 1 year ago

This program runs correctly with bounds checks enabled (and no raw pointers)?

I am not using raw pointers and the runs that I am using to debug are executed with bounds checks enabled

And do you also use -fbounds-checks 1 for Regent? (Note it doesn't work with CUDA.)

You're using Legion's sum reduction or your own? For which data type?

The data that is zero uses Legion's sum reduction for a double field in the task implementation. I guess that Regent is also using it during the task launch but I need @elliottslaughter to confirm it.

What data type is it?

Regent uses Legion's reductions for scalar types (int, double, etc.) but not for arrays (e.g., double[4]).

mariodirenzo commented 1 year ago

And do you also use -fbounds-checks 1 for Regent? (Note it doesn't work with CUDA.)

Yes, I am passing the flag, and this is a CPU execution.

What data type is it?

It is a double field

lightsighter commented 1 year ago

I've validated the program on one node/ one rank. However, that configuration is deterministically successful. I can try to get the legion spy logs for a run that potentially fails.

Let's get logs for a few iterations of a run configuration that might fail (it doesn't have to fail). Since we're tracing, the trace replays will be the same once we start replaying them so we don't need many iterations to find a problem.

mariodirenzo commented 1 year ago

The logs are at /home/mariodr/htr/solverTests/RecycleBoundary/ on Sapling. They were generated over 3 time steps to ensure that the trace is replayed at least once. However, I've been running Legion Spy on them for 12 hours and it is still at 17% of the "Computing physical reachable" step...

mariodirenzo commented 1 year ago

After 20 hours, "Computing physical reachable" is still at 25%. Unfortunately, it is hard to use Legion Spy for calculations on many ranks.

elliottslaughter commented 1 year ago

Are you using PyPy to run Legion Spy? I've seen pretty dramatic speedups for some of the stuff we do.

mariodirenzo commented 1 year ago

Yes, I am

lightsighter commented 1 year ago

Can we verify a program from a smaller number of nodes first? Probably four would be a good number.

mariodirenzo commented 1 year ago

I've never seen the error on fewer than 16 ranks. Is it still useful to have the logs for a configuration that might never fail?

lightsighter commented 1 year ago

Let's at least try it and see if we find anything. Keep the bigger verification going in the meantime until it runs out of memory.

mariodirenzo commented 1 year ago

I've had a complete run of Legion Spy only for a single-node calculation with 2 tiles and three time steps. This very small calculation, which has never failed in the CI, passes the Legion Spy logical and physical analyses. If I increase the number of tiles or the number of ranks, Legion Spy either runs out of memory (it requires more than 256 GB) or takes forever (more than 4 days and it has not yet finished) to run the logical analysis. Considering that we are able to reproduce the problem only with 128 tiles, 16 ranks, and more than 10 time steps, I'd say that the logical and physical analyses are not viable options for debugging this problem.

lightsighter commented 1 year ago

(As I mentioned on Slack) there is a performance bug in Legion Spy with very large fan-out graphs, which can be addressed first, and then we can see if it finds anything.

lightsighter commented 1 year ago

@mariodirenzo Pull the most recent control replication, rebuild, regenerate your logs, and then try doing at least the logical analysis verification of both a small run and a bigger run that fails. The logical analysis verification algorithm should be considerably faster. I did have to change the logging of predicate operations (which you are using) in order to get Legion Spy's new logical verification algorithm to work, so you will need the latest control replication and newly generated log files.

mariodirenzo commented 1 year ago

The new version of Legion Spy is much faster. Thanks for implementing it.

The current verifications that I have run are:

The failing configuration has 8x8x2 tiles on 16 ranks (2x2x2 tiles per rank).

I guess that we need to find another way to discover what is going on.

cmelone commented 1 year ago

I built and executed HTR with the newest control_replication (DEBUG=1), and a new error message is thrown.

legion/runtime/legion/legion_utilities.h:1331: void Legion::Deserializer::end_context(): Assertion `sent_context == context_bytes' failed.
Thread 3 (Thread 0x7f2bd9fee840 (LWP 7395)):
#0  0x00007f2ee89ac9fd in nanosleep () from /lib64/libc.so.6
#1  0x00007f2ee89ac894 in sleep () from /lib64/libc.so.6
#2  0x00007f2eebc35de2 in Realm::realm_freeze (signal=6) at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/runtime_impl.cc:183
#3  <signal handler called>
#4  0x00007f2ee891d387 in raise () from /lib64/libc.so.6
#5  0x00007f2ee891ea78 in abort () from /lib64/libc.so.6
#6  0x00007f2ee89161a6 in __assert_fail_base () from /lib64/libc.so.6
#7  0x00007f2ee8916252 in __assert_fail () from /lib64/libc.so.6
#8  0x00007f2eeb04da31 in Legion::Deserializer::end_context (this=0x7f2bd9fece10) at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/legion/legion_utilities.h:1331
#9  0x00007f2eeb04d3cb in Legion::Internal::DerezCheck::~DerezCheck (this=0x7f2bd9fecd38, __in_chrg=<optimized out>)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/legion/legion_utilities.h:501
#10 0x00007f2eeb38e4f0 in Legion::Internal::CollectiveView::handle_distribute_broadcast (runtime=0x35d34a0, source=4, derez=...)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/legion/legion_views.cc:10398
#11 0x00007f2eeb67b3bc in Legion::Internal::Runtime::handle_collective_distribute_broadcast (this=0x35d34a0, derez=..., source=4)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/legion/runtime.cc:24500
#12 0x00007f2eeb64cc22 in Legion::Internal::VirtualChannel::handle_messages (this=0x7f2bf40017d0, num_messages=1, runtime=0x35d34a0, remote_address_space=4, args=0x7f2e35bd3760 "", arglen=232)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/legion/runtime.cc:12571
#13 0x00007f2eeb64bb6a in Legion::Internal::VirtualChannel::process_message (this=0x7f2bf40017d0, args=0x7f2e35bd3744, arglen=252, runtime=0x35d34a0, remote_address_space=4)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/legion/runtime.cc:11917
#14 0x00007f2eeb64e0dd in Legion::Internal::MessageManager::receive_message (this=0x7f2bf4000900, args=0x7f2e35bd3740, arglen=260)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/legion/runtime.cc:13595
#15 0x00007f2eeb67f72c in Legion::Internal::Runtime::process_message_task (this=0x35d34a0, args=0x7f2e35bd373c, arglen=264)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/legion/runtime.cc:26334
#16 0x00007f2eeb694c44 in Legion::Internal::Runtime::legion_runtime_task (args=0x7f2e35bd3730, arglen=268, userdata=0x35ec820, userlen=8, p=...)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/legion/runtime.cc:32064
#17 0x00007f2eebf472c2 in Realm::LocalTaskProcessor::execute_task (this=0x2588ae0, func_id=4, task_args=...) at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/proc_impl.cc:1129
#18 0x00007f2eebd84964 in Realm::Task::execute_on_processor (this=0x7f2e35bd34b0, p=...) at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/tasks.cc:302
#19 0x00007f2eebd886e8 in Realm::KernelThreadTaskScheduler::execute_task (this=0x2588dd0, task=0x7f2e35bd34b0) at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/tasks.cc:1366
#20 0x00007f2eebd8755f in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x2588dd0) at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/tasks.cc:1105
#21 0x00007f2eebd87b82 in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x2588dd0) at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/tasks.cc:1217
#22 0x00007f2eebd8eeba in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x2588dd0)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/threads.inl:97
#23 0x00007f2eebd60c57 in Realm::KernelThread::pthread_entry (data=0x7f2b1c0131f0) at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/threads.cc:781
#24 0x00007f2ee84caea5 in start_thread () from /lib64/libpthread.so.0
#25 0x00007f2ee89e5b0d in clone () from /lib64/libc.so.6

lightsighter commented 1 year ago

Pull and try again.

cmelone commented 1 year ago

Same error.

lightsighter commented 1 year ago

Pull and try again.

cmelone commented 1 year ago

It's no longer throwing an error at the same point as before, but it has been stuck at the second time step for the past 90 minutes. Will update if that changes

lightsighter commented 1 year ago

Run with -ll:force_kthreads and dump backtraces. If possible dump detailed Legion Spy logs.
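
Something along these lines on each stuck rank (a sketch; the launcher and process IDs are whatever your job actually shows):

# add -ll:force_kthreads to the usual command line so every task runs on a kernel thread visible to gdb,
# then on a node where the run is hung:
gdb -p <pid>
thread apply all bt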

cmelone commented 1 year ago

Here are the backtraces. Will run with spy.

bt_htr_050723.tar.gz

cmelone commented 1 year ago

Running with Legion Spy, got this error:

legion/runtime/legion/legion_trace.cc:3581: void Legion::Internal::TraceViewSet::antialias_collective_view(Legion::Internal::CollectiveView*, const FieldMask&, Legion::Internal::FieldMaskSet<Legion::Internal::InstanceView>&): Assertion `conditions.find(collective) == conditions.end()' failed.
(gdb) bt
#0  0x00007fbfd47909fd in nanosleep () from /lib64/libc.so.6
#1  0x00007fbfd4790894 in sleep () from /lib64/libc.so.6
#2  0x00007fbfd7a19e96 in Realm::realm_freeze (signal=6) at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/runtime_impl.cc:187
#3  <signal handler called>
#4  0x00007fbfd4701387 in raise () from /lib64/libc.so.6
#5  0x00007fbfd4702a78 in abort () from /lib64/libc.so.6
#6  0x00007fbfd46fa1a6 in __assert_fail_base () from /lib64/libc.so.6
#7  0x00007fbfd46fa252 in __assert_fail () from /lib64/libc.so.6
#8  0x00007fbfd704319d in Legion::Internal::TraceViewSet::antialias_collective_view (this=0x7fbce037ed50, collective=0x7fbccc1f3c90, mask=..., alternative_views=...)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/legion/legion_trace.cc:3581
#9  0x00007fbfd703c585 in Legion::Internal::TraceViewSet::invalidate_all_but (this=0x7fbce037ed50, except=0x7fbccc1f3c90, expr=0x7fbf1028cad0, mask=..., expr_refs_to_remove=0x0, view_refs_to_remove=0x0, antialiased=false)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/legion/legion_trace.cc:2535
#10 0x00007fbfd720b05a in Legion::Internal::EquivalenceSet::update_tracing_valid_views (this=0x7fbc5c08eea0, view=0x7fbccc1f3c90, expr=0x7fbf1028cad0, usage=..., user_mask=..., invalidates=true)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/legion/legion_analysis.cc:17131
#11 0x00007fbfd71c7929 in Legion::Internal::CopyFillAggregator::update_tracing_valid_views (this=0x7fbc9003eee0, tracing_eq=0x7fbc5c08eea0, src=0x7fbc5405a110, dst=0x7fbccc1f3c90, mask=..., expr=0x7fbf1028cad0, redop=1048587)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/legion/legion_analysis.cc:5168
#12 0x00007fbfd71c6f9d in Legion::Internal::CopyFillAggregator::record_reductions (this=0x7fbc9003eee0, dst_view=0x7fbccc1f3c90, dst_man=0x7fbc5404ba80, src_views=..., src_fidx=2, dst_fidx=2, tracing_eq=0x7fbc5c08eea0, across_helper=0x0)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/legion/legion_analysis.cc:5014
#13 0x00007fbfd71f9767 in Legion::Internal::EquivalenceSet::apply_reduction (this=0x7fbc5c08eea0, target=0x7fbccc1f3c90, target_manager=0x7fbc5404ba80, expr=0x7fbf1028cad0, expr_covers=true, reduction_mask=...,
    aggregator=@0x7fbf067f0928: 0x7fbc9003eee0, previous_guard=0x0, analysis=0x7fbc5405b560, track_events=false, trace_info=..., applied_exprs=0x7fbf067f04c0, across_helper=0x0)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/legion/legion_analysis.cc:14548
#14 0x00007fbfd71f8fcc in Legion::Internal::EquivalenceSet::apply_reductions (this=0x7fbc5c08eea0, target_instances=..., target_views=..., expr=0x7fbf1028cad0, expr_covers=true, reduction_mask=..., aggregator=@0x7fbf067f0928: 0x7fbc9003eee0,
    previous_guard=0x0, analysis=0x7fbc5405b560, track_events=false, trace_info=..., applied_exprs=0x7fbf067f04c0, across_helper=0x0) at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/legion/legion_analysis.cc:14449
#15 0x00007fbfd71f2f37 in Legion::Internal::EquivalenceSet::update_set_internal (this=0x7fbc5c08eea0, input_aggregator=@0x7fbf067f0928: 0x7fbc9003eee0, previous_guard=0x0, analysis=0x7fbc5405b560, usage=..., expr=0x7fbf1028cad0, expr_covers=true,
    user_mask=..., target_instances=..., target_views=..., source_views=..., trace_info=..., record_valid=true) at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/legion/legion_analysis.cc:13384
#16 0x00007fbfd71eb263 in Legion::Internal::EquivalenceSet::update_set (this=0x7fbc5c08eea0, analysis=..., expr=0x7fbf1028cad0, expr_covers=true, user_mask=..., deferral_events=..., applied_events=..., already_deferred=false)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/legion/legion_analysis.cc:11656
#17 0x00007fbfd71d2f5f in Legion::Internal::UpdateAnalysis::perform_analysis (this=0x7fbc5405b560, set=0x7fbc5c08eea0, expr=0x7fbf1028cad0, expr_covers=true, mask=..., deferral_events=..., applied_events=..., already_deferred=false)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/legion/legion_analysis.cc:7343
#18 0x00007fbfd71e4e50 in Legion::Internal::EquivalenceSet::analyze (this=0x7fbc5c08eea0, analysis=..., expr=0x7fbf1028cad0, expr_covers=true, traversal_mask=..., deferral_events=..., applied_events=..., already_deferred=false)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/legion/legion_analysis.cc:10642
#19 0x00007fbfd71ca6f0 in Legion::Internal::PhysicalAnalysis::analyze (this=0x7fbc5405b560, set=0x7fbc5c08eea0, mask=..., deferral_events=..., applied_events=..., precondition=..., already_deferred=false)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/legion/legion_analysis.cc:5672
#20 0x00007fbfd71cc790 in Legion::Internal::PhysicalAnalysis::handle_deferred_analysis (args=0x7fbc5404a0d0) at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/legion/legion_analysis.cc:6123
#21 0x00007fbfd74795e1 in Legion::Internal::Runtime::legion_runtime_task (args=0x7fbc5404a0d0, arglen=44, userdata=0x165d6c0, userlen=8, p=...) at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/legion/runtime.cc:32454
#22 0x00007fbfd7d2b412 in Realm::LocalTaskProcessor::execute_task (this=0x16e7f20, func_id=4, task_args=...) at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/proc_impl.cc:1129
#23 0x00007fbfd7b68ab4 in Realm::Task::execute_on_processor (this=0x7fbc54049f50, p=...) at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/tasks.cc:302
#24 0x00007fbfd7b6c838 in Realm::KernelThreadTaskScheduler::execute_task (this=0x16e8210, task=0x7fbc54049f50) at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/tasks.cc:1366
#25 0x00007fbfd7b6b6af in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x16e8210) at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/tasks.cc:1105
#26 0x00007fbfd7b6bcd2 in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x16e8210) at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/tasks.cc:1217
#27 0x00007fbfd7b7300a in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x16e8210) at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/threads.inl:97
#28 0x00007fbfd7b44da7 in Realm::KernelThread::pthread_entry (data=0x7fbcc818cbc0) at /home/hpcc/gitlabci/psaap-ci/artifacts/4044111770/legion/runtime/realm/threads.cc:781
#29 0x00007fbfd42aeea5 in start_thread () from /lib64/libpthread.so.0
#30 0x00007fbfd47c9b0d in clone () from /lib64/libc.so.6

I can supply the logs if they would be helpful.

lightsighter commented 1 year ago

I will need a minimal reproducer.

lightsighter commented 1 year ago

Pull and try again to see if the hang has gone away. I still would like a minimal reproducer for the crash with Legion Spy.

cmelone commented 1 year ago

Can confirm that the hang has disappeared and HTR now throws the error I posted in my previous thread. I'll work with @mariodirenzo re: minimal reproducer.