I will need a reproducer that I can run and rebuild myself (next time provide those from the beginning). How did you modify the circuit simulation?
@rupanshusoi Any update on this? I'm making fixing this a requirement for merging control replication into master so I need a reproducer soon.
It's available here: http://sapling2.stanford.edu/~rupanshu/bug1618/
You can do sbatch --nodes 2 sbatch_circuit.sh to see the error. You might have to manually change the location of regent.py depending on your setup.
You can see the modifications in this version by grepping for wrapper. This task creates copies of every region that exists in the top-level task, then executes a fraction of the main inner loop once or twice (depending on a compile-time parameter). If it executes the loop twice, it restores every region from its copy beforehand. Let me know if you need more information.
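For concreteness, a minimal Regent sketch of that structure is below. The field space, region, and parameter names are illustrative placeholders, not the actual circuit code, and only one region is shown where the real wrapper copies every region from the top-level task.

```regent
-- Minimal sketch of the wrapper structure (illustrative names, not the real circuit code).
fspace node { voltage : double, charge : double }

task wrapper(rn : region(ispace(int1d), node), num_loops : int)
where reads writes(rn) do
  -- back up the region so it can be restored before a second pass
  var rn_copy = region(rn.ispace, node)
  copy(rn.{voltage, charge}, rn_copy.{voltage, charge})
  for i = 0, num_loops do
    if i > 0 then
      -- restore the region from its copy before re-running the loop
      copy(rn_copy.{voltage, charge}, rn.{voltage, charge})
    end
    -- ... a fraction of the main inner loop runs here ...
  end
end
```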
Pull the latest control replication branch and try again.
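For reference, a typical update-and-rebuild sequence (a sketch assuming a standard Legion checkout with Regent built via language/install.py; adjust paths and install flags to match your own configuration):

```bash
# Update the control_replication branch and rebuild Regent.
# Paths and flags are illustrative; reuse whatever options your build already uses.
cd legion
git checkout control_replication
git pull
cd language
./install.py --gasnet --debug
```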
Fixed, thanks!
I am reopening this issue because I'm hitting another assertion failure with this application:
circuit_ON_10_0: /global/u2/r/rsoi/legion/runtime/legion/legion_context.cc:22900: virtual Legion::Internal::RtEvent Legion::Internal::RemoteContext::compute_equivalence_sets(unsigned int, const std::vector<Legion::Internal::EqSetTracker*>&, const std::vector<unsigned int>&, Legion::AddressSpaceID, Legion::Internal::IndexSpaceExpression*, const Legion::Internal::FieldMask&): Assertion `targets.size() == 1' failed.
The only change is that the wrapper task is now control replicated as well. I am unable to reproduce this error on Sapling.
Full backtrace. This backtrace is weird because it does not show the failing assertion; I'm not sure why that's the case or how to fix it.
You captured the backtraces on the wrong process. Please capture them from the right process where the assertion actually occurred. Make sure you get line numbers too.
I'll also note that if you're hitting this assertion you're doing a very very bad job at mapping if you're also using control replication because it means you're relying on remote mapping when you should be picking better sharding functors so you don't ever need to use remote mapping.
Get a proper backtrace with line numbers, report it here, and then pull the most recent control replication and confirm whether it is fixed or not.
With the latest control replication, the application seems to hang toward the end.
That backtrace makes sense for the error from yesterday.
I need a reproducer for the hang as soon as possible.
The error reproduces on Perlmutter, but not on Sapling. I can make you a reproducer on Perlmutter, but I see you don't currently have an account in our group allocation. What would you suggest?
A reproducer on Perlmutter is not going to work even if I had an account because I can't attach gdb to processes that you create on that machine because I will not have sudo access.
Run with -ll:force_kthreads -lg:inorder -lg:safe_ctrlrepl 1 and get backtraces of all threads from every process. Make sure they are not changing. If the hang does not reproduce with -lg:inorder then you can remove it, but you have to have -ll:force_kthreads and -lg:safe_ctrlrepl 1.
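One way to capture those backtraces (a sketch assuming the process name contains "circuit"; run it on every node of the job, then repeat a few minutes later and diff the outputs to confirm the threads are not changing):

```bash
# Dump a backtrace of every thread in every matching process on this node.
# The process-name pattern and output file names are illustrative.
for pid in $(pgrep -f circuit); do
  gdb -p "$pid" -batch -ex "set pagination off" -ex "thread apply all bt" > "bt.$(hostname).$pid.txt"
done
```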
The hang does reproduce with -lg:inorder. I captured two sets of backtraces a few minutes apart. They are mostly the same, so you will have to judge whether this is actually a hang or something else.
First set: bt1.txt, bt2.txt. Second set: bt1-new.txt, bt2-new.txt.
Pull and try again. If it continues to hang please get new backtraces with the same options as before.
Now it hits an assertion instead of hanging. Full backtrace. This run was with all three flags you mentioned last time.
Pull and try again.
Pull and try again.
At this point I have to have a reproducer that I can look at. I promise you this will happen on sapling if you inject enough noise into the execution. If you don't know how to do that then just make the reproducer on sapling and tell me how to run it.
It's available here: http://sapling2.stanford.edu/~rupanshu/bug1618/
You can do sbatch --nodes 2 sbatch_circuit.sh to see the error. You might have to manually change the location of regent.py depending on your setup.
You can see the modifications in this version by grepping for wrapper. This task creates copies of every region that exists in the top-level task, then executes a fraction of the main inner loop once or twice (depending on a compile-time parameter). If it executes the loop twice, it restores every region from its copy beforehand. Let me know if you need more information.
I've updated this directory with the new code.
Which version of GASNetEX are you using on both sapling and perlmutter?
Actually, I think it is the same on both: GASNet-2023.3.0. My legion/language/gasnet has a sub-directory called GASNet-2023.3.0 in both installations. Does this confirm the GASNet version, or is there another way to check?
Keep in mind that even with the same GASNet version we're still talking about different networks, so timing differences are always possible.
We already figured this one out. It's not GASNet.
@rupanshusoi Please pull the most recent control replication branch and confirm that the hang is fixed (test it without the temporary fix that I gave you). If it works then you can close the issue.
I'm still seeing the same hang with the latest control replication. I ran with all three flags like last time.
Pull the latest control replication and try again.
Fixed.
I'm running a modified version of Regent Circuit on the latest control_replication (98f6f2) on Sapling. 1-node runs work fine, but on 2 nodes I get the failure shown in the backtrace below.
This version of Circuit is modified such that the main inner-loop is outlined into a new wrapper task that is launched by the top-level task. The wrapper task is not control-replicated in this run. The rest of the code is basically the same.
Full backtrace: