Closed: syamajala closed 1 year ago
What is the did of the DistributedCollectable in frame 8 of thread 11? Also, what commit of shardrefine are you on?
To be very clear: this is not a hang, it is a livelock, and your stack traces should continue to change.
The commit is:
commit c4ff5e0d1bb1e01b1b481bb934b4a8b15d36513e (HEAD -> shardrefine, origin/shardrefine)
Author: Mike Bauer <mike@lightsighter.org>
Date: Sat Aug 19 18:07:19 2023 -0700
legion: fixes for logical analysis of refinements
Running it again, the stack traces do appear to be changing. I don't see any threads with a DistributedCollectable when I run it again.
I can't seem to run S3D on sapling right now; I see processes dying at startup every time I run, and then the node goes into a drained state in slurm and I have to reboot to run again. This problem has been intermittent on sapling.
I was able to get it to run on sapling. It only starts to appear at 8 ranks.
There are some processes here on c0001: 11846, 11847, 11848, 11849, 11850, 11851, 11852, 11853.
This is not hanging the same way that the backtraces above are. What is the output of running with -level shutdown=2?
It looks like rank 0 shuts down but the others don't? Here are the last 20 lines from each log:
==> run_0.log <==
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[0 - 7f2737b8cc40] 19.244778 {4}{runtime}: [warning 1071] LEGION WARNING: Region requirement 4 of operation Sum4IntegrateTaskFused (UID 21008, provenance: launch.rg:143) in parent task main (UID 24) is using uninitialized data for field(s) 123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139 of logical region (272,3,3) (from file /scratch2/seshu/legion_s3d_subranks/legion/runtime/legion/legion_ops.cc:1822)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[0 - 7f2737b8cc40] 78.553474 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 7f2737b8cc40] 78.557566 {2}{shutdown}: SHUTDOWN PHASE 1 SUCCESS!
[0 - 7f2737b8cc40] 78.557580 {2}{shutdown}: Received notification on node 0 for phase 2
[0 - 7f2737b8cc40] 78.558263 {2}{shutdown}: SHUTDOWN PHASE 2 SUCCESS!
[0 - 7f2737b8cc40] 78.605108 {2}{shutdown}: Received notification on node 0 for phase 3
[0 - 7f2737b8cc40] 79.058921 {2}{shutdown}: FAILED SHUTDOWN PHASE 3! Trying again...
[0 - 7f2737b8cc40] 79.309960 {2}{shutdown}: Received notification on node 0 for phase 3
[0 - 7f2737b8cc40] 79.318945 {2}{shutdown}: FAILED SHUTDOWN PHASE 3! Trying again...
[0 - 7f2737b8cc40] 79.319043 {2}{shutdown}: Received notification on node 0 for phase 3
[0 - 7f2737b8cc40] 79.319764 {2}{shutdown}: SHUTDOWN PHASE 3 SUCCESS!
[0 - 7f2737b8cc40] 79.319776 {2}{shutdown}: Received notification on node 0 for phase 4
[0 - 7f2737b8cc40] 79.321480 {2}{shutdown}: SHUTDOWN PHASE 4 SUCCESS!
[0 - 7f2737b8cc40] 79.321491 {2}{shutdown}: SHUTDOWN SUCCEEDED!
==> run_1.log <==
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[1 - 7fc1ea841c40] 21.735396 {4}{runtime}: [warning 1071] LEGION WARNING: Region requirement 4 of operation Sum4IntegrateTaskFused (UID 22793, provenance: launch.rg:143) in parent task main (UID 1) is using uninitialized data for field(s) 140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156 of logical region (265,3,3) (from file /scratch2/seshu/legion_s3d_subranks/legion/runtime/legion/legion_ops.cc:1822)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[1 - 7fc1ea841c40] 78.554181 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 7fc1ea841c40] 78.556380 {2}{shutdown}: Received notification on node 1 for phase 2
[1 - 7fc1ea841c40] 78.603794 {2}{shutdown}: Received notification on node 1 for phase 3
[1 - 7fc1ea841c40] 78.665194 {2}{shutdown}: Pending message on node 1
[1 - 7fc1ea841c40] 78.666287 {2}{shutdown}: Pending message on node 1
[1 - 7fc1ea841c40] 78.667397 {2}{shutdown}: Pending message on node 1
[1 - 7fc1ea841c40] 78.668509 {2}{shutdown}: Pending message on node 1
[1 - 7fc1ea841c40] 78.669613 {2}{shutdown}: Pending message on node 1
[1 - 7fc1ea841c40] 78.670706 {2}{shutdown}: Pending message on node 1
[1 - 7fc1ea841c40] 78.671813 {2}{shutdown}: Pending message on node 1
[1 - 7fc1ea841c40] 78.672910 {2}{shutdown}: Pending message on node 1
[1 - 7fc1ea841c40] 79.315712 {2}{shutdown}: Received notification on node 1 for phase 3
[1 - 7fc1ea841c40] 79.317701 {2}{shutdown}: Received notification on node 1 for phase 3
[1 - 7fc1ea841c40] 79.318430 {2}{shutdown}: Received notification on node 1 for phase 4
==> run_2.log <==
[2 - 7f3fda49ac40] 19.264086 {4}{runtime}: [warning 1071] LEGION WARNING: Region requirement 4 of operation Sum4IntegrateTaskFused (UID 20954, provenance: launch.rg:143) in parent task main (UID 2) is using uninitialized data for field(s) 123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139 of logical region (226,3,3) (from file /scratch2/seshu/legion_s3d_subranks/legion/runtime/legion/legion_ops.cc:1822)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[2 - 7f3fda49ac40] 78.553070 {2}{shutdown}: Received notification on node 2 for phase 1
[2 - 7f3fda49ac40] 78.555253 {2}{shutdown}: Received notification on node 2 for phase 2
[2 - 7f3fda49ac40] 78.602783 {2}{shutdown}: Received notification on node 2 for phase 3
[2 - 7f3fda49ac40] 78.646867 {2}{shutdown}: Pending message on node 2
[2 - 7f3fda49ac40] 78.647954 {2}{shutdown}: Pending message on node 2
[2 - 7f3fda49ac40] 78.649048 {2}{shutdown}: Pending message on node 2
[2 - 7f3fda49ac40] 78.650144 {2}{shutdown}: Pending message on node 2
[2 - 7f3fda49ac40] 78.651251 {2}{shutdown}: Pending message on node 2
[2 - 7f3fda49ac40] 78.652348 {2}{shutdown}: Pending message on node 2
[2 - 7f3fda49ac40] 78.653438 {2}{shutdown}: Pending message on node 2
[2 - 7f3fda49ac40] 78.654536 {2}{shutdown}: Pending message on node 2
[2 - 7f3fda49ac40] 79.307822 {2}{shutdown}: Received notification on node 2 for phase 3
[2 - 7f3fda49ac40] 79.310020 {2}{shutdown}: Pending message on node 2
[2 - 7f3fda49ac40] 79.316688 {2}{shutdown}: Received notification on node 2 for phase 3
[2 - 7f3fda49ac40] 79.317413 {2}{shutdown}: Received notification on node 2 for phase 4
==> run_3.log <==
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[3 - 7f010f3dfc40] 19.262353 {4}{runtime}: [warning 1071] LEGION WARNING: Region requirement 4 of operation Sum4IntegrateTaskFused (UID 20963, provenance: launch.rg:143) in parent task main (UID 3) is using uninitialized data for field(s) 123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139 of logical region (227,3,3) (from file /scratch2/seshu/legion_s3d_subranks/legion/runtime/legion/legion_ops.cc:1822)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[3 - 7f010f3dfc40] 78.555201 {2}{shutdown}: Received notification on node 3 for phase 1
[3 - 7f010f3dfc40] 78.557366 {2}{shutdown}: Received notification on node 3 for phase 2
[3 - 7f010f3dfc40] 78.604905 {2}{shutdown}: Received notification on node 3 for phase 3
[3 - 7f010f3dfc40] 78.651860 {2}{shutdown}: Pending message on node 3
[3 - 7f010f3dfc40] 78.652946 {2}{shutdown}: Pending message on node 3
[3 - 7f010f3dfc40] 78.654041 {2}{shutdown}: Pending message on node 3
[3 - 7f010f3dfc40] 78.655147 {2}{shutdown}: Pending message on node 3
[3 - 7f010f3dfc40] 78.656237 {2}{shutdown}: Pending message on node 3
[3 - 7f010f3dfc40] 78.657323 {2}{shutdown}: Pending message on node 3
[3 - 7f010f3dfc40] 78.658416 {2}{shutdown}: Pending message on node 3
[3 - 7f010f3dfc40] 78.659510 {2}{shutdown}: Pending message on node 3
[3 - 7f010f3dfc40] 79.309802 {2}{shutdown}: Received notification on node 3 for phase 3
[3 - 7f010f3dfc40] 79.318817 {2}{shutdown}: Received notification on node 3 for phase 3
[3 - 7f010f3dfc40] 79.319548 {2}{shutdown}: Received notification on node 3 for phase 4
==> run_4.log <==
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[4 - 7f62a7988c40] 19.263172 {4}{runtime}: [warning 1071] LEGION WARNING: Region requirement 4 of operation Sum4IntegrateTaskFused (UID 20900, provenance: launch.rg:143) in parent task main (UID 4) is using uninitialized data for field(s) 123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139 of logical region (228,3,3) (from file /scratch2/seshu/legion_s3d_subranks/legion/runtime/legion/legion_ops.cc:1822)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[4 - 7f62a7988c40] 78.554426 {2}{shutdown}: Received notification on node 4 for phase 1
[4 - 7f62a7988c40] 78.556610 {2}{shutdown}: Received notification on node 4 for phase 2
[4 - 7f62a7988c40] 78.604149 {2}{shutdown}: Received notification on node 4 for phase 3
[4 - 7f62a7988c40] 78.648378 {2}{shutdown}: Pending message on node 4
[4 - 7f62a7988c40] 78.649463 {2}{shutdown}: Pending message on node 4
[4 - 7f62a7988c40] 78.650560 {2}{shutdown}: Pending message on node 4
[4 - 7f62a7988c40] 78.651630 {2}{shutdown}: Pending message on node 4
[4 - 7f62a7988c40] 78.652726 {2}{shutdown}: Pending message on node 4
[4 - 7f62a7988c40] 78.653818 {2}{shutdown}: Pending message on node 4
[4 - 7f62a7988c40] 78.654918 {2}{shutdown}: Pending message on node 4
[4 - 7f62a7988c40] 78.656023 {2}{shutdown}: Pending message on node 4
[4 - 7f62a7988c40] 79.309021 {2}{shutdown}: Received notification on node 4 for phase 3
[4 - 7f62a7988c40] 79.318052 {2}{shutdown}: Received notification on node 4 for phase 3
[4 - 7f62a7988c40] 79.318783 {2}{shutdown}: Received notification on node 4 for phase 4
==> run_5.log <==
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[5 - 7f349e602c40] 19.255487 {4}{runtime}: [warning 1071] LEGION WARNING: Region requirement 4 of operation Sum4IntegrateTaskFused (UID 20933, provenance: launch.rg:143) in parent task main (UID 5) is using uninitialized data for field(s) 123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139 of logical region (229,3,3) (from file /scratch2/seshu/legion_s3d_subranks/legion/runtime/legion/legion_ops.cc:1822)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[5 - 7f349e602c40] 78.554366 {2}{shutdown}: Received notification on node 5 for phase 1
[5 - 7f349e602c40] 78.556555 {2}{shutdown}: Received notification on node 5 for phase 2
[5 - 7f349e602c40] 78.604089 {2}{shutdown}: Received notification on node 5 for phase 3
[5 - 7f349e602c40] 78.648025 {2}{shutdown}: Pending message on node 5
[5 - 7f349e602c40] 78.649135 {2}{shutdown}: Pending message on node 5
[5 - 7f349e602c40] 78.650239 {2}{shutdown}: Pending message on node 5
[5 - 7f349e602c40] 78.651332 {2}{shutdown}: Pending message on node 5
[5 - 7f349e602c40] 78.652416 {2}{shutdown}: Pending message on node 5
[5 - 7f349e602c40] 78.653509 {2}{shutdown}: Pending message on node 5
[5 - 7f349e602c40] 78.654599 {2}{shutdown}: Pending message on node 5
[5 - 7f349e602c40] 78.655686 {2}{shutdown}: Pending message on node 5
[5 - 7f349e602c40] 79.308970 {2}{shutdown}: Received notification on node 5 for phase 3
[5 - 7f349e602c40] 79.317994 {2}{shutdown}: Received notification on node 5 for phase 3
[5 - 7f349e602c40] 79.318728 {2}{shutdown}: Received notification on node 5 for phase 4
==> run_6.log <==
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[6 - 7fe783654c40] 19.282687 {4}{runtime}: [warning 1071] LEGION WARNING: Region requirement 4 of operation Sum4IntegrateTaskFused (UID 20966, provenance: launch.rg:143) in parent task main (UID 6) is using uninitialized data for field(s) 123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139 of logical region (230,3,3) (from file /scratch2/seshu/legion_s3d_subranks/legion/runtime/legion/legion_ops.cc:1822)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[6 - 7fe783654c40] 78.554743 {2}{shutdown}: Received notification on node 6 for phase 1
[6 - 7fe783654c40] 78.556919 {2}{shutdown}: Received notification on node 6 for phase 2
[6 - 7fe783654c40] 78.604456 {2}{shutdown}: Received notification on node 6 for phase 3
[6 - 7fe783654c40] 78.648362 {2}{shutdown}: Pending message on node 6
[6 - 7fe783654c40] 78.649463 {2}{shutdown}: Pending message on node 6
[6 - 7fe783654c40] 78.650548 {2}{shutdown}: Pending message on node 6
[6 - 7fe783654c40] 78.651650 {2}{shutdown}: Pending message on node 6
[6 - 7fe783654c40] 78.652738 {2}{shutdown}: Pending message on node 6
[6 - 7fe783654c40] 78.653835 {2}{shutdown}: Pending message on node 6
[6 - 7fe783654c40] 78.654937 {2}{shutdown}: Pending message on node 6
[6 - 7fe783654c40] 78.656037 {2}{shutdown}: Pending message on node 6
[6 - 7fe783654c40] 79.309339 {2}{shutdown}: Received notification on node 6 for phase 3
[6 - 7fe783654c40] 79.318367 {2}{shutdown}: Received notification on node 6 for phase 3
[6 - 7fe783654c40] 79.319098 {2}{shutdown}: Received notification on node 6 for phase 4
==> run_7.log <==
[7 - 7f0d18789c40] 19.269394 {4}{runtime}: [warning 1071] LEGION WARNING: Region requirement 4 of operation Sum4IntegrateTaskFused (UID 20943, provenance: launch.rg:143) in parent task main (UID 7) is using uninitialized data for field(s) 123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139 of logical region (231,3,3) (from file /scratch2/seshu/legion_s3d_subranks/legion/runtime/legion/legion_ops.cc:1822)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[7 - 7f0d18789c40] 78.553494 {2}{shutdown}: Received notification on node 7 for phase 1
[7 - 7f0d18789c40] 78.555683 {2}{shutdown}: Received notification on node 7 for phase 2
[7 - 7f0d18789c40] 78.603223 {2}{shutdown}: Received notification on node 7 for phase 3
[7 - 7f0d18789c40] 78.647859 {2}{shutdown}: Pending message on node 7
[7 - 7f0d18789c40] 78.648946 {2}{shutdown}: Pending message on node 7
[7 - 7f0d18789c40] 78.650048 {2}{shutdown}: Pending message on node 7
[7 - 7f0d18789c40] 78.651143 {2}{shutdown}: Pending message on node 7
[7 - 7f0d18789c40] 78.652244 {2}{shutdown}: Pending message on node 7
[7 - 7f0d18789c40] 78.653349 {2}{shutdown}: Pending message on node 7
[7 - 7f0d18789c40] 78.654432 {2}{shutdown}: Pending message on node 7
[7 - 7f0d18789c40] 78.655527 {2}{shutdown}: Pending message on node 7
[7 - 7f0d18789c40] 79.308241 {2}{shutdown}: Received notification on node 7 for phase 3
[7 - 7f0d18789c40] 79.309373 {2}{shutdown}: Pending message on node 7
[7 - 7f0d18789c40] 79.317121 {2}{shutdown}: Received notification on node 7 for phase 3
[7 - 7f0d18789c40] 79.317851 {2}{shutdown}: Received notification on node 7 for phase 4
FWIW, this is really strange: it looks like the shutdown process finished in Legion and something is just not shutting down afterwards, but we've at least called Realm shutdown at this point. This is definitely very different from the other shutdown "hang" that is referenced at the beginning of the issue.
Do we have backtraces for this new form of hang?
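For intuition about the repeated "FAILED SHUTDOWN PHASE n! Trying again..." lines in the logs above, here is a toy sketch (a hypothetical simplification, not Legion's actual protocol) of a phased shutdown in which any node with a pending message fails the current phase and forces a retry:

```python
# Toy model of a retrying, phased shutdown. A phase succeeds only when no
# node reports a pending message; otherwise the phase is retried. If some
# node keeps generating new messages, the retries never end, which is a
# livelock rather than a deadlock: every thread stays busy the whole time.
def phased_shutdown(num_phases, num_nodes, pending):
    """pending(node, attempt) -> True if that node has messages in flight."""
    log, attempts, phase = [], 0, 1
    while phase <= num_phases:
        attempts += 1
        if any(pending(node, attempts) for node in range(num_nodes)):
            log.append(f"FAILED SHUTDOWN PHASE {phase}! Trying again...")
        else:
            log.append(f"SHUTDOWN PHASE {phase} SUCCESS!")
            phase += 1
    return log

# Node 2 has messages in flight during the first two attempts:
print(phased_shutdown(2, 4, lambda node, attempt: node == 2 and attempt <= 2))
```

In this toy model the retries eventually succeed once the pending messages drain; the livelock case is when they never do.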
It could be that we are seeing two different issues, sapling vs blaze. The original stack traces above were from blaze and everything since then has been on sapling.
@lightsighter to run it yourself do:
salloc -N 1 -p cpu --exclusive
cd /scratch2/seshu/legion_s3d_subranks/Ammonia_Cases
./ammonia_job.sh
I will try -level shutdown=2 on blaze and see what that looks like.
On blaze I'm seeing a lot of stuff like this.
run_0.log:
[0 - 15550859ec80] 60.375449 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085b6c80] 60.428166 {2}{shutdown}: FAILED SHUTDOWN PHASE 1! Trying again...
[0 - 1555085b6c80] 60.428358 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085b6c80] 60.446234 {2}{shutdown}: SHUTDOWN PHASE 1 SUCCESS!
[0 - 1555085b6c80] 60.446248 {2}{shutdown}: Received notification on node 0 for phase 2
[0 - 1555085b6c80] 60.452600 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80] 60.452736 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80] 60.452792 {2}{shutdown}: FAILED SHUTDOWN PHASE 2! Trying again...
[0 - 1555085aac80] 60.452821 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085b6c80] 60.470200 {2}{shutdown}: FAILED SHUTDOWN PHASE 1! Trying again...
[0 - 1555085b6c80] 60.470388 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085aac80] 60.486770 {2}{shutdown}: SHUTDOWN PHASE 1 SUCCESS!
[0 - 1555085aac80] 60.486781 {2}{shutdown}: Received notification on node 0 for phase 2
[0 - 1555085b6c80] 60.494913 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80] 60.495045 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80] 60.495094 {2}{shutdown}: FAILED SHUTDOWN PHASE 2! Trying again...
[0 - 1555085aac80] 60.495120 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085aac80] 60.508011 {2}{shutdown}: FAILED SHUTDOWN PHASE 1! Trying again...
[0 - 1555085b6c80] 60.508106 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085b6c80] 60.522047 {2}{shutdown}: SHUTDOWN PHASE 1 SUCCESS!
[0 - 1555085b6c80] 60.522057 {2}{shutdown}: Received notification on node 0 for phase 2
[0 - 1555085b6c80] 60.528060 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80] 60.529228 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80] 60.529278 {2}{shutdown}: FAILED SHUTDOWN PHASE 2! Trying again...
[0 - 15550859ec80] 60.529305 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085b6c80] 60.539523 {2}{shutdown}: SHUTDOWN PHASE 1 SUCCESS!
[0 - 1555085b6c80] 60.539537 {2}{shutdown}: Received notification on node 0 for phase 2
[0 - 1555085b6c80] 60.545758 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80] 60.545877 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80] 60.545925 {2}{shutdown}: FAILED SHUTDOWN PHASE 2! Trying again...
[0 - 15550859ec80] 60.545943 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085b6c80] 60.560360 {2}{shutdown}: SHUTDOWN PHASE 1 SUCCESS!
[0 - 1555085b6c80] 60.560370 {2}{shutdown}: Received notification on node 0 for phase 2
[0 - 15550859ec80] 60.565228 {2}{shutdown}: Outstanding message on node 0
[0 - 15550859ec80] 60.566399 {2}{shutdown}: Outstanding message on node 0
[0 - 15550859ec80] 60.566448 {2}{shutdown}: FAILED SHUTDOWN PHASE 2! Trying again...
[0 - 1555085b6c80] 60.566473 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085b6c80] 60.580877 {2}{shutdown}: SHUTDOWN PHASE 1 SUCCESS!
[0 - 1555085b6c80] 60.580888 {2}{shutdown}: Received notification on node 0 for phase 2
[0 - 1555085b6c80] 60.585870 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80] 60.585983 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80] 60.586031 {2}{shutdown}: FAILED SHUTDOWN PHASE 2! Trying again...
[0 - 1555085aac80] 60.586055 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085b6c80] 60.601841 {2}{shutdown}: FAILED SHUTDOWN PHASE 1! Trying again...
[0 - 1555085aac80] 60.601921 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085b6c80] 60.616143 {2}{shutdown}: SHUTDOWN PHASE 1 SUCCESS!
[0 - 1555085b6c80] 60.616153 {2}{shutdown}: Received notification on node 0 for phase 2
[0 - 15550859ec80] 60.621079 {2}{shutdown}: Outstanding message on node 0
[0 - 15550859ec80] 60.621206 {2}{shutdown}: Outstanding message on node 0
[0 - 15550859ec80] 60.621255 {2}{shutdown}: FAILED SHUTDOWN PHASE 2! Trying again...
[0 - 1555085aac80] 60.621275 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085aac80] 60.635582 {2}{shutdown}: SHUTDOWN PHASE 1 SUCCESS!
[0 - 1555085aac80] 60.635593 {2}{shutdown}: Received notification on node 0 for phase 2
[0 - 1555085b6c80] 60.641514 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80] 60.641642 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80] 60.641692 {2}{shutdown}: FAILED SHUTDOWN PHASE 2! Trying again...
...
run_1.log:
[1 - 15550859ec80] 60.401550 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085b6c80] 60.434349 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085aac80] 60.446780 {2}{shutdown}: Received notification on node 1 for phase 2
[1 - 1555085b6c80] 60.450632 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80] 60.450682 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80] 60.451850 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80] 60.455888 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085b6c80] 60.473159 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085b6c80] 60.487330 {2}{shutdown}: Received notification on node 1 for phase 2
[1 - 15550859ec80] 60.491942 {2}{shutdown}: Outstanding message on node 1
[1 - 15550859ec80] 60.493038 {2}{shutdown}: Outstanding message on node 1
[1 - 15550859ec80] 60.494186 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085aac80] 60.497934 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 15550859ec80] 60.510917 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085b6c80] 60.522602 {2}{shutdown}: Received notification on node 1 for phase 2
[1 - 1555085b6c80] 60.527231 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80] 60.528330 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80] 60.528419 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80] 60.531905 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085b6c80] 60.540241 {2}{shutdown}: Received notification on node 1 for phase 2
[1 - 15550859ec80] 60.544910 {2}{shutdown}: Outstanding message on node 1
[1 - 15550859ec80] 60.546011 {2}{shutdown}: Outstanding message on node 1
[1 - 15550859ec80] 60.546103 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80] 60.548511 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085aac80] 60.560907 {2}{shutdown}: Received notification on node 1 for phase 2
[1 - 1555085b6c80] 60.564389 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80] 60.565490 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80] 60.565585 {2}{shutdown}: Outstanding message on node 1
[1 - 15550859ec80] 60.569012 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085b6c80] 60.581421 {2}{shutdown}: Received notification on node 1 for phase 2
[1 - 1555085aac80] 60.585062 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085aac80] 60.586136 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085aac80] 60.586233 {2}{shutdown}: Outstanding message on node 1
[1 - 15550859ec80] 60.588593 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085b6c80] 60.604447 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085b6c80] 60.616687 {2}{shutdown}: Received notification on node 1 for phase 2
[1 - 1555085aac80] 60.620228 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085aac80] 60.621329 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085aac80] 60.621453 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80] 60.623816 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085aac80] 60.636127 {2}{shutdown}: Received notification on node 1 for phase 2
[1 - 1555085b6c80] 60.639635 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80] 60.640734 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80] 60.640816 {2}{shutdown}: Outstanding message on node 1
...
Do we have backtraces for this new form of hang?
I looked at a hanging run on sapling and there were no interesting backtraces. The main thread in each process was just blocked waiting on Realm::wait_for_shutdown. I'll try poking at it again.
On blaze I'm seeing a lot of stuff like this.
That is consistent with the backtraces at the beginning of this issue; those are the ones where we need to figure out what kind of distributed collectable is not being collected, using the instructions I gave above.
Heres what I see:
>>> where
#0 Legion::Internal::DistributedCollectable::check_for_downgrade (this=0x154b0684d420, owner=12) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/garbage_collection.cc:1008
#1 0x000015554b14ce4f in Legion::Internal::DistributedCollectable::process_downgrade_request (this=0x154b0684d420, owner=12, to_check=Legion::Internal::DistributedCollectable::GLOBAL_REF_STATE) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/garbage_collection.cc:1099
#2 0x000015554b14cd26 in Legion::Internal::DistributedCollectable::handle_downgrade_request (runtime=0xce14cb0, derez=..., source=12) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/garbage_collection.cc:1077
#3 0x000015554b8b373b in Legion::Internal::Runtime::handle_did_downgrade_request (this=0xce14cb0, derez=..., source=12) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:24721
#4 0x000015554b8848ac in Legion::Internal::VirtualChannel::handle_messages (this=0x154b1aa612e0, num_messages=1, runtime=0xce14cb0, remote_address_space=12, args=0x154aa18e46e0 "", arglen=32) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:12285
#5 0x000015554b883a18 in Legion::Internal::VirtualChannel::process_message (this=0x154b1aa612e0, args=0x154aa18e46c4, arglen=52, runtime=0xce14cb0, remote_address_space=12) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:11746
#6 0x000015554b8860be in Legion::Internal::MessageManager::receive_message (this=0x154b1a96d300, args=0x154aa18e46c0, arglen=60) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:13492
#7 0x000015554b8b7ab0 in Legion::Internal::Runtime::process_message_task (this=0xce14cb0, args=0x154aa18e46bc, arglen=64) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:26564
#8 0x000015554b8cd49b in Legion::Internal::Runtime::legion_runtime_task (args=0x154aa18e46b0, arglen=68, userdata=0xce2e710, userlen=8, p=...) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:32361
#9 0x0000155547a9d26c in Realm::LocalTaskProcessor::execute_task (this=0xd24a390, func_id=4, task_args=...) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/proc_impl.cc:1175
#10 0x0000155547b11f9a in Realm::Task::execute_on_processor (this=0x154aa18e4190, p=...) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/tasks.cc:326
#11 0x0000155547b16cbe in Realm::UserThreadTaskScheduler::execute_task (this=0x4fe3e50, task=0x154aa18e4190) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/tasks.cc:1687
#12 0x0000155547b14d45 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x4fe3e50) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/tasks.cc:1160
#13 0x0000155547b1c736 in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop> (obj=0x4fe3e50) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/threads.inl:97
#14 0x0000155547b29fdd in Realm::UserThread::uthread_entry () at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/threads.cc:1355
#15 0x00001555528722e0 in ?? () from /lib64/libc.so.6
#16 0x0000000000000000 in ?? ()
>>> p did
$3 = 216172782113786540
>>> where
#0 Legion::Internal::DistributedCollectable::check_for_downgrade (this=0x154b067f8db0, owner=4) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/garbage_collection.cc:1008
#1 0x000015554b14ce4f in Legion::Internal::DistributedCollectable::process_downgrade_request (this=0x154b067f8db0, owner=4, to_check=Legion::Internal::DistributedCollectable::GLOBAL_REF_STATE) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/garbage_collection.cc:1099
#2 0x000015554b14cd26 in Legion::Internal::DistributedCollectable::handle_downgrade_request (runtime=0xce14cb0, derez=..., source=4) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/garbage_collection.cc:1077
#3 0x000015554b8b373b in Legion::Internal::Runtime::handle_did_downgrade_request (this=0xce14cb0, derez=..., source=4) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:24721
#4 0x000015554b8848ac in Legion::Internal::VirtualChannel::handle_messages (this=0x154b1a7f7290, num_messages=1, runtime=0xce14cb0, remote_address_space=4, args=0x154a9ddd5e90 "", arglen=32) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:12285
#5 0x000015554b883a18 in Legion::Internal::VirtualChannel::process_message (this=0x154b1a7f7290, args=0x154a9ddd5e74, arglen=52, runtime=0xce14cb0, remote_address_space=4) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:11746
#6 0x000015554b8860be in Legion::Internal::MessageManager::receive_message (this=0x154b1827c9f0, args=0x154a9ddd5e70, arglen=60) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:13492
#7 0x000015554b8b7ab0 in Legion::Internal::Runtime::process_message_task (this=0xce14cb0, args=0x154a9ddd5e6c, arglen=64) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:26564
#8 0x000015554b8cd49b in Legion::Internal::Runtime::legion_runtime_task (args=0x154a9ddd5e60, arglen=68, userdata=0xce2e490, userlen=8, p=...) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:32361
#9 0x0000155547a9d26c in Realm::LocalTaskProcessor::execute_task (this=0xd249fa0, func_id=4, task_args=...) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/proc_impl.cc:1175
#10 0x0000155547b11f9a in Realm::Task::execute_on_processor (this=0x154a9ddd5940, p=...) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/tasks.cc:326
#11 0x0000155547b16cbe in Realm::UserThreadTaskScheduler::execute_task (this=0xae35bc0, task=0x154a9ddd5940) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/tasks.cc:1687
#12 0x0000155547b14d45 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0xae35bc0) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/tasks.cc:1160
#13 0x0000155547b1c736 in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop> (obj=0xae35bc0) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/threads.inl:97
#14 0x0000155547b29fdd in Realm::UserThread::uthread_entry () at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/threads.cc:1355
#15 0x00001555528722e0 in ?? () from /lib64/libc.so.6
#16 0x0000000000000000 in ?? ()
>>> p did
$4 = 216172782113786612
The hang on sapling has to do with profiling; it exists in the master and control replication branches and doesn't have anything to do with shardrefine.
I will need to investigate why index partition distributed collectables are not being collected.
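As background for the counters inspected later in this thread (total_sent_references and total_received_references), the downgrade check is essentially distributed reference counting. Here is a toy sketch (hypothetical simplification, not Legion's implementation; NodeState and can_downgrade are invented names) of the invariant: a collectable can only be downgraded once the references sent between nodes balance the references received, so a node that keeps manufacturing new references stalls the downgrade indefinitely.

```python
# Toy sketch of distributed reference counting for a collectable: the
# owner may downgrade only when, summed over all nodes, every reference
# that was sent has also been received (none are still in flight).
from dataclasses import dataclass

@dataclass
class NodeState:
    total_sent_references: int
    total_received_references: int

def can_downgrade(nodes):
    sent = sum(n.total_sent_references for n in nodes)
    received = sum(n.total_received_references for n in nodes)
    return sent == received

print(can_downgrade([NodeState(3, 3), NodeState(2, 2)]))  # balanced
print(can_downgrade([NodeState(3, 2), NodeState(2, 2)]))  # reference in flight
```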
I pushed a fix for the hang on sapling.
Please pull and try the latest shardrefine on blaze. If it is still live-locking in the same way then break at legion_replication.cc:1008 on any node and print out the did of what you hit, compute (did & 0xfff) % NUMBER_OF_NODES, go to that node, break on garbage_collection.cc:1188 conditioned on the did being the same as the one you had before, and when you hit it print out current_state, total_sent_references, and total_received_references.
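A quick sketch of that owner-node arithmetic, applied to the two dids printed from gdb earlier in this thread (reading the expression as (did & 0xfff) % NUMBER_OF_NODES; that the low 12 bits encode the owner is inferred from the expression, not confirmed elsewhere):

```python
# Maps a DistributedID to the node given by (did & 0xfff) % num_nodes,
# per the expression above; the low-12-bit mask comes from that expression.
def owner_node(did, num_nodes):
    return (did & 0xFFF) % num_nodes

# The two dids printed from gdb above, on an 8-node run:
for did in (216172782113786540, 216172782113786612):
    print(hex(did), "->", owner_node(did, 8))  # both map to node 4
```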
It is shutting down on sapling now, but not on blaze.
On blaze I'm still seeing the live-lock but I never hit the conditioned breakpoint on the second node. I was able to reduce the problem to 8 nodes.
Computing (did & 0xfff) % 8 I see 0 -> 4, 1 -> 5, 2 -> 6, 3 -> 7, but then none of the nodes 4, 5, 6, 7 ever hit the conditional garbage_collection.cc:1188 breakpoint, or a breakpoint I set on garbage_collection.cc:1008, until I continue nodes 0, 1, 2, 3.
When you break on legion_replication.cc:1008, instead try printing downgrade_owner and then go to that node and set a conditional breakpoint on legion_replication.cc:1188 with the did.
I was able to reproduce on sapling. Unfortunately the smallest size was 16 ranks on 2 nodes.
There are some processes on c0001: 1416, 1417, 1418, 1419, 1420, 1421, 1422, 1423 and c0002: 1495608, 1495609, 1495610, 1495611, 1495612, 1495613, 1495614, 1495615.
I can't seem to run 16 ranks on 1 node, we end up with processes in the D state on startup, the node gets drained in slurm, and then I have to reboot.
If you want to run it yourself do the following:
salloc -N 2 -p cpu --exclusive
cd /scratch2/seshu/legion_s3d_subranks/Ammonia_Cases
./ammonia_job.sh
I will have to kill my slurm job first in order for you to run it.
There are some processes on c0001: 1416, 1417, 1418, 1419, 1420, 1421, 1422, 1423 and c0002: 1495608, 1495609, 1495610, 1495611, 1495612, 1495613, 1495614, 1495615.
The processes seem to be gone and it looks like your job is over.
I can't seem to run 16 ranks on 1 node, we end up with processes in the D state on startup, the node gets drained in slurm, and then I have to reboot.
That needs to be reported to action@cs. It's a failure of NFS.
Pull and try again with the most recent shardrefine.
It works on 16 nodes on blaze and 24 nodes on perlmutter. I'd like to try to see if we can scale to the full machine.
I'm seeing a shutdown hang with shardrefine. Here are some stack traces: