eddy16112 opened this issue 2 years ago
The completion queue test failed: https://gitlab.com/StanfordLegion/legion/-/jobs/3701536692
I've seen it at least twice: https://gitlab.com/StanfordLegion/legion/-/jobs/3676803388
Do we have a bullet for "{realm}: network still not quiescent after 10 attempts"? Failing here:
Here's a new failure for MutexChecker: https://gitlab.com/StanfordLegion/legion/-/jobs/4482517781
I fixed the attach_file_mini test.
Just another failure with realm_reductions.cc:
> Just another failure with realm_reductions.cc:
FWIW, that failure mode of the network not being quiescent is not specific to that test. I've seen it on lots of different tests.
Here's a fun variation of the mutex checker overflow: https://gitlab.com/StanfordLegion/legion/-/jobs/4802099679
Another one for gcc9_cxx11_release_gasnetex_ucx_regent: https://gitlab.com/StanfordLegion/legion/-/jobs/4803853742
Not sure if that's a duplicate.
Is this a new failure mode?
> Another one for gcc9_cxx11_release_gasnetex_ucx_regent:
That is a known issue in the Legion master branch that is fixed in the control replication branch and is too hard to backport.
> Is this a new failure mode?
No, it's already fixed and it was not intermittent.
Have you seen this one before (on latest control_replication)?
In the latest control_replication, I'm still seeing:
> Have you seen this one before (on latest control_replication)?
Yes, and master and every other branch I've worked on. PMI does not like something in our docker setup.
> In the latest control_replication, I'm still seeing:
Try again.
Some clues as to the problems with the PMI setup:
https://stackoverflow.com/questions/23237026/simple-mpi-program-fail-with-large-number-of-processes
https://stackoverflow.com/questions/29315216/mpich-example-cpi-generates-error-when-it-runs-on-multiple-fresh-installed-vps
It's better to refer to machines by IP address than by host name.
We're running mpirun -n 2 from https://gitlab.com/StanfordLegion/legion/-/blob/master/.gitlab-ci.yml?ref_type=heads#L312, which I believe will implicitly refer to localhost. Do you really think that will be an issue with the host name?
I have a suspicion that those failures are actually related to this commit I made today: https://gitlab.com/StanfordLegion/legion/-/commit/831103cb94264df3f6c7326cd84d3df0b1425b1f
It handles the race where you launch Legion on multiple nodes, the whole program runs on one node, and that node exits before the other nodes have even finished starting up, so the job scheduler starts trying to tear down the other processes before they're done. Let's see if those kinds of errors go away.
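To make that race concrete, here is a minimal MPI-style sketch of the pattern being described. It is illustrative only: the names and the barrier-before-exit fix shown here are assumptions about the general shape of the problem, not the contents of that commit.

```cpp
// Illustrative sketch only, not the actual Legion/Realm shutdown code.
// It shows the race described above: one rank does all the work and exits
// while other ranks are still starting up, so the launcher (mpirun/PMI)
// begins tearing those ranks down before they have finished initializing.
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0) {
    // Hypothetical: the whole program's work happens on rank 0.
    std::printf("rank 0 finished all the work\n");
    // Without the barrier below, rank 0 may reach MPI_Finalize and exit
    // while other ranks are still inside MPI_Init, triggering an early
    // teardown of those processes by the job scheduler.
  }

  // The general fix pattern: no rank exits until every rank has at least
  // finished starting up.
  MPI_Barrier(MPI_COMM_WORLD);

  MPI_Finalize();
  return 0;
}
```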
I am adding two more failures to make sure they get addressed:
We need to audit all reported failures here to see if any still remain.
At least failures 1, 2, and 15 are still occurring, since they also show up in the fuzzer tests (see #1745).
Here's a recent CI failure for 2: https://gitlab.com/StanfordLegion/legion/-/jobs/7919090909
Here's a CI failure for 15: https://gitlab.com/StanfordLegion/legion/-/jobs/7919090855
This might be related to 15, but not sure: https://gitlab.com/StanfordLegion/legion/-/jobs/7920864325
runtime/realm/gasnetex/gasnetex_internal.cc:3596: bool Realm::GASNetEXInternal::check_for_quiescence(size_t): Assertion `0' failed.
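For context on this failure shape: the messages above suggest a bounded retry loop that polls whether all network traffic has drained and gives up with an assertion after a fixed number of attempts. The sketch below is only an illustration with hypothetical names; it is not the real Realm::GASNetEXInternal::check_for_quiescence.

```cpp
// Hypothetical illustration of the retry-then-assert shape suggested by the
// log messages; not the real Realm implementation.
#include <cassert>
#include <chrono>
#include <cstdio>
#include <thread>

// Placeholder: the real runtime would check outstanding message/credit
// counters here.
static bool network_is_quiescent() { return false; }

bool check_for_quiescence(int max_attempts = 10) {
  for (int attempt = 0; attempt < max_attempts; attempt++) {
    if (network_is_quiescent())
      return true;  // all traffic drained, safe to shut down
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  }
  // This corresponds to the "network still not quiescent after 10 attempts"
  // message: traffic is still in flight at shutdown, so the runtime aborts
  // rather than exit with messages outstanding.
  std::fprintf(stderr, "network still not quiescent after %d attempts\n",
               max_attempts);
  assert(0);
  return false;
}
```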
There's also this, which might be a hang: https://gitlab.com/StanfordLegion/legion/-/jobs/7920864333
HELP! Alarm triggered - likely hang!
@elliottslaughter This was probably discussed somewhere else already, but I recall you did some fuzz testing a relatively short time ago that exposed a number of bugs. Would you be able to describe what sort of fuzz tester it is, or perhaps point to a place that has some context on it? I would be open to discussing integrating fuzz testing for Realm, either as a standalone tool that we run and maintain ourselves or something derived from what you have already done.
We have noticed that some examples crash non-deterministically in CI, so I plan to use this issue to track them; that way people won't be surprised if they see the same error in their development branch.
Such non-deterministic crashes are difficult to reproduce. We may need to run multiple containers concurrently to increase the contention on the machine.
1. Realm::MutexChecker::lock_fail. Often seen in GASNetEX CI runs; we need to keep track of whether it also happens with GASNet. Job log: https://gitlab.com/StanfordLegion/legion/-/jobs/2775315984 (an illustrative sketch follows after this list)
2. Realm::ChunkedRecycler<T, CHUNK_SIZE>::~ChunkedRecycler(). At least one GASNetEXEvent is allocated but not freed; also often seen in GASNetEX CI runs. Job log: https://gitlab.com/StanfordLegion/legion/-/jobs/2791336163 Also tracked in https://github.com/StanfordLegion/legion/issues/1304
3. FIXED: realm proc_group. Job log: https://gitlab.com/StanfordLegion/legion/-/jobs/2810475096 (fixed by https://gitlab.com/StanfordLegion/legion/-/merge_requests/784). Crashed again on macOS: https://gitlab.com/StanfordLegion/legion/-/jobs/4518715817
4. FIXED: crash in examples/separate_compilation.rg. See https://github.com/StanfordLegion/legion/issues/1478; unclear whether this is a crash in the code itself or an issue in the launcher (MPI/PMI). Job log: https://gitlab.com/StanfordLegion/legion/-/jobs/3095304675
5. realm compqueue with UCX. Job log: https://gitlab.com/StanfordLegion/legion/-/jobs/4241109046
6. Legion Spy: https://gitlab.com/StanfordLegion/legion/-/jobs/4318846145
7. FIXED: attach_file_mini on macOS with C++17: https://gitlab.com/StanfordLegion/legion/-/jobs/4433345645
8. Regent non-deterministic segfaults: #1490
9. Temporarily FIXED by removing the cancel_operation: realm test_profiling poisoned failure. https://gitlab.com/StanfordLegion/legion/-/merge_requests/1077 https://gitlab.com/StanfordLegion/legion/-/jobs/4505374541 https://gitlab.com/StanfordLegion/legion/-/jobs/5715868078 The following assertion failed:
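Regarding item 1 above, here is a rough idea of what this kind of lock check looks like. This is a hypothetical sketch, not the real Realm::MutexChecker: a debug counter tracks how many threads are simultaneously inside a section that should be exclusive, and the process aborts once the count exceeds a sanity bound, which is the overflow the lock_fail messages refer to.

```cpp
// Hypothetical sketch of a lock-overflow check; not the real Realm::MutexChecker.
#include <atomic>
#include <cstdio>
#include <cstdlib>

class CheckedSection {
public:
  explicit CheckedSection(int limit) : count_(0), limit_(limit) {}

  void enter() {
    // Count how many threads are currently inside this supposedly
    // exclusive section.
    int now = ++count_;
    if (now > limit_) {
      // Analogous in spirit to a lock_fail: more concurrent holders than
      // should ever be possible, so the mutex (or its user) is broken.
      std::fprintf(stderr, "lock check failed: %d holders (limit %d)\n",
                   now, limit_);
      std::abort();
    }
  }

  void exit() { --count_; }

private:
  std::atomic<int> count_;
  int limit_;
};
```

This is only meant to make the "overflow" wording concrete; whether the real checker counts holders, waiters, or something else is not something the logs here reveal.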