StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Realm: Non-deterministic crashes of Legion/Realm/Regent examples in the CI #1305

Open eddy16112 opened 2 years ago

eddy16112 commented 2 years ago

We have noticed that some examples crash non-deterministically in the CI, so I plan to use this issue to track them, so that people won't be surprised if they see the same errors in their development branches.

Such non-deterministic crashes are difficult to reproduce. We may need to run multiple containers concurrently to increase contention on the machine.

  1. Realm::MutexChecker::lock_fail This one is often seen in GASNetEX CIs; we need to keep track of whether it also happens with GASNet. Here is a job log: https://gitlab.com/StanfordLegion/legion/-/jobs/2775315984

  2. Realm::ChunkedRecycler<T, CHUNK_SIZE>::~ChunkedRecycler() At least one GASNetEXEvent is allocated but not freed. This one is often seen in GASNetEX CIs. Here is a job log: https://gitlab.com/StanfordLegion/legion/-/jobs/2791336163 Here is another issue tracking it: https://github.com/StanfordLegion/legion/issues/1304

  3. FIXED: realm proc_group Here is a job log: https://gitlab.com/StanfordLegion/legion/-/jobs/2810475096 (Fixed by https://gitlab.com/StanfordLegion/legion/-/merge_requests/784) Crashed again on macOS: https://gitlab.com/StanfordLegion/legion/-/jobs/4518715817

  4. FIXED: crash in examples/separate_compilation.rg https://github.com/StanfordLegion/legion/issues/1478 It is unclear whether this is a crash in the code itself or an issue in the launcher (MPI/PMI). Job log: https://gitlab.com/StanfordLegion/legion/-/jobs/3095304675

  5. realm compqueue with ucx Here is the job log: https://gitlab.com/StanfordLegion/legion/-/jobs/4241109046

  6. legion spy https://gitlab.com/StanfordLegion/legion/-/jobs/4318846145

  7. FIXED: attach_file_mini on macOS with C++17 https://gitlab.com/StanfordLegion/legion/-/jobs/4433345645

  8. Regent non-deterministic segfaults #1490

  9. Temporarily FIXED by removing the cancel_operation: realm test_profiling poisoned failure https://gitlab.com/StanfordLegion/legion/-/merge_requests/1077 https://gitlab.com/StanfordLegion/legion/-/jobs/4505374541 https://gitlab.com/StanfordLegion/legion/-/jobs/5715868078 The assertion in the following code failed (see the sketch after this list for a possible timing-independent rewrite):

    
    // the child task is asked to sleep for 5 seconds
    cargs.sleep_useconds = 5000000;
    Event e4 = task_proc.spawn(CHILD_TASK, &cargs, sizeof(cargs), prs);
    // give the child task time to start, then cancel it mid-sleep
    sleep(2);
    int info = 111;
    e4.cancel_operation(&info, sizeof(info));
    // a cancelled task should poison its finish event
    bool poisoned = false;
    e4.wait_faultaware(poisoned);
    assert(poisoned);


10. realm `deferred_allocs`
https://gitlab.com/StanfordLegion/legion/-/jobs/4505374448

11. realm `event_subscribe`, possibly related to the inaccurate usleep in containers
https://gitlab.com/StanfordLegion/legion/-/jobs/4513555518
https://gitlab.com/StanfordLegion/legion/-/jobs/5954499303

12. realm `ctxswitch`
https://gitlab.com/StanfordLegion/legion/-/jobs/4521720755
https://gitlab.com/StanfordLegion/legion/-/jobs/6022551375

13. jupyter notebook timeout
https://gitlab.com/StanfordLegion/legion/-/jobs/4521720751

14. another realm compqueue with gasnetex
https://gitlab.com/StanfordLegion/legion/-/jobs/5170623224

15. realm `simple_reduce` network still not quiescent
https://gitlab.com/StanfordLegion/legion/-/jobs/6079379775

16. unknown, barrier-related; the active message received seems to be incorrect
https://gitlab.com/StanfordLegion/legion/-/jobs/6413356142 

17. realm `event_subscribe`
https://gitlab.com/StanfordLegion/legion/-/jobs/6393148865

18. another failure mode with gasnetex during shutdown
https://gitlab.com/StanfordLegion/legion/-/jobs/6770356161
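
The assertion in item 9 above depends on wall-clock timing: the parent sleeps for 2 seconds and then cancels a child task that is supposed to still be asleep (it sleeps for 5 seconds), so on an overloaded CI machine the cancellation can lose the race and the finish event is never poisoned. Below is a minimal sketch of a timing-independent variant of that test, reusing the placeholder names from the snippet above (CHILD_TASK, cargs, prs, task_proc) and assuming Realm poisons the finish event of a task that is cancelled before its precondition triggers; it is an illustration only, not the change that was merged (the merge request simply removed the cancel_operation).

    // Sketch: gate the child task on a user event so the cancellation is
    // guaranteed to arrive before the task ever starts, instead of relying
    // on sleep() timing. Assumes the Realm namespace is in scope, as in the
    // snippet above.
    UserEvent start = UserEvent::create_user_event();
    // The user event is a precondition: the task cannot begin until it triggers.
    Event e4 = task_proc.spawn(CHILD_TASK, &cargs, sizeof(cargs), prs, start);
    int info = 111;
    // Cancel while the task is still waiting on its precondition.
    // (Assumption: Realm poisons the finish event of a not-yet-started task.)
    e4.cancel_operation(&info, sizeof(info));
    start.trigger();  // release the precondition; the cancelled task should never run
    bool poisoned = false;
    e4.wait_faultaware(poisoned);
    assert(poisoned);
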
lightsighter commented 1 year ago

The completion queue test: https://gitlab.com/StanfordLegion/legion/-/jobs/3701536692 I've seen it at least twice: https://gitlab.com/StanfordLegion/legion/-/jobs/3676803388

elliottslaughter commented 1 year ago

Do we have a bullet for `{realm}: network still not quiescent after 10 attempts`? Failing here:

https://gitlab.com/StanfordLegion/legion/-/jobs/4481778384

elliottslaughter commented 1 year ago

Here's a new failure for MutexChecker: https://gitlab.com/StanfordLegion/legion/-/jobs/4482517781

lightsighter commented 1 year ago

I fixed the attach_file_mini test.

apryakhin commented 1 year ago

Just another failure with realm_reductions.cc:

lightsighter commented 1 year ago

> Just another failure with realm_reductions.cc:

FWIW, that failure mode of the network not being quiescent is not specific to that test. I've seen it on lots of different tests.

lightsighter commented 1 year ago

Here's a fun variation of the mutex checker overflow: https://gitlab.com/StanfordLegion/legion/-/jobs/4802099679

apryakhin commented 1 year ago

Another one for gcc9_cxx11_release_gasnetex_ucx_regent: https://gitlab.com/StanfordLegion/legion/-/jobs/4803853742

Not sure if that's a duplicate

elliottslaughter commented 1 year ago

Is this a new failure mode?

https://gitlab.com/StanfordLegion/legion/-/jobs/4810462264

lightsighter commented 1 year ago

> Another one for gcc9_cxx11_release_gasnetex_ucx_regent:

That is a known issue in the Legion master branch that is fixed in the control replication branch and is too hard to backport.

> Is this a new failure mode?

No, it's already fixed and it was not intermittent.

elliottslaughter commented 1 year ago

Have you seen this one before (on latest control_replication)?

https://gitlab.com/StanfordLegion/legion/-/jobs/4811257857

elliottslaughter commented 1 year ago

In the latest control_replication, I'm still seeing:

https://gitlab.com/StanfordLegion/legion/-/jobs/4811257849

lightsighter commented 1 year ago

> Have you seen this one before (on latest control_replication)?

Yes, and on master and every other branch I've worked on. PMI does not like something in our docker setup.

> In the latest control_replication, I'm still seeing:

Try again.

lightsighter commented 1 year ago

Some clues as to the problems with the PMI setup:

https://stackoverflow.com/questions/23237026/simple-mpi-program-fail-with-large-number-of-processes https://stackoverflow.com/questions/29315216/mpich-example-cpi-generates-error-when-it-runs-on-multiple-fresh-installed-vps

It is better to refer to machines by IP address than by host name.

elliottslaughter commented 1 year ago

We're running `mpirun -n 2` from https://gitlab.com/StanfordLegion/legion/-/blob/master/.gitlab-ci.yml?ref_type=heads#L312, which I believe will implicitly refer to localhost. Do you really think the host name will be an issue in that case?

lightsighter commented 1 year ago

I have a suspicion that those failures are actually related to this commit that I made today: https://gitlab.com/StanfordLegion/legion/-/commit/831103cb94264df3f6c7326cd84d3df0b1425b1f It handles the race where you launch Legion on multiple nodes, the whole program runs on one node, and that node exits before the other nodes have even finished starting up, so the job scheduler starts trying to tear down the other processes before they're done. Let's see if those kinds of errors go away.

apryakhin commented 7 months ago

I am adding two more failures to make sure they get addressed:

apryakhin commented 2 months ago

We need to audit all the failures reported here to see if any still remain.

lightsighter commented 2 months ago

At least 1, 2, and 15 are still occurring, since they also show up in the fuzzer tests (see #1745).

elliottslaughter commented 2 months ago

Here's a recent CI failure for 2: https://gitlab.com/StanfordLegion/legion/-/jobs/7919090909

Here's a CI failure for 15: https://gitlab.com/StanfordLegion/legion/-/jobs/7919090855

elliottslaughter commented 2 months ago

This might be related to 15, but not sure: https://gitlab.com/StanfordLegion/legion/-/jobs/7920864325

    runtime/realm/gasnetex/gasnetex_internal.cc:3596: bool Realm::GASNetEXInternal::check_for_quiescence(size_t): Assertion `0' failed.

elliottslaughter commented 2 months ago

There's also this, which might be a hang: https://gitlab.com/StanfordLegion/legion/-/jobs/7920864333

    HELP!  Alarm triggered - likely hang!

apryakhin commented 1 month ago

@elliottslaughter This was probably discussed somewhere else already, but I recall you did some fuzz testing a relatively short time ago that exposed a number of bugs. Would you be able to describe what sort of fuzz tester it is? Or perhaps point to a place that has some context on it. I would be open to discussing integrating fuzz testing for Realm, either as a standalone tool that we run and maintain ourselves or something derived from what you have already done.

elliottslaughter commented 1 month ago

#1745 is the fuzzer-specific issue; maybe I'll answer over there, since this thread is already quite long?