StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Realm: cancel_operation crashes with profiling #1623

Open eddy16112 opened 10 months ago

eddy16112 commented 10 months ago

I tried to replace the unstable sleep in the test_profiling test with events. The original code:

    cargs.sleep_useconds = 5000000;
    Event e4 = task_proc.spawn(CHILD_TASK, &cargs, sizeof(cargs), prs);
    sleep(2);
    int info = 111;
    e4.cancel_operation(&info, sizeof(info));
    bool poisoned = false;
    e4.wait_faultaware(poisoned);
    assert(poisoned);

The new one:

    cargs.sleep_useconds = 5000000;
    UserEvent u = UserEvent::create_user_event();
    cargs.wait_on = u;
    UserEvent trigger_event = UserEvent::create_user_event();
    cargs.trigger_event = trigger_event;
    Event e4 = task_proc.spawn(CHILD_TASK, &cargs, sizeof(cargs), prs);
    trigger_event.wait();
    int info = 111;
    // Make sure the cancel is called after CHILD_TASK is launched (ensured by
    // trigger_event.wait()) but before it finishes (it is held until u.trigger()).
    e4.cancel_operation(&info, sizeof(info));
    u.trigger();
    bool poisoned = false;
    e4.wait_faultaware(poisoned);
    assert(poisoned);

However, a new bug is triggered: https://gitlab.com/StanfordLegion/legion/-/jobs/5856518802

test_profiling: /builds/StanfordLegion/legion/runtime/realm/tasks.cc:1189: void Realm::ThreadedTaskScheduler::scheduler_loop(): Assertion `yield_to != Thread::self()' failed.
Signal 6 received by node 1, process 19525 (thread 7f7f6432ec00) - obtaining backtrace
Signal 6 received by process 19525 (thread 7f7f6432ec00) at: stack trace: 9 frames
  [0] = unknown symbol at unknown file:0 [00007f7f895b141f]
  [1] = raise at ../sysdeps/unix/sysv/linux/raise.c:51 [00007f7f88fa100b]
  [2] = abort at /build/glibc-wuryBv/glibc-2.31/stdlib/abort.c:79 [00007f7f88f80858]
  [3] = __assert_fail_base.cold at /build/glibc-wuryBv/glibc-2.31/assert/assert.c:92 [00007f7f88f80728]
  [4] = __assert_fail at /build/glibc-wuryBv/glibc-2.31/assert/assert.c:101 [00007f7f88f91fd5]
  [5] = Realm::ThreadedTaskScheduler::scheduler_loop() at /builds/StanfordLegion/legion/runtime/realm/tasks.cc:1189 [00005591947c59f0]
  [6] = void Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop>(void*) at /builds/StanfordLegion/legion/runtime/realm/threads.inl:97 [00005591947ce94d]
  [7] = Realm::UserThread::uthread_entry() at /builds/StanfordLegion/legion/runtime/realm/threads.cc:1355 [00005591947de1a7]
  [8] = unknown symbol at ../sysdeps/unix/sysv/linux/x86_64/__start_context.S:91 [00007f7f88fb94df]

The PR at https://gitlab.com/StanfordLegion/legion/-/merge_requests/1049 reproduces the bug. We decided to disable the cancel_operation test cases and use this issue to track the bug.
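
For anyone who wants to reason about the ordering without a Realm build, the handshake used in the new test can be modeled in standard C++. This is only a rough sketch of the synchronization pattern: `ChildArgs`, `child_task`, and `run_handshake` are hypothetical names, the atomic flag is merely a stand-in for `cancel_operation`, and nothing here exercises Realm's scheduler or reproduces the assertion failure.

```cpp
#include <atomic>
#include <functional>
#include <future>
#include <thread>

// Hypothetical stand-in for the test's child-task arguments (not Realm types).
struct ChildArgs {
  std::promise<void> started;        // plays the role of cargs.trigger_event
  std::shared_future<void> release;  // plays the role of cargs.wait_on
  std::atomic<bool>* cancelled;      // stand-in for the cancellation request
};

// Returns true if the child observed the cancellation (the "poisoned" case).
bool child_task(ChildArgs& args) {
  args.started.set_value();  // tell the parent that the task is running
  args.release.wait();       // block until the parent lets the task continue
  return args.cancelled->load();
}

bool run_handshake() {
  std::atomic<bool> cancelled{false};
  std::promise<void> release;
  ChildArgs args{std::promise<void>(), release.get_future().share(), &cancelled};
  std::future<void> started = args.started.get_future();

  // Like task_proc.spawn(...): run the child on its own thread.
  std::future<bool> poisoned =
      std::async(std::launch::async, child_task, std::ref(args));

  started.wait();         // like trigger_event.wait(): the child has launched
  cancelled.store(true);  // like e4.cancel_operation(...): child is mid-task
  release.set_value();    // like u.trigger(): allow the child to finish
  return poisoned.get();  // like e4.wait_faultaware(poisoned)
}
```

Calling `run_handshake()` should return `true` on every run, because the cancellation is always requested while the child is blocked between its two synchronization points. That deterministic window is what the `sleep(2)` in the original test only approximated.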

lightsighter commented 10 months ago

However, a new bug is triggered

This seems like a real bug, doesn't it? That assertion is in the task scheduler and is saying that we're not on the thread that we thought we were on.

eddy16112 commented 10 months ago

Yes, it is a real bug, provided nothing is wrong in my test code. We do not have stress tests for canceling events, so we did not catch the bug before. I created this issue just to remind us to pick the bug up later.

lightsighter commented 10 months ago

Ok, sounds good.