apache / celix

Apache Celix is a framework for C and C++14 to develop dynamic modular software applications using component and in-process service-oriented programming.
https://celix.apache.org/
Apache License 2.0
166 stars 87 forks source link

Steady thread condition wait for OSX #587

Open pnoltes opened 1 year ago

pnoltes commented 1 year ago

For Linux, the celix functions celixThreadCondition_waitUntil, celixThreadCondition_timedwaitRelative, celixThreadCondition_getTime and celixThreadCondition_getDelayedTime are based on a monotonic time to prevent unsteady behavior due to OS sleeps or time updates. This is possible because the celixThreadCondition_init function initializes the underlining thread condition with time MONOTONIC.

For OSX this is not possible. Currently for OSX a relative sleep is calculated and a call to pthread_cond_timedwait_relative_np is used. Based on test result this seems unsteady, but further investigation is needed. There is also not a lot of documentation for pthread_cond_timedwait_relative_np. Does it even return a ETIMEDOUT code?

If this is updated the following test can be reenabled: ScheduledEventTestSuite, ScheduledEventTimeoutLogTest

And the error margin in ScheduledEventTestSuite and ThreadsTestSuite can be made stricter.

PengZheng commented 1 year ago

Does it even return a ETIMEDOUT code?

Yes, it does. I found it used in several high profile projects including musl/chromium. https://chromium.googlesource.com/v8/v8/+/62f929ff4cf322bc05e3a384c4ad18005f50ebda/src/base/platform/condition-variable.cc#81

It's similar to pthread_cond_timedwait but with relative timeout.

http://newosxbook.com/ChangeLog.html

https://books.google.com/books?id=K8vUkpOXhN4C&pg=PA760&lpg=PA760&dq=macOS+detailed+explanation+of+pthread_cond_timedwait_relative_np&source=bl&ots=ONpoQU_qXD&sig=ACfU3U0vI0QIGQa-wubBjj_t73f0N3P7FQ&hl=en&sa=X&ved=2ahUKEwii9dSykIiAAxX8EEQIHVGgDxEQ6AF6BAg5EAM#v=onepage&q=macOS%20detailed%20explanation%20of%20pthread_cond_timedwait_relative_np&f=false

In my own experience, testing for timing related to task scheduling is always a challenging task, especially in cloud/container environment. We must take OS scheduling (in CI environment, we have both guest and host scheduling) into account when designing such tests.

The following test failure can serve as an evidence:

https://github.com/apache/celix/actions/runs/5525141973/jobs/10078372242#step:7:3293

/home/runner/work/celix/celix/libs/framework/gtest/src/ScheduledEventTestSuite.cc:627: Failure
Value of: success
  Actual: false
Expected: true
/home/runner/work/celix/celix/libs/framework/gtest/src/ScheduledEventTestSuite.cc:630: Failure
Expected equality of these values:
  1
  count.load()
    Which is: 0
/home/runner/work/celix/celix/libs/framework/gtest/src/ScheduledEventTestSuite.cc:642: Failure
Value of: success
  Actual: false
Expected: true
/home/runner/work/celix/celix/libs/framework/gtest/src/ScheduledEventTestSuite.cc:645: Failure
Expected equality of these values:
  2
  count.load()
    Which is: 0
[2023-07-11T21:47:21] [  debug] [celix_framework] [celix_framework_processScheduledEvents:1484] Removing processed scheduled event '' (id=1) for bundle if 0.
[2023-07-11T21:47:21] [  trace] [celix_framework] [celix_framework_shutdownAsync:1233] Start shutdown thread for framework afe4a4e1-402e-4f97-ac11-e008ec230110
[2023-07-11T21:47:21] [  trace] [celix_framework] [celix_bundleContext_cleanup:115] Cleaning up bundle context `celix_framework` (id=0)
[2023-07-11T21:47:21] [  trace] [celix_framework] [framework_shutdown:1186] Celix framework shutting down
[2023-07-11T21:47:21] [  trace] [celix_framework] [celix_framework_stopAndJoinEventQueue:1171] Stop and joining event loop thread for framework afe4a4e1-402e-4f97-ac11-e008ec230110
[2023-07-11T21:47:21] [  debug] [celix_framework] [celix_framework_stopAndJoinEventQueue:1180] Joined event loop thread for framework afe4a4e1-402e-4f97-ac11-e008ec230110
[  FAILED  ] ScheduledEventTestSuite.CxxWaitForScheduledEvent (423 ms)

I don't know how macOS and Linux environments are managed by GitHub actions. The different results we obtained are more likely due to environment difference.

PengZheng commented 1 year ago

The following comment confused me:

    //When removing the event
    celix_bundleContext_removeScheduledEvent(fw->getFrameworkBundleContext()->getCBundleContext(), eventId);

    //Then the log callback is called at least one more time with a warning log message, because remove
    //callback took too long
    //(note the logCount can be increased more than once, due to another processing thread)
    EXPECT_GE(logCount.load(), 2);

Why a time-consuming remove callback is guaranteed to trigger a warning, given that we never track the duration like we do in celix_scheduledEvent_process?

void celix_scheduledEvent_setRemoved(celix_scheduled_event_t* event) {
    if (event->removedCallback) {
        event->removedCallback(event->removedCallbackData);
    }
    celixThreadMutex_lock(&event->mutex);
    event->isRemoved = true;
    celixThreadCondition_broadcast(&event->cond); // for changed isRemoved
    celixThreadMutex_unlock(&event->mutex);
}
// celix_scheduledEvent_process
    struct timespec start = celix_gettime(CLOCK_MONOTONIC);
    event->callback(event->callbackData); // note called outside of lock
    struct timespec end = celix_gettime(CLOCK_MONOTONIC);

It seems to me that either the last check is unnecessary or our implementation of celix_scheduledEvent_setRemoved is incorrect.

IMHO, ScheduledEventTestSuite.CxxWaitForScheduledEvent failure is more worrisome, since it can fail on all platforms with various possibilities.