SRGSSR / pillarbox-apple

A next-generation reactive media playback ecosystem for Apple platforms.
https://testflight.apple.com/join/TS6ngLqf
MIT License
50 stars 7 forks source link

Sometimes stuck unit tests #353

Closed defagos closed 1 year ago

defagos commented 1 year ago

As a Pillarbox developer I need a reliable CI to have a smooth workflow.

Acceptance criteria

Hints

Sometimes unit tests get stuck, either locally or on CI agents. This issue appeared after we started work on comScore.

The issue was always appearing when updating Swift packages in the logs so we initially thought this could have a relationship with network issues we recently had. We were wrong.

The issue occurs consistently when packages are updated because it occurs at the end of a package test target execution. If we pause debugging when such hangs occur on a local development machine we can see that all tests passed but that execution is stuck, preventing the test target from exiting properly. Checking the stack trace it is likely this issue is related to a deadlock which might occur during comScore cleanup:

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x0000000100dcfce8 libsystem_kernel.dylib`__psynch_mutexwait + 8
    frame #1: 0x0000000100e34888 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 80
    frame #2: 0x0000000100e32398 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 244
    frame #3: 0x0000000106556a04 ComScore`ComScore::CriticalSection::enter() const + 12
    frame #4: 0x00000001064dc154 ComScore`ComScore::TaskExecutor::shutdown() + 108
    frame #5: 0x00000001064efba0 ComScore`ComScore::Core::shutDown() + 24
    frame #6: 0x0000000106513ff4 ComScore`ComScore::CoreContainer::~CoreContainer() + 24
    frame #7: 0x000000018010aeb4 libsystem_c.dylib`__cxa_finalize_ranges + 424
    frame #8: 0x000000018010b248 libsystem_c.dylib`exit + 28
    frame #9: 0x0000000102c95fe4 XCTestCore`_XCTestMain + 104
    frame #10: 0x0000000100b4c868 xctest`main + 156
    frame #11: 0x0000000100d09fa0 dyld_sim`start_sim + 20
    frame #12: 0x0000000100ea9e50 dyld`start + 2544

The reason why this issue sometimes consistently appears and sometimes not is yet unclear. Some fix ideas:

Tasks

defagos commented 1 year ago

I opened a branch on which I will be able to play a bit with our project setup and get quick feedback from CI agents whether this fixes anything or not.

If we disable analytics tests in the Pillarbox-Package-NoDebug scheme tests run fine, otherwise they quite surely get stuck on our CI agents.

defagos commented 1 year ago

With parallel testing disabled for the analytics test target the result is no better but the log is more readable and CI gets stuck earlier and in a way more identifiable in the logs. So parallelization removal might be helpful during investigations.

We can also disable other test targets to isolate the analytics target and spare some time when running test CI jobs.

defagos commented 1 year ago

Maybe talking to a former self might help.

defagos commented 1 year ago

Here is the full stack trace for all threads:

(lldb) bt all
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x0000000104d33ce8 libsystem_kernel.dylib`__psynch_mutexwait + 8
    frame #1: 0x0000000104ebc888 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 80
    frame #2: 0x0000000104eba398 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 244
    frame #3: 0x000000010a48ea04 ComScore`ComScore::CriticalSection::enter() const + 12
    frame #4: 0x000000010a414154 ComScore`ComScore::TaskExecutor::shutdown() + 108
    frame #5: 0x000000010a427ba0 ComScore`ComScore::Core::shutDown() + 24
    frame #6: 0x000000010a44bff4 ComScore`ComScore::CoreContainer::~CoreContainer() + 24
    frame #7: 0x000000018010aeb4 libsystem_c.dylib`__cxa_finalize_ranges + 424
    frame #8: 0x000000018010b248 libsystem_c.dylib`exit + 28
    frame #9: 0x0000000106a55fe4 XCTestCore`_XCTestMain + 104
    frame #10: 0x0000000104850868 xctest`main + 156
    frame #11: 0x0000000104a0dfa0 dyld_sim`start_sim + 20
    frame #12: 0x0000000104c45e50 dyld`start + 2544
  thread #4
    frame #0: 0x0000000104d32dec libsystem_kernel.dylib`__workq_kernreturn + 8
  thread #5
    frame #0: 0x0000000104d32dec libsystem_kernel.dylib`__workq_kernreturn + 8
  thread #6
    frame #0: 0x0000000104d32dec libsystem_kernel.dylib`__workq_kernreturn + 8
  thread #7, name = 'ComScore-Thread'
    frame #0: 0x0000000104d32edc libsystem_kernel.dylib`__ulock_wait + 8
    frame #1: 0x000000018013605c libdispatch.dylib`_dlock_wait + 52
    frame #2: 0x0000000180135e0c libdispatch.dylib`_dispatch_thread_event_wait_slow + 52
    frame #3: 0x0000000180146218 libdispatch.dylib`__DISPATCH_WAIT_FOR_QUEUE__ + 392
    frame #4: 0x0000000180145d18 libdispatch.dylib`_dispatch_sync_f_slow + 160
    frame #5: 0x000000010a469d20 ComScore`ComScore::Core::getCurrentHostApplicationState() + 220
    frame #6: 0x000000010a450874 ComScore`std::__1::__function::__func<ComScore::Core::start(ComScore::String, long long)::$_8, std::__1::allocator<ComScore::Core::start(ComScore::String, long long)::$_8>, void (ComScore::Task*)>::operator()(ComScore::Task*&&) + 2000
    frame #7: 0x000000010a413634 ComScore`ComScore::Task::run() + 44
    frame #8: 0x000000010a44be8c ComScore`ComScore::ContextTasksThread<ComScore::Core>::run() + 204
    frame #9: 0x000000010a48b1e0 ComScore`ComScore::Thread::threadEntryPoint() + 104
    frame #10: 0x000000010a48f570 ComScore`threadEntryProc + 32
    frame #11: 0x0000000104ebf4e4 libsystem_pthread.dylib`_pthread_start + 116
  thread #8, name = 'ComScore-NetworkThread'
    frame #0: 0x0000000104d347b0 libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x0000000104ebfacc libsystem_pthread.dylib`_pthread_cond_wait + 1264
    frame #2: 0x000000010a48a9a0 ComScore`ComScore::WaitableEvent::wait(int) const + 152
    frame #3: 0x000000010a44bf08 ComScore`ComScore::ContextTasksThread<ComScore::Core>::run() + 328
    frame #4: 0x000000010a48b1e0 ComScore`ComScore::Thread::threadEntryPoint() + 104
    frame #5: 0x000000010a48f570 ComScore`threadEntryProc + 32
    frame #6: 0x0000000104ebf4e4 libsystem_pthread.dylib`_pthread_start + 116
  thread #10
    frame #0: 0x0000000000000000
  thread #11
    frame #0: 0x0000000000000000
  thread #12, name = 'com.apple.NSURLConnectionLoader'
    frame #0: 0x0000000104d312ac libsystem_kernel.dylib`mach_msg2_trap + 8
    frame #1: 0x0000000104d42154 libsystem_kernel.dylib`mach_msg2_internal + 76
    frame #2: 0x0000000104d394b8 libsystem_kernel.dylib`mach_msg_overwrite + 536
    frame #3: 0x0000000104d3161c libsystem_kernel.dylib`mach_msg + 20
    frame #4: 0x00000001803708ac CoreFoundation`__CFRunLoopServiceMachPort + 156
    frame #5: 0x000000018036aff4 CoreFoundation`__CFRunLoopRun + 1152
    frame #6: 0x000000018036a75c CoreFoundation`CFRunLoopRunSpecific + 584
    frame #7: 0x0000000183fd40fc CFNetwork`___lldb_unnamed_symbol13283 + 344
    frame #8: 0x0000000180bf29d8 Foundation`__NSThread__start__ + 704
    frame #9: 0x0000000104ebf4e4 libsystem_pthread.dylib`_pthread_start + 116
defagos commented 1 year ago

Even if keeping only ComScorePageViewTests to keep a single comScore-related test only the issue still arises. A few interesting facts:

In fact this seems related to how tests are run since the following code is affected by the same issue (provided the necessary private implementation details are exposed):

func testSimple() {
    URLSession.toggleInterceptor()
    expectation(forNotification: .didReceiveComScoreRequest, object: nil)
    Analytics.shared.sendPageView(title: "title")
    waitForExpectations(timeout: 10)
}
defagos commented 1 year ago

Here is a test project with which the issue can be reproduced. This project:

To reproduce the issue simply run the test suite (cmd+U) until tests hang, at which point pausing execution will reveal the above trace.

We can likely use it to file a bug report or maybe find a workaround.

defagos commented 1 year ago

Quick summary of investigations so far:

We can continue our investigations using the test project above and maybe find a solution or file a report. Otherwise we might need to find a way to temporarily avoid starting the comScore SDK on our CI, as having stuck jobs is incompatible with the concept of a CI.

defagos commented 1 year ago

Tested but not working:

defagos commented 1 year ago

We have investigated this issue by reducing analytics tests to the minimum. Fortunately only running the AnalyticsTests target or all Pillarbox tests delivers the same results. Without loss of generality we could therefore focus on a single target.

Removing tests we noticed that having a single comScore test for page views in a single test case works fine. Just adding a CommandersAct event test in a parallel test suite fails, though, which is how we could deduce that the issue may appear when a test suite can end without comScore having ended its startup phase, i.e. before the start event is sent.

An obvious mitigation is therefore to always wait for the start event at the beginning of a test suite, especially if this test suite is not testing comScore behavior (as testing comScore behavior only implies the start event has been emitted first). A simple implementation was tested on our CI agents and could be run successfully without deadlocks several dozens times.

defagos commented 1 year ago

A mitigation has been implemented as part of #344 since this branch was very difficult to merge due to the above problem (the likelihood of having unit tests get stuck was very high, preventing status checks from ending correctly).

defagos commented 1 year ago

If comScore SDK updates are delivered we can check whether the deadlock has been fixed as follows:

  1. Create a test branch and update comScore SDK.
  2. Remove the waitUntil in TestCase.setUp().
  3. Check that all tests run successfully locally.
  4. Push the branch and repeatedly schedule UT jobs. With ~20 jobs passing without getting stuck we can consider the issue has been fixed.

If the issue has been fixed we can remove the workaround implemented in #344, i.e.:

  1. Update ComScoreInterceptor.start() to not expect a completion closure anymore (must only setup the URLSession interceptor). Merge ComScoreInterceptor.labelsPublisher() into ComScoreInterceptor.eventPublisher(for:) implementation and remove private vars.
  2. Call ComScoreInterceptor.start() from AnalyticsListener.captureComScoreEvents(perform:) only, before starting event capture.
  3. Remove AnalyticsListener.start(completion:) and TestCase.setUp() entirely.

Here is a patch containing these changes.

defagos commented 1 year ago

Failure conditions for archiving, code quality, documentation and UTs have been configured so that jobs fail after 5 minutes. This seems quite aggressive but should be enough.

defagos commented 1 year ago

We have something that works fine enough for now, even without time limits. Reporting a reproducible scenario to comScore is likely difficult as well. Closed for now.