Tom94 / tev

High dynamic range (HDR) image viewer for graphics people
BSD 3-Clause "New" or "Revised" License

[macOS] Crash when rapidly creating and populating images using IPC #217

Closed by iRath96 6 months ago

iRath96 commented 6 months ago

On my system (arm64 macOS 14.2, official tev 1.26 Release), tev occasionally crashes when creating and updating images via IPC in rapid succession. This problem also occurs on some of our students' machines (with varying architectures and OS versions).

We first observed this problem with the test script of our rendering framework, which renders many small scenes in rapid succession and streams all of them live to tev (50 scenes at roughly 512x512 resolution each). The following Python script is a minimal example that reproduces the problem:

import numpy as np
import tev

width = 512
height = 512
channel_names = list("RGB")

image = np.ones((height, width, len(channel_names)))  # rows (height) first, then columns (width)
ipc = tev.TevIpc()
with ipc:
    for i in range(100):
        ipc.create_image(f"test_{i}", width, height, channel_names=channel_names)
        ipc.update_image(f"test_{i}", image, channel_names=channel_names)

What makes this hard to debug is that it only occurs with the GitHub release builds of tev (not just the most recent one, but older ones too). No matter how I compile tev on my system (Debug, Release), it never crashes, and we have observed the same on our students' machines. The issue is also highly non-deterministic: sometimes tev crashes after only a few images have been opened, sometimes it takes a hundred. It also seems to depend strongly on the image resolution and tiling used.

However, when it crashes, it's apparently always an EXC_BAD_ACCESS in computing the canvas statistics:

* thread #32, stop reason = EXC_BAD_ACCESS (code=1, address=0x4)
  * frame #0: 0x0000000000000004
    frame #1: 0x0000000100096428 tev`tev::ImageCanvas::canvasStatistics()::$_12::operator()() (.resume) + 224
    frame #2: 0x000000010001d9cc tev`std::__1::packaged_task<void ()>::operator()() + 80
    frame #3: 0x00000001000bdd1c tev`void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, tev::ThreadPool::startThreads(unsigned long)::$_0>>(void*) + 308
    frame #4: 0x000000018a31a034 libsystem_pthread.dylib`_pthread_start + 136

My current intuition is that something breaks when a new image is created via IPC while canvas statistics for an older image are still being computed. (If you change the script to first create all 100 images and then populate each of them, tev does not crash.)
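For reference, the non-crashing variant mentioned above could look roughly like the sketch below. The helper name `populate_batched` is made up for illustration; it only assumes that `ipc` offers the same `create_image` and `update_image` calls used in the reproduction script.

```python
def populate_batched(ipc, image, count=100, channel_names=("R", "G", "B")):
    """Create every image first, then fill them in a second pass.

    `ipc` is assumed to be an open tev.TevIpc connection as in the script
    above; this helper is illustrative and not part of tev itself.
    """
    # The image is indexed as rows (height) first, then columns (width).
    height, width = len(image), len(image[0])
    names = [f"test_{i}" for i in range(count)]

    # Pass 1: only create the (still empty) images.
    for name in names:
        ipc.create_image(name, width, height, channel_names=list(channel_names))

    # Pass 2: populate them once no further images are being created.
    for name in names:
        ipc.update_image(name, image, channel_names=list(channel_names))
    return names

# Usage, assuming tev is running and the tev Python module is importable:
#   import numpy as np, tev
#   ipc = tev.TevIpc()
#   with ipc:
#       populate_batched(ipc, np.ones((512, 512, 3)))
```

Separating the two passes means no image is created while statistics for an earlier one may still be running, which is the only difference from the crashing script.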

Let me know if you need any further information, or if you have any suggestions on how I could help investigate this further.

Tom94 commented 6 months ago

Abundant thanks for reporting this -- that's a serious issue. I can reproduce your situation. tev crashes using your Python script on a GitHub binary but not on a locally compiled binary.

Could you try the following .dmg that I just now compiled on my machine and see if it no longer crashes on your end? I've also updated the GitHub release with this one and I'll do the same for future versions.

tev.dmg.zip

Could you please also quickly confirm that when you say "varying architecture and OS versions" you still only mean macOS -- just Intel vs. Arm and, say, Ventura vs. Catalina?

I think I know what the problem is, but this is going into speculation territory: tev makes liberal use of C++20 coroutines in its thread pool code (involving image loading and canvas statistics, as you observe). Apple Clang only recently stabilized coroutines -- and GitHub's CI machines that tev uses to generate its releases have a somewhat old version of that compiler where coroutines are still an "experimental" feature. I suspect a compiler bug somewhere in that area... which your locally compiled binaries (with reasonably recent Xcode CLI tools) no longer have. I've had a couple of similar compiler bug situations with coroutines in the past, so I wouldn't be too surprised by this.

iRath96 commented 6 months ago

I can confirm that the version of tev you sent no longer crashes on my machine 👍🏻

Sorry for the confusion, yes, I only meant students running macOS (with varying architecture and OS version). We also have many students running Windows and Linux, and none of those were affected, despite running the same test scripts.

Your explanation does make sense -- do you think simply using a newer runner image in the GitHub workflows could fix this (e.g., macos-13 instead of macos-11), or could this potentially be problematic for backward compatibility? Doesn't GitHub's macos-11 image ship with more recent compilers (e.g., open-source clang from llvm@15) that could be tested as well?

(For reference: I was testing on my system with Apple clang version 15.0.0)

Tom94 commented 6 months ago

It's a good suggestion. Backward compatibility wouldn't be affected -- and I've already got a PR ready to merge once Apple Clang 15 is available on macos-13: https://github.com/Tom94/tev/pull/207

Somewhat frustratingly, LLVM Clang 15 is not the same as Apple Clang 15. The version numbers aren't in sync and the former's coroutine support is still just experimental while the latter's is proper. Won't risk that again after this issue. :)

(The build I uploaded above is also Apple Clang 15.0.0 FWIW)

Tom94 commented 6 months ago

Following up: it turns out GitHub's Apple Clang got updated to 14.0.3 sometime in the past 3 months, which dropped the "experimental" coroutine business. I can't reproduce the crashes with an up-to-date GitHub build anymore. Hooray!

Thanks again for reporting this and providing reproduction steps. Hopefully it'll remain a problem of the past.
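Since Apple Clang and LLVM Clang version numbers are not in sync, a release script could check which compiler it is actually running before building. The following is a minimal sketch, not something tev does: the function names are hypothetical, the 14.0.3 threshold is taken from the comment above, and the parsing assumes the usual first line of `clang --version` (for example `Apple clang version 15.0.0 (...)`).

```python
import re

# Assumption from this thread: Apple Clang 14.0.3 no longer treats
# coroutines as experimental. Adjust if that turns out to be inaccurate.
MIN_APPLE_CLANG = (14, 0, 3)

def parse_clang_version(version_output):
    """Return (vendor, (major, minor, patch)) from `clang --version` text, or None."""
    match = re.search(
        r"^(.*?)clang version (\d+)\.(\d+)\.(\d+)", version_output, re.MULTILINE
    )
    if match is None:
        return None
    # Anything not prefixed with "Apple" is treated as an LLVM-derived build.
    vendor = "apple" if match.group(1).strip().lower() == "apple" else "llvm"
    version = tuple(int(match.group(i)) for i in (2, 3, 4))
    return vendor, version

def looks_safe_for_release(version_output):
    """True only for Apple Clang at or above the assumed threshold."""
    parsed = parse_clang_version(version_output)
    if parsed is None:
        return False
    vendor, version = parsed
    return vendor == "apple" and version >= MIN_APPLE_CLANG
```

The check deliberately rejects non-Apple builds regardless of version number, mirroring the point above that LLVM Clang 15 and Apple Clang 15 are different compilers.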