Tom94 / tev

High dynamic range (HDR) image viewer for graphics people
BSD 3-Clause "New" or "Revised" License
1.02k stars 86 forks source link

memory leak when the network update function is used #146

Closed mmp closed 2 years ago

mmp commented 2 years ago

I'm seeing a memory leak in recent versions of tev when sending images from pbrt running another process. (I've been seeing this for a while now, maybe a few months, but only made a repro case now--apologies.)

The attached repro program (extracted from pbrt, so not minimal!) takes a single command-line argument and connects to a running tev server at that address. It then spams it continuously with image updates. I am seeing two things (this is on OSX, FWIW):

So, between those two, it can add up to gigabytes after a while, especially with a long render running.

mmp commented 2 years ago

I did a little archaeology and the leak seems to go back to at least bb2b088, so it doesn't seem to have been introduced recent.y

Tom94 commented 2 years ago

Inspecting the code, I'm fairly certain I know what it is, thanks for letting me know!

It seems like old color histograms are never cleared -- even when an image that generated them is closed or mutated. I guess back when I implemented the histograms (before image updates over the network were a thing), I thought there would never be more than a couple thousand unique images per tev session. Whoops. :P

(Fix incoming tomorrow. :))

Tom94 commented 2 years ago

To shed a bit more light on why histograms aren't directly associated with their "images" in the first place: there is actually a unique histogram for each combination of image, reference image, error metric, and selected channels -- too many combinations to exhaustively enumerate.

Thus, histograms live in a separate data structure that also handles their caching & lazy computation in the background (to keep things snappy). This makes deletion without data races ever so slightly non-trivial, which is (I presume) why I didn't add it back in the day. Really should have, retrospectively... typical example of "should have done things properly from the start"!

Tom94 commented 2 years ago

If you have a C++20 compiler handy, I would appreciate if you could try the above PR to see if it gets rid of the problem with your test program.

mmp commented 2 years ago

I just tried it (by checking out the cpp20 branch). Unfortunately it still seems to be leaking.

Tom94 commented 2 years ago

Okay, I'll dig some more. Could you try attaching the repro program again? I don't think it got correctly bundled with your original post.

mmp commented 2 years ago

Oops. Look like I forgot to attach it--apologies. Here it is:

tevspew.cpp.txt

mmp commented 2 years ago

I have done some more digging and have some more information. Shortly after starting to send the image, there have been 6 allocations of a total of 150MB with this call stack (out of a total of about 250MB of memory used.)

      6 (150M) CONTENT:  VM: IOSurface
      + 6 (150M) IOSurfaceClientCreate  (in IOSurface) + 0  [0x18d42b4d4]
      +   6 (150M) -[IOSurface initWithProperties:]  (in IOSurface) + 80  [0x18d42b490]
      +     6 (150M) CA::SurfaceUtil::CAIOSurfaceCreate(unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long long, CA::SurfaceUtil::SurfaceAlignment, __CFString const*)  (in QuartzCore) + 12264  [0x18bc18ca0]
      +       6 (150M) CA::Render::create_iosurface_with_pixel_format(unsigned int, unsigned int, unsigned int, unsigned int, unsigned long long, __CFString const*)  (in QuartzCore) + 288  [0x18bb61700]
      +         6 (150M) get_unused_drawable(_CAMetalLayerPrivate*, bool, bool)  (in QuartzCore) + 412  [0x18bb61404]
      +           5 (125M) -[CAMetalLayer nextDrawable]  (in QuartzCore) + 1296  [0x18bb60fd4]
      +           ! 4 (100M) nanogui::metal_window_next_drawable(void*)  (in tev) + 88  [0x1043d3f94]  darwin.mm:178
      +           ! : 4 (100M) nanogui::Screen::draw_setup()  (in tev) + 52  [0x1043dafd8]  screen.cpp:573
      +           ! :   4 (100M) nanogui::Screen::draw_all()  (in tev) + 40  [0x1043db0a4]  screen.cpp:618
      +           ! :     2 (50.0M) nanogui::mainloop(float)  (in tev) + 520  [0x1043d43f0]  common.cpp:185
      +           ! :     | 2 (50.0M) tev::mainFunc(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&)  (in tev) + 7108  [0x104311ed4]  main.cpp:449
      +           ! :     |   2 (50.0M) main  (in tev) + 820  [0x104314bd8]  main.cpp:494
      +           ! :     |     2 (50.0M) start  (in libdyld.dylib) + 4  [0x1857d1430]
      +           ! :     1 (25.0M) nanogui::Screen::resize_event(nanogui::Array<int, 2ul> const&)  (in tev) + 76  [0x1043db614]  screen.cpp:728
      +           ! :     | 1 (25.0M) nanogui::Screen::resize_callback_event(int, int)  (in tev) + 152  [0x1043dc380]  screen.cpp:906
      +           ! :     |   1 (25.0M) -[GLFWWindowDelegate windowDidResize:]  (in tev) + 236  [0x104390800]  cocoa_window.m:267
      +           ! :     |     1 (25.0M) __CFNOTIFICATIONCENTER_IS_CALLING_OUT_TO_AN_OBSERVER__  (in CoreFoundation) + 28  [0x1858a5f08]
      +           ! :     |       1 (25.0M) ___CFXRegistrationPost_block_invoke  (in CoreFoundation) + 52  [0x185947e5c]
      +           ! :     |         1 (25.0M) _CFXRegistrationPost  (in CoreFoundation) + 456  [0x185947dc8]
      +           ! :     |           1 (25.0M) _CFXNotificationPost  (in CoreFoundation) + 720  [0x1858749d8]
      +           ! :     |             1 (25.0M) -[NSNotificationCenter postNotificationName:object:userInfo:]  (in Foundation) + 64  [0x1866003c0]
      +           ! :     |               1 (25.0M) -[NSWindow _setFrameCommon:display:fromServer:]  (in AppKit) + 3156  [0x1880f6ac0]
      +           ! :     |                 1 (25.0M) _glfwPlatformSetWindowSize  (in tev) + 292  [0x104392b60]  cocoa_window.m:1055
      +           ! :     |                   1 (25.0M) tev::ImageViewer::ImageViewer(std::__1::shared_ptr<tev::BackgroundImagesLoader> const&, bool, bool, bool)  (in tev) + 13736  [0x1042ecf1c]  ImageViewer.cpp:455
      +           ! :     |                     1 (25.0M) tev::mainFunc(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&)  (in tev) + 6768  [0x104311d80]  main.cpp:433
      +           ! :     |                       1 (25.0M) main  (in tev) + 820  [0x104314bd8]  main.cpp:494
      +           ! :     |                         1 (25.0M) start  (in libdyld.dylib) + 4  [0x1857d1430]
      +           ! :     1 (25.0M) tev::mainFunc(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&)  (in tev) + 6792  [0x104311d98]  main.cpp:435
      +           ! :       1 (25.0M) main  (in tev) + 820  [0x104314bd8]  main.cpp:494
      +           ! :         1 (25.0M) start  (in libdyld.dylib) + 4  [0x1857d1430]
      +           ! 1 (25.0M) -[MNVGcontext renderFlush]  (in tev) + 568  [0x1043cae64]  nanovg_mtl.m:1405
      +           !   1 (25.0M) nvgEndFrame  (in tev) + 40  [0x1043a5148]  nanovg.c:396
      +           !     1 (25.0M) nanogui::Screen::initialize(GLFWwindow*, bool)  (in tev) + 624  [0x1043dac44]  screen.cpp:505
      +           !       1 (25.0M) nanogui::Screen::Screen(nanogui::Array<int, 2ul> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, bool, bool, bool, bool, bool, bool, unsigned int, unsigned int)  (in tev) + 976  [0x1043da83c]  screen.cpp:406
      +           !         1 (25.0M) tev::ImageViewer::ImageViewer(std::__1::shared_ptr<tev::BackgroundImagesLoader> const&, bool, bool, bool)  (in tev) + 144  [0x1042e9a04]  ImageViewer.cpp:33
      +           !           1 (25.0M) tev::mainFunc(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&)  (in tev) + 6768  [0x104311d80]  main.cpp:433
      +           !             1 (25.0M) main  (in tev) + 820  [0x104314bd8]  main.cpp:494
      +           !               1 (25.0M) start  (in libdyld.dylib) + 4  [0x1857d1430]
      +           1 (25.0M) -[CAMetalLayer nextDrawable]  (in QuartzCore) + 640  [0x18bb60d44]
      +             1 (25.0M) nanogui::metal_window_next_drawable(void*)  (in tev) + 88  [0x1043d3f94]  darwin.mm:178
      +               1 (25.0M) nanogui::Screen::draw_setup()  (in tev) + 52  [0x1043dafd8]  screen.cpp:573
      +                 1 (25.0M) nanogui::Screen::draw_all()  (in tev) + 40  [0x1043db0a4]  screen.cpp:618
      +                   1 (25.0M) nanogui::mainloop(float)  (in tev) + 520  [0x1043d43f0]  common.cpp:185
      +                     1 (25.0M) tev::mainFunc(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&)  (in tev) + 7108  [0x104311ed4]  main.cpp:449
      +                       1 (25.0M) main  (in tev) + 820  [0x104314bd8]  main.cpp:494
      +                         1 (25.0M) start  (in libdyld.dylib) + 4  [0x1857d1430]

After a few hide/reveal cycles, total memory use is up to 397MB and there have been 5 more allocations along that call stack, now accounting for 275MB.

      11 (275M) CONTENT:  VM: IOSurface
      + 11 (275M) IOSurfaceClientCreate  (in IOSurface) + 0  [0x18d42b4d4]
      +   11 (275M) -[IOSurface initWithProperties:]  (in IOSurface) + 80  [0x18d42b490]
      +     11 (275M) CA::SurfaceUtil::CAIOSurfaceCreate(unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long long, CA::SurfaceUtil::SurfaceAlignment, __CFString const*)  (in QuartzCore) + 12264  [0x18bc18ca0]
      +       11 (275M) CA::Render::create_iosurface_with_pixel_format(unsigned int, unsigned int, unsigned int, unsigned int, unsigned long long, __CFString const*)  (in QuartzCore) + 288  [0x18bb61700]
      +         11 (275M) get_unused_drawable(_CAMetalLayerPrivate*, bool, bool)  (in QuartzCore) + 412  [0x18bb61404]
      +           10 (250M) -[CAMetalLayer nextDrawable]  (in QuartzCore) + 1296  [0x18bb60fd4]
      +           ! 9 (225M) nanogui::metal_window_next_drawable(void*)  (in tev) + 88  [0x1043d3f94]  darwin.mm:178
      +           ! : 9 (225M) nanogui::Screen::draw_setup()  (in tev) + 52  [0x1043dafd8]  screen.cpp:573
      +           ! :   9 (225M) nanogui::Screen::draw_all()  (in tev) + 40  [0x1043db0a4]  screen.cpp:618
      +           ! :     7 (175M) nanogui::mainloop(float)  (in tev) + 520  [0x1043d43f0]  common.cpp:185
      +           ! :     | 7 (175M) tev::mainFunc(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&)  (in tev) + 7108  [0x104311ed4]  main.cpp:449
      +           ! :     |   7 (175M) main  (in tev) + 820  [0x104314bd8]  main.cpp:494
      +           ! :     |     7 (175M) start  (in libdyld.dylib) + 4  [0x1857d1430]
      +           ! :     1 (25.0M) nanogui::Screen::resize_event(nanogui::Array<int, 2ul> const&)  (in tev) + 76  [0x1043db614]  screen.cpp:728
      +           ! :     | 1 (25.0M) nanogui::Screen::resize_callback_event(int, int)  (in tev) + 152  [0x1043dc380]  screen.cpp:906
      +           ! :     |   1 (25.0M) -[GLFWWindowDelegate windowDidResize:]  (in tev) + 236  [0x104390800]  cocoa_window.m:267
      +           ! :     |     1 (25.0M) __CFNOTIFICATIONCENTER_IS_CALLING_OUT_TO_AN_OBSERVER__  (in CoreFoundation) + 28  [0x1858a5f08]
      +           ! :     |       1 (25.0M) ___CFXRegistrationPost_block_invoke  (in CoreFoundation) + 52  [0x185947e5c]
      +           ! :     |         1 (25.0M) _CFXRegistrationPost  (in CoreFoundation) + 456  [0x185947dc8]
      +           ! :     |           1 (25.0M) _CFXNotificationPost  (in CoreFoundation) + 720  [0x1858749d8]
      +           ! :     |             1 (25.0M) -[NSNotificationCenter postNotificationName:object:userInfo:]  (in Foundation) + 64  [0x1866003c0]
      +           ! :     |               1 (25.0M) -[NSWindow _setFrameCommon:display:fromServer:]  (in AppKit) + 3156  [0x1880f6ac0]
      +           ! :     |                 1 (25.0M) _glfwPlatformSetWindowSize  (in tev) + 292  [0x104392b60]  cocoa_window.m:1055
      +           ! :     |                   1 (25.0M) tev::ImageViewer::ImageViewer(std::__1::shared_ptr<tev::BackgroundImagesLoader> const&, bool, bool, bool)  (in tev) + 13736  [0x1042ecf1c]  ImageViewer.cpp:455
      +           ! :     |                     1 (25.0M) tev::mainFunc(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&)  (in tev) + 6768  [0x104311d80]  main.cpp:433
      +           ! :     |                       1 (25.0M) main  (in tev) + 820  [0x104314bd8]  main.cpp:494
      +           ! :     |                         1 (25.0M) start  (in libdyld.dylib) + 4  [0x1857d1430]
      +           ! :     1 (25.0M) tev::mainFunc(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&)  (in tev) + 6792  [0x104311d98]  main.cpp:435
      +           ! :       1 (25.0M) main  (in tev) + 820  [0x104314bd8]  main.cpp:494
      +           ! :         1 (25.0M) start  (in libdyld.dylib) + 4  [0x1857d1430]
      +           ! 1 (25.0M) -[MNVGcontext renderFlush]  (in tev) + 568  [0x1043cae64]  nanovg_mtl.m:1405
      +           !   1 (25.0M) nvgEndFrame  (in tev) + 40  [0x1043a5148]  nanovg.c:396
      +           !     1 (25.0M) nanogui::Screen::initialize(GLFWwindow*, bool)  (in tev) + 624  [0x1043dac44]  screen.cpp:505
      +           !       1 (25.0M) nanogui::Screen::Screen(nanogui::Array<int, 2ul> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, bool, bool, bool, bool, bool, bool, unsigned int, unsigned int)  (in tev) + 976  [0x1043da83c]  screen.cpp:406
      +           !         1 (25.0M) tev::ImageViewer::ImageViewer(std::__1::shared_ptr<tev::BackgroundImagesLoader> const&, bool, bool, bool)  (in tev) + 144  [0x1042e9a04]  ImageViewer.cpp:33
      +           !           1 (25.0M) tev::mainFunc(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&)  (in tev) + 6768  [0x104311d80]  main.cpp:433
      +           !             1 (25.0M) main  (in tev) + 820  [0x104314bd8]  main.cpp:494
      +           !               1 (25.0M) start  (in libdyld.dylib) + 4  [0x1857d1430]
      +           1 (25.0M) -[CAMetalLayer nextDrawable]  (in QuartzCore) + 640  [0x18bb60d44]
      +             1 (25.0M) nanogui::metal_window_next_drawable(void*)  (in tev) + 88  [0x1043d3f94]  darwin.mm:178
      +               1 (25.0M) nanogui::Screen::draw_setup()  (in tev) + 52  [0x1043dafd8]  screen.cpp:573
      +                 1 (25.0M) nanogui::Screen::draw_all()  (in tev) + 40  [0x1043db0a4]  screen.cpp:618
      +                   1 (25.0M) nanogui::mainloop(float)  (in tev) + 520  [0x1043d43f0]  common.cpp:185
      +                     1 (25.0M) tev::mainFunc(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&)  (in tev) + 7108  [0x104311ed4]  main.cpp:449
      +                       1 (25.0M) main  (in tev) + 820  [0x104314bd8]  main.cpp:494
      +                         1 (25.0M) start  (in libdyld.dylib) + 4  [0x1857d1430]

Whether it's a nanogui bug or a bug in tev's usage of nanogui, I have no idea!

(For posterity, this was measured by running tev with the MallocStackLogging env variable set and then running malloc_history [TEV-PID] -callTree -invert -showContent to dump out stack traces of all of the allocations.)

mmp commented 2 years ago

(tev doesn't seem to ever call metal_present_and_release_drawable(); I'm not familiar with nanogui usage, but it seems fishy that it repeatedly acquires them but doesn't release them.)

mmp commented 2 years ago

Actually now I see that it isn't tev calling metal_window_next_drawable but that it's a nanogui class, so nevermind, perhaps.

Tom94 commented 2 years ago

After some additional digging, it seems that merely refreshing the screen is causing the leak. I can make memory usage climb by just opening tev and rapidly moving the mouse.

Image updates seem to be an innocent bystander that just happens to cause extra screen refreshes.

This seems to be related to nanogui's usage of the Metal backend (perhaps even nanovg_metal), just like your stack traces indicate. Thanks for creating these, by the way! When compiling with OpenGL, the leak disappears, but so does EDR support. :(

I've given the nanogui/nanovg Metal code a cursory glance and think I narrowed it down somewhat (see https://github.com/mitsuba-renderer/nanogui/issues/93 ), but this is starting to leave my bubble of expertise. Perhaps Wenzel has an idea.

Tom94 commented 2 years ago

nanogui is fixed now -- I can no longer reproduce the memory leak. Closing. :)

Thanks again for bringing this up and helping so much in diagnosing!

mmp commented 2 years ago

Confirmed fixed here as well!