dotnet / winforms

Windows Forms is a .NET UI framework for building Windows desktop applications.
MIT License
4.35k stars 962 forks source link

Massive GDI (region) leak. Help needed. #11334

Open kirsan31 opened 4 months ago

kirsan31 commented 4 months ago

.NET version

.Net7 and .Net8

Did it work in .NET Framework?

Yes

Did it work in any of the earlier releases of .NET Core or .NET 5+?

We didn't see these problems before .Net7.

Issue description

For several month we are trying to investigate huge GDI (regions) leak in our app. This leak is critical because can reach GDI limit (10k) in one day.

  1. 479 leaked regions due to redrawing (almost all are system calls):
    1 1.csv

  2. 84 leaked regions due to close all opened child mdi forms. Closing all windows is done through the menu, when opened, the child menu is filled with open forms (15 in our case). Big part of it is leaking in MenuStrip -> Control.SetBoundsCore -> SetWindowPos. Call stacks ToolStripDropDownItem.OnDropDownOpened -> EtwWriteTransfer and ToolStripDropDownItem.OnDropDownClosed -> EtwWriteTransfer are full. 01 02 03 04 2.csv The logic in tsmiWidow_DropDownOpened is populate childe DropDownItems with 15 (in this situation) items. The logic in tsmiWidow_DropDownClosed is clear all items previously added:

    while (tmi.DropDownItems.Count > 0) 
    {
    ToolStripItem ti = tmi.DropDownItems[tmi.DropDownItems.Count - 1];
    ti.MouseDown -= ActivateW;
    var img = ti.Image;
    ti.Dispose(); // will remove on dispose
    img?.Dispose();
    }

    There are no managed leaks, all objects were properly deleted (this is not always 100% true I explain below). It is very strange that the leak occurs both when adding and removing elements. In 99% of cases everything works completely correctly. While researching I found a small managed leak here: timer1 timer2 This small managed leak is reproducible and can't lead to such catastrophic consequences. Can easily be fixed with WeakReference here (I will open a PR later): https://github.com/dotnet/winforms/blob/7504692fbb7ba100c5b098ce5c47e161450a98e2/src/System.Windows.Forms/src/System/Windows/Forms/Input/MouseHoverTimer.cs#L10-L11

  3. 800+ (maybe not all of 937 are leaked) leaked regions including points 1. and 2. This is also done through the menu, and one submenu is filled and cleared with 43 items in our case, exactly the same as in point 2. These are all the regions that the performance HUD detected after closing all child windows. In this state (in normal condition) our app consume about 100 GDI objects and 10 of them are regions. And in this situation there were about 5500 GDI objects and 5400 of them are regions. 001 002 003 004 all.csv

OS: Windows 10 Pro for Workstations 22H2.

In conclusion, it seems to me that the problem is somewhere in Winforms, or even in the OS. Any assistance in further investigation is greatly appreciated. 🙏

Steps to reproduce

--

JeremyKuhne commented 4 months ago

@kirsan31 I'm very interested in looking more deeply at this. Unfortunately, I'm tied up for a number of weeks on critical BinaryFormatter work. If you have some mitigations like the WeakReference piece, I'm happy to take a look at PRs there.

When the other work is done, I can try to see what I might be able to find out.

Also, have you been able to repro the same thing with .NET 9?

kirsan31 commented 4 months ago

@JeremyKuhne

Also, have you been able to repro the same thing with .NET 9?

I can’t reproduce this at all (no matter how hard I try). This only happens on a working machine and only during work. Moreover, leaks have never started immediately after launching the application, only after a few days. Therefore, my ability to experiment there is very limited and I cannot use .Net9 :( I will continue my experiments...

@weltkante sorry to bother you. but may be you have some ideas?


I'm tied up for a number of weeks on critical BinaryFormatter work.

By the way, I have a question on this topic that no one has answered yet.

weltkante commented 4 months ago

@weltkante sorry to bother you. but may be you have some ideas?

No problem, unfortunately this is nothing I've come across in the past. So far I've always been able to rely on managed leaks and memory snapshots/dumps to compare, or being able to reproduce the problem locally and do a time travel debug trace for inspecting the unmanaged leaks. Seems like neither is an option for you.

If I had to diagnose this issue I'd probably try to isolate what effects it:

weltkante commented 4 months ago

Oh, and make sure the finalizer thread is not stuck on something (look at a few dumps in a debugger after leaks started and check that the thread is idle or at least differs between dumps). Depending on your tooling finalizable objects may not show up as leaks in your managed analysis, but if the finalizer thread is hanging and can't finalize things anymore that may end up this way.

kirsan31 commented 4 months ago

@weltkante

make a dump of the leaking process from the task manager and check for any unexpected 3rd party dlls that may have injected themselves into the process

Nice case, just checked - everything is ok here. But the probability was extremely low because... A work PC has very high restrictions on installed software and Internet use.

have a second machine being setup and used. if it never happens on another machine the machine may be simply broken or the windows installation has been corrupted if possible consider some sort of virtualization for the second setup for easy transfer between machines. I've been using hyperv based windows sandbox scripts lately to setup isolated applications in tricky cases, but there may be alternatives that are easier

Due to the specifics, there is no way for us to configure either a second physical machine or a virtual one. And users won’t approve of this, it’s easier for them to restart the application every few days :)

Oh, and make sure the finalizer thread is not stuck on something (look at a few dumps in a debugger after leaks started and check that the thread is idle or at least differs between dumps). Depending on your tooling finalizable objects may not show up as leaks in your managed analysis, but if the finalizer thread is hanging and can't finalize things anymore that may end up this way.

There's nothing wrong with that. Because other objects are finalized normally, and most regions too. Also in dumps after GC, ready for finalization objects are empty and nothing extraordinary in dead objects.

Thank you for your attention any way 🙏

kirsan31 commented 4 months ago

A small update on what I found out.

weltkante commented 4 months ago

As a result, native functions leak, very often these are SetWindowPos.

Just as a side note to avoid other people reading this drawing the wrong conclusions: SetWindowPos can trigger a lot messages, including callbacks into managed code, so its unlikely to be the direct cause of the leak

kirsan31 commented 4 months ago

SetWindowPos can trigger a lot messages, including callbacks into managed code, so its unlikely to be the direct cause of the leak

Of course that's not the direct cause of the leak. And SetWindowPos really trigger a lot messages, but no one callback to managed code: image I point this method like the most common last managed method in call stack.

weltkante commented 4 months ago

but no one callback to managed code

Sounds weird, all WinForms controls should, at the very least, go through the managed message handler of the control. And SetWindowPos can (and usually does) trigger resize and redraw logic, both of which can have managed event handlers that need to be dispatched too, even if they are empty.

Anyways, just meant to say that seeing this method as call root doesn't mean the problem is guaranteed to be on the native code.

lonitra commented 1 month ago

@kirsan31 Do you think you could provide consistent repro for us to investigate this?

kirsan31 commented 1 month ago

@kirsan31 Do you think you could provide consistent repro for us to investigate this?

I hope so... Currently the issue still exist (very sporadically) and I can't get the root cause :(

JeremyKuhne commented 1 month ago

@kirsan31 as soon as we get actionable stuff here we can assign it to whatever the current release is.

kirsan31 commented 1 month ago

Once again we were able to directly catch leaks and use the performance HUD. What we found out:

This time I copied (in several approaches) all the stacks for two applications from the performance HUD (If necessary, I will provide them all). But they don’t give anything new - all the stacks are some kind of drawing of a menu/tooltip, etc., which always end with EtwWriteTransfer. Example:

https://github.com/user-attachments/assets/08ce1d7c-a925-45f4-991d-ba0c3105774c

From all this and the fact that before .Net7 such behavior was not observed, I have only two possible assumptions - either this is somehow a Windows bug/corruption (appeared with some kind of system update), or, after all, a regression in .Net.

Does anyone have any other ideas or tips?

P.S. why do all messages go through office component ComponentManager.Microsoft.Office.IMsoComponentManager.FPushMessageLoop? //cc @JeremyKuhne

weltkante commented 1 month ago

P.S. why do all messages go through office component ComponentManager.Microsoft.Office.IMsoComponentManager.FPushMessageLoop?

Thats just an interface for Office/VisualStudio compatibility, the naming is historical, WinForms uses its own implementation if Office/VS is not detected to provide the interface implementation.

JeremyKuhne commented 1 month ago

omponentManager.FPushMessageLoop?

Thats just an interface for Office/VisualStudio compatibility, the naming is historical, WinForms uses its own implementation if Office/VS is not detected to provide the interface implementation.

Note that this will now be turned off by default (.NET 9). It can be turned back on with the "Switch.System.Windows.Forms.EnableMsoComponentManager" switch.

Even with the stub it was a fair amount of overhead for message processing. As we had to rewrite all of our COM for ComWrappers we took the opportunity to simplify the message loop.