KirillOsenkov / MSBuildStructuredLog

A logger for MSBuild that records a structured representation of executed targets, tasks, property and item values.
MIT License
1.47k stars 198 forks source link

30 seconds freeze when opening binlog created from VS #739

Closed real-mikhail closed 3 months ago

real-mikhail commented 10 months ago

Binlog is created on empty .NET solution via Project System Tools 2022 extension (Version: 1.0.2.2305901). Binlog size is 60 KB. In the snapshot I see that 30 seconds freeze is happening due to GC which is really weird since there are just 40 MBs of allocations.

Program version: 2.2.155

I'm attaching repro solution, binlog, snapshot (from dotTrace) (on Google Drive) and its screenshot:

its screenshot

KirillOsenkov commented 10 months ago

I can't reproduce this, when you load the binlog and press F5 to reload, is it fast the next time?

We are using Server GC to optimize for loading very large binlogs, but it might create longer GC pauses. Very strange though for such a small binlog.

Could it be the antivirus scanning it? Try excluding a directory from scanning and opening the binlog from that directory.

real-mikhail commented 10 months ago

I can't reproduce this, when you load the binlog and press F5 to reload, is it fast the next time?

No, I still have the similar freeze.

Could it be the antivirus scanning it? Try excluding a directory from scanning and opening the binlog from that directory.

No, that didn't help. I added both folder with LogViewer and folder with binlog to Windows Defender exclusions. I don't have other software which could interfere like that. And I also checked Task Manager - there is nothing suspicious there, no other process is taking CPU resource during the freeze.

I could also note that in the past (like, couple months ago) I didn't see such problem. Though I don't know what version of LogViewer I used back then - there was no major changes in my laptop setup - only regular driver/OS updates and stuff like that.

I did some more profiling and I can see the following issue: Main thread is indeed waiting for GC to be done:

image

While another thread (supposedly the GC thread) is waiting for CPU (waiting to be scheduled by .NET runtime/OS):

image

KirillOsenkov commented 10 months ago

Try opening %localappdata%\MSBuildStructuredLogViewer\app-2.2.158\StructuredLogViewer.exe.config and edit line 15 to disable server GC:

image

See if it changes anything. This is very weird and I don't know how else to investigate this.

KirillOsenkov commented 10 months ago

Also, does it happen on a different machine? Does it happen with different binlogs? Can anyone else reproduce this other than you?

real-mikhail commented 10 months ago

Try opening %localappdata%\MSBuildStructuredLogViewer\app-2.2.158\StructuredLogViewer.exe.config and edit line 15 to disable server GC:

Disabling server GC fixes the issue.

Does it happen with different binlogs?

Yes, it happens with all binlogs and repro is stable.

Can anyone else reproduce this other than you?

I checked it on colleagues machine - it doesn't reproduce on their machine.

We also discovered that setting <GCNoAffinitize enabled="true" /> fixes the issue. So it is definitely an issue with OS scheduling of .NET GC threads.

KirillOsenkov commented 10 months ago

@maoni0 is this something you’d be interested to look into?

real-mikhail commented 10 months ago

Small update from me. I have got the same issue in another .NET application. Though it happens not always - it depends how application is started. If it is started from command line or from IDE - then I can observe the issue. If it is started from another .NET application (via Process.Start, for instance) - then the issue is not reproducible.

image

Still I have no clue why it happens on my machine. I didn't set any system-wide settings. I don't even know which settings could lead to such behaviour.

real-mikhail commented 10 months ago

It is possible that when starting program from another .NET application it inherits some environment variables similar to <GCNoAffinitize enabled="true" />.

Maoni0 commented 10 months ago

if the issue is fixed by setting noaffinitize to true, it usually indicates something else is preventing the GC threads from running. could you capture a ThreadTime trace with perfview that covers that period of time (doesn't have to be the whole 30s, but some of it would do)

perfview /ThreadTime /BufferSizeMB:4000 /CircularMB:4000 /nogui collect

real-mikhail commented 10 months ago

Hi. Sure, please see the attached snapshot: PerfViewData.etl.zip

(I had to disable Windows Defender's real-time protection though, otherwise I was getting "The requested resource is in use" ComException when starting perfview)

Maoni0 commented 10 months ago

thanks for the trace. so both GCs are induced. you can see this if you open the trace in perfview and look at the GCStats view

GC Index Pause Start  Trigger Reason Gen
9 9,452.43 Induced 2NI
10 34,260.24 Induced 2NI

and if you open the Any Stacks view and look at the GC/Triggered stacks you'll see they are triggered by

image image

so you might want to capture a trace to see if you see these induced GCs in the other scenarios.

and the reason why they take so long is there's a very high priority thread (priority 15) from System that runs some nvidia stuff (no symbols so we can't tell what it's actually doing but it's from nvlddmkm. I explained how to debug this kind of issues here: https://github.com/Maoni0/mem-doc/blob/master/doc/.NETMemoryPerformanceAnalysis.md#debugging-a-random-long-gc.

if you are currently using HW acceleration for gfx, you could try to switch to SW see if that helps. but otherwise that is the problem.

real-mikhail commented 10 months ago

Thanks for looking into it!

I tried to use both NVIDIA and Intel GPU (for this process) - the issue is reproducible in both cases. Also I tried to disable "Hardware-accelerated GPU scheduling" but that didn't help either.

I sent a bug report to NVIDIA so hopefully they will take a look at it. They don't have any public bug tracker so I cannot share a link or ticket number here.

I also wonder, if the application enables GC server mode but application is intended to be used on a client machine (with more varied and less controlled environment), whether it makes sense to also set GCNoAffinitize to true? Technically any application may be run with processor affinity set and higher priority - and it should not cause such a serious issues in .NET GC. Setting GCNoAffinitize would make .NET application more proofed to such situations.

Maoni0 commented 10 months ago

that's a valid argument. you could even say that if you specified the GCHeapCount config, GC should just make the default of the GCNoAffinitize config true instead of false. I would say this is really just due to historical reasons - Server GC is already used way less on client machines so we basically just depend on folks who do use it to do their own configuration. for example, VS now uses Server GC and specifies both the GCHeapCount and the GCNoAffinitize config (to true).

btw, I see you have 16 heaps so I'm guessing the GCHeapCount config is actually not specified. it doesn't seem like an app like this should need so many GC heaps.

btw, in a memory constraint scenario, eg, an app running in a container with a memory limit, GCNoAffinitize is by default true. I could change this default when the GCHeapCount config is specified as mentioned above, of course that would make it sort of a breaking change (if you believe perf changes can be categorized as breaking) for folks who did rely on it being false.

KirillOsenkov commented 10 months ago

The reason why I had GC.Collect() there in the first place is that sometimes we open huge binlogs, which make the commit memory grow to 50 GB (so you must have at least 64 GB RAM to open those binlogs on that machine). After a binlog is read, we no longer need a bunch of data structures, but the GC didn't happen immediately, so I was letting it know I'm done allocating and we can free up all that unnecessary memory after loading the binlog. We also found large binlogs open almost twice as fast with Server GC (19 seconds vs. 30 seconds), that's why we turned Server GC on.

real-mikhail commented 10 months ago

btw, I see you have 16 heaps so I'm guessing the GCHeapCount config is actually not specified.

Yes, that parameter is not specified.

We also found large binlogs open almost twice as fast with Server GC (19 seconds vs. 30 seconds), that's why we turned Server GC on.

Is it possible to experiment how changing GCNoAffinitize/GCHeapCount settings would affect the performance of opening those large binlogs? I don't have them, so I cannot check that.

KirillOsenkov commented 3 months ago

Hi, I'm inclined to close this as it's not really a bug in the viewer. Your external machine state was slowing down GC that normally happens fast.