dotnet / diagnostics

This repository contains the source code for various .NET Core runtime diagnostic tools and documents.
MIT License
1.18k stars 354 forks source link

Address Diagnostic impact of Pinned Object Heap #1408

Closed noahfalk closed 3 years ago

noahfalk commented 4 years ago

Recently the GC team added a new Pinned Object Heap. This new heap adds a conceptual 5th generation to the existing four: gen0, gen1, gen2, LOH (large object heap), and now POH (pinned object heap). The fact there were exactly four generations has been broadly exposed to diagnostic tools so changing to five has far reaching ramifications. At a minimum we don't want changes to the runtime to break critical tools which requires testing. More ideally we also want any necessary API changes, implementation changes, documentation changes, and coordination with the owners of external libraries/tools so that the .NET diagnostic tools ecosystem remains effective at memory investigations.

Work items:

ghost commented 4 years ago

Tagging subscribers to this area: @tommcdon Notify danmosemsft if you want to be subscribed.

leculver commented 4 years ago

DAC implementation/APIs (#13735, dotnet/runtime#13736 and dotnet/runtime#35258)

@tommcdon : I would like to strongly emphasize we need to fix this in particular before 5.0 RTM. As long as the dac produces correct answers out of its dac/sos interface we can light up a lot of "diagnostic tools of last resort" (sos/clrmd/mex) to solve any issues in servicing that arise from the original feature. Concrete support in SOS and other tools can wait as long as we are sure the dac implementation is solid.

I'd be happy to help test this part of the change via clrmd when work is getting close to finished in the dac side of things.

PerfView - confirm with @VSadov, @Maoni0, and @brianrob that we are good mex/sosex/psscor?

mex and PerfView uses ClrMD for diagnostics. If ClrMD lights up this scenario PerfView will light up too. ClrMD is blocked on this item:

DAC implementation/APIs (#13735, dotnet/runtime#13736 and dotnet/runtime#35258)

psscor likely is blocked on that too, not sure of the status of sosex these days.

Maoni0 commented 4 years ago

@leculver just making sure I understand - you are saying that in order to "light up" POH in ClrMD you need those items done, but today ClrMD is not gonna crash if it's run against 5.0 builds with POH in them, correct? it's important to distinguish those 2 scenarios - it would be bad if a diagnostics tool actually crashes if there are 5 generations instead of 4; but it's not bad if it simply doesn't recognize the existence of the 5th gen therefore cannot give you diagnostics info on it (but you still get info on the previous 4 gens) 'cause that support can be added incrementally.

leculver commented 4 years ago

you are saying that in order to "light up" POH in ClrMD you need those items done, but today ClrMD is not gonna crash if it's run against 5.0 builds with POH in them, correct?

I haven't checked the behavior of 1.1, but at minimum I can tell you that ClrMD 2.0 would fail here during ClrRuntime creation: https://github.com/microsoft/clrmd/blob/a89a02c02c8ffa7da542753e844a67478e2b4def/src/Microsoft.Diagnostics.Runtime/src/Builders/HeapBuilder.cs#L44-L45. That's a simple fix, but I'm not sure if we would fall over elsewhere when we encounter valid objects, which don't lie on the gc heap as far as ClrMD knows. I'd also worry that !verifyheap (and tools similar to it) will fail/report errors when the heap is valid and intact.

My overall takeaway here is "we have no idea how diagnostics tools will react to this change, we should probably test them to make sure there's nothing that's going to fail/throw exceptions".

it's not bad if it simply doesn't recognize the existence of the 5th gen therefore cannot give you diagnostics info on it (but you still get info on the previous 4 gens) 'cause that support can be added incrementally.

@Maoni0: No, my position is the opposite of that. We must fix the dac piece of this before shipping 5.0. Specifically I'm referring to adding support through new ISOSDacInterfaces. I'm fine if some tools (perfview, sos, etc) are incrementally updated, but fixing the dac to enumerate correct values is critical before we ship.

Here's a few reasons why:

  1. Once we ship a build with a change we actually can't incrementally add support for anything that requires dac changes. Shiproom has little patience even for diagnostic only changes. If we don't add dac support for this feature before shipping we will never be able to debug builds in the wild with this generation change.
  2. We need to take time to test functionality to make sure we didn't break existing tools even if we did no diagnostics work. ClrMD 2.0 is one example, !verifyheap is another very important thing that needs testing.
  3. Making diagnostics worse than a set baseline of functionality makes other's devs lives harder. We've had "simple" diagnostics issues cause pri-0 "drop everything and work through the weekend" escalations to CLR. As a rule, we should never ship a build of .Net Core with regressed diagnostic functionality unless we have a really good reason to.

As an example of 3, a few weeks back I got pulled into an Azure IOT escalation because WinDbg, PerfView, and Visual Studio all couldn't root cause answer a simple "managed OOM" crash. None of those tools were working due to 3 unrelated bugs, one in each tool. This was an issue where "!dumpheap -stat" would have solved it, but instead it was a pri-0 escalation to CLR. I got called in on a Friday afternoon to root cause their issue, and thankfully ClrMD processed the dump correctly. If we didn't have at least one diagnostics tool working in their scenario, I'd have had to work through the weekend.

If their problem happened to be large chunks of memory on the pinned object heap we'd be back in that situation of heroics to solve basic problems. Even absent a pri-0 escalation, bad tools means more developers (internal and external) have to ask CLR for help debugging when before they would otherwise be self sufficient.

To be clear, I am fine with incrementally adding support to tools as long as the dac piece works and is well tested. SOS, perfview, clrmd, etc can be fixed later as long as we are 100% confident we can implement support for this feature after we ship without needing to service mscordaccore.

leculver commented 4 years ago

I've marked this issue blocking-release for all of the reasons listed above. I specifically mean the dac portion of this work, and not the entire enumerated list in the top. If sufficiently informed folks disagree, please feel free to remove that tag. (cc @jkotas)

Maoni0 commented 4 years ago

@leculver I agree we should fix DAC since that's a change in the runtime. my comment about "incremental changes" was also for tooling (which includes ClrMd in this case...I just meant everything that's not in the runtime since we can't ship runtime OOB).

tommcdon commented 4 years ago

@davmason is there any work left for this issue?

tommcdon commented 4 years ago

Confirmed with David that all known runtime issues are resolved, moving to the diagnostics repo to track documentation and out of band tooling changes

Maoni0 commented 4 years ago

thank you very much, @davmason!

tommcdon commented 4 years ago

@davmason @brianrob @cshung has reported that POH sizes in perfview are not being reported correctly. This may be related to microsoft/perfview/issues/1146

cshung commented 4 years ago

See https://github.com/microsoft/perfview/pull/1295 that fixes TraceEvent and PerfView with respect to the heap sizes reporting.

tommcdon commented 3 years ago

@davmason can this issue be closed or is there outstanding work?