Closed noahfalk closed 3 years ago
Tagging subscribers to this area: @tommcdon Notify danmosemsft if you want to be subscribed.
DAC implementation/APIs (#13735, dotnet/runtime#13736 and dotnet/runtime#35258)
@tommcdon : I would like to strongly emphasize we need to fix this in particular before 5.0 RTM. As long as the dac produces correct answers out of its dac/sos interface we can light up a lot of "diagnostic tools of last resort" (sos/clrmd/mex) to solve any issues in servicing that arise from the original feature. Concrete support in SOS and other tools can wait as long as we are sure the dac implementation is solid.
I'd be happy to help test this part of the change via clrmd when work is getting close to finished in the dac side of things.
PerfView - confirm with @VSadov, @Maoni0, and @brianrob that we are good mex/sosex/psscor?
mex and PerfView uses ClrMD for diagnostics. If ClrMD lights up this scenario PerfView will light up too. ClrMD is blocked on this item:
DAC implementation/APIs (#13735, dotnet/runtime#13736 and dotnet/runtime#35258)
psscor likely is blocked on that too, not sure of the status of sosex these days.
@leculver just making sure I understand - you are saying that in order to "light up" POH in ClrMD you need those items done, but today ClrMD is not gonna crash if it's run against 5.0 builds with POH in them, correct? it's important to distinguish those 2 scenarios - it would be bad if a diagnostics tool actually crashes if there are 5 generations instead of 4; but it's not bad if it simply doesn't recognize the existence of the 5th gen therefore cannot give you diagnostics info on it (but you still get info on the previous 4 gens) 'cause that support can be added incrementally.
you are saying that in order to "light up" POH in ClrMD you need those items done, but today ClrMD is not gonna crash if it's run against 5.0 builds with POH in them, correct?
I haven't checked the behavior of 1.1, but at minimum I can tell you that ClrMD 2.0 would fail here during ClrRuntime creation: https://github.com/microsoft/clrmd/blob/a89a02c02c8ffa7da542753e844a67478e2b4def/src/Microsoft.Diagnostics.Runtime/src/Builders/HeapBuilder.cs#L44-L45. That's a simple fix, but I'm not sure if we would fall over elsewhere when we encounter valid objects, which don't lie on the gc heap as far as ClrMD knows. I'd also worry that !verifyheap (and tools similar to it) will fail/report errors when the heap is valid and intact.
My overall takeaway here is "we have no idea how diagnostics tools will react to this change, we should probably test them to make sure there's nothing that's going to fail/throw exceptions".
it's not bad if it simply doesn't recognize the existence of the 5th gen therefore cannot give you diagnostics info on it (but you still get info on the previous 4 gens) 'cause that support can be added incrementally.
@Maoni0: No, my position is the opposite of that. We must fix the dac piece of this before shipping 5.0. Specifically I'm referring to adding support through new ISOSDacInterfaces. I'm fine if some tools (perfview, sos, etc) are incrementally updated, but fixing the dac to enumerate correct values is critical before we ship.
Here's a few reasons why:
As an example of 3, a few weeks back I got pulled into an Azure IOT escalation because WinDbg, PerfView, and Visual Studio all couldn't root cause answer a simple "managed OOM" crash. None of those tools were working due to 3 unrelated bugs, one in each tool. This was an issue where "!dumpheap -stat" would have solved it, but instead it was a pri-0 escalation to CLR. I got called in on a Friday afternoon to root cause their issue, and thankfully ClrMD processed the dump correctly. If we didn't have at least one diagnostics tool working in their scenario, I'd have had to work through the weekend.
If their problem happened to be large chunks of memory on the pinned object heap we'd be back in that situation of heroics to solve basic problems. Even absent a pri-0 escalation, bad tools means more developers (internal and external) have to ask CLR for help debugging when before they would otherwise be self sufficient.
To be clear, I am fine with incrementally adding support to tools as long as the dac piece works and is well tested. SOS, perfview, clrmd, etc can be fixed later as long as we are 100% confident we can implement support for this feature after we ship without needing to service mscordaccore.
I've marked this issue blocking-release
for all of the reasons listed above. I specifically mean the dac portion of this work, and not the entire enumerated list in the top. If sufficiently informed folks disagree, please feel free to remove that tag. (cc @jkotas)
@leculver I agree we should fix DAC since that's a change in the runtime. my comment about "incremental changes" was also for tooling (which includes ClrMd in this case...I just meant everything that's not in the runtime since we can't ship runtime OOB).
@davmason is there any work left for this issue?
Confirmed with David that all known runtime issues are resolved, moving to the diagnostics repo to track documentation and out of band tooling changes
thank you very much, @davmason!
@davmason @brianrob @cshung has reported that POH sizes in perfview are not being reported correctly. This may be related to microsoft/perfview/issues/1146
See https://github.com/microsoft/perfview/pull/1295 that fixes TraceEvent and PerfView with respect to the heap sizes reporting.
@davmason can this issue be closed or is there outstanding work?
Recently the GC team added a new Pinned Object Heap. This new heap adds a conceptual 5th generation to the existing four: gen0, gen1, gen2, LOH (large object heap), and now POH (pinned object heap). The fact there were exactly four generations has been broadly exposed to diagnostic tools so changing to five has far reaching ramifications. At a minimum we don't want changes to the runtime to break critical tools which requires testing. More ideally we also want any necessary API changes, implementation changes, documentation changes, and coordination with the owners of external libraries/tools so that the .NET diagnostic tools ecosystem remains effective at memory investigations.
Work items: