Ideas On How To Identify "Build Stalls"?

aolszowka commented 4 years ago

Somewhat related to this tool (I can also move it to msbuild if need be) but in working on https://github.com/aolszowka/NUnit.MSBuild I am seeing a lot of "Build Stalls".

By Stall I mean: I'll see the NUnit task kick off (via Process Explorer), and the build do no useful work until the NUnit Process completes, at which time I'll see high core utilization again. This to me indicates a dependency hedge. The worst part is it does not seem to be consistent.

As far as I can tell there shouldn't be a dependency on the project (it is a unit test), but I am trying to understand if there is some way I can use the structured log to better understand if agents (all 16 of them) are getting locked up on some shared resource or project reference.

I feel like I need something like ETW, To do some debugging like this https://randomascii.wordpress.com/2019/10/20/63-cores-blocked-by-seven-instructions/ (Bruce Dawson BTW is a GLADIATOR at this, next to Raymond Chen he's the second person I most want to meet IRL).

I am not yet skilled enough to ETW this thing, so I am hoping for the next best which is MSBuildStructuredLog helping me identify the culprit.

KirillOsenkov commented 4 years ago

See if this can help: https://github.com/rainersigwald/ParallelBuildDebuggingLogger

I want to build it in at some point, but for now it's a separate tool and you can try opening the traces in chrome.

aolszowka commented 4 years ago

Thank you for the link, I have not attempted that Logger yet but if I do I will let you know.

I ended up reevaluating my assumption that there was no dependency hedge on it, and much to my dismay I was lying, I have made improvements to MsBuildProjectReferenceDependencyGraph (specifically https://github.com/aolszowka/MsBuildProjectReferenceDependencyGraph/issues/13) to allow me to see Assembly/PackageReferences. Then working backwards from the graph produced by that tool, looked for places where tests that had dependencies on the NUnit library referenced each other. That particular tool, that crawled the graph created by MSBPRDG, will be uploaded at some point to the NUnit.MSBuild project.

This showed a fairly large hedge (~120 projects) that had existed for several years, that thankfully could be removed. As per murphy the dependent project of course was one of our largest Unit Test assemblies. I am hopeful that this is the cause of the stall.

I would be interested in something that showed a more interactive graph of what MSBuild actually decided the build order was and to show stalls with a parallel build. I am trying to think how to best visualize it. You would need to represent a number of pieces of information:

The Time - Probably as a straight arrow
Each Build Node/Process - Aligned on the Left
Each Project As it Built and Time Scale
How to represent dependencies - I think this is the hard one

I would imagine something like this:

While this shows places where there are gaps in the build nodes doing "useful work" I don't think it really clearly shows why this is the case, some type of lines to indicate dependency might help, but also might get out of control.

I think the other problem would be the limitations of the MSBuild Scheduler, which IIRC without /graph simply walks down a project tree until it can't build anything, scheduling parallel processes as it walks (at least based on the best knowledge we have here https://github.com/dotnet/msbuild/issues/5058).

By no means an easy problem to solve and I have no good ideas off the top of my head.

KirillOsenkov / MSBuildStructuredLog

Ideas On How To Identify "Build Stalls"? #383