dotnet / arcade

Tools that provide common build infrastructure for multiple .NET Foundation projects.
MIT License

Enable crash dumps in AzDo Build pipelines #14440

Open JulieLeeMSFT opened 5 months ago

JulieLeeMSFT commented 5 months ago

Recently, we had an intermittent crash in the runtime during the VMR build for preview 1, and it took multiple days to reproduce the issue and pinpoint its cause. Because there is currently no infrastructure for collecting crash dumps from AzDO build pipelines, this was an extremely painful process.

With a build as complex as the VMR, it is essential to make the system diagnosable and to have crash dump capability in AzDO builds.

It was especially painful to identify the exact VMR commit that introduced the regression. The VMR doesn't have a single commit that corresponds to a single commit from the runtime repo. A commit in the VMR represents a batch of commits, one for each repo flowing into installer. So it was not possible to simply check out commits in the VMR to identify the specific offending commit in runtime.

cc @markwilkie @agocke @jkotas @mthalman @MichaelSimons @hoyosjs @tommcdon

garath commented 5 months ago

Is the crash happening during the build or during tests? Which pipeline in particular are you interested in?

Helix does collect dumps so I'd like to get some details to understand why that wasn't working here. Some docs on how it works may be found here: https://github.com/dotnet/arcade/blob/b4e9225c6c2f9da42fbb611a5e8942a08476fe89/Documentation/Dumps/Dumps.md

agocke commented 5 months ago

This is about AzDO builds, so Helix doesn't help.

jkotas commented 5 months ago

Related / partial duplicate: https://github.com/dotnet/dnceng/issues/1290

riarenas commented 5 months ago

Would it help at all if we looked into this feature from 1ES? https://eng.ms/docs/cloud-ai-platform/devdiv/one-engineering-system-1es/1es-docs/1es-hosted-azure-devops-pools/hold-machine-for-debugging?

Extracting dumps is one of the scenarios that is specifically called out.

markwilkie commented 4 months ago

Another one to make sure we consider in triage @ilyas1974 and @garath

dougbu commented 3 months ago

> Would it help at all if we looked into this feature from 1ES? https://eng.ms/docs/cloud-ai-platform/devdiv/one-engineering-system-1es/1es-docs/1es-hosted-azure-devops-pools/hold-machine-for-debugging?
>
> Extracting dumps is one of the scenarios that is specifically called out.

Note that this option sounds costly because it means every build machine is held for a while after its build completes. It sounds like holding a machine only after a failure may eventually be implemented, though using the feature may remain expensive even then.

missymessa commented 1 month ago

This is a feature that would fit well in the Arcade SDK.
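
For crashes in the .NET runtime itself, one possible shape for such a feature is a thin wrapper around the build command that turns on the runtime's documented crash dump settings (`DOTNET_DbgEnableMiniDump`, `DOTNET_DbgMiniDumpType`, `DOTNET_DbgMiniDumpName`) and then reports any dumps that were written. The following is only a minimal sketch under that assumption. The helper names `crash_dump_env` and `run_with_dumps` are hypothetical and are not part of Arcade or AzDO, and whether these settings take effect depends on the runtime version and operating system in use.

```python
import os
import subprocess
from pathlib import Path


def crash_dump_env(dump_dir, base_env=None):
    """Return a copy of the environment with .NET crash dump collection enabled.

    Hypothetical helper: only the three DOTNET_* variable names come from the
    .NET runtime documentation; everything else here is illustrative.
    """
    env = dict(os.environ if base_env is None else base_env)
    dump_dir = Path(dump_dir)
    env["DOTNET_DbgEnableMiniDump"] = "1"  # ask the runtime to write a dump on crash
    env["DOTNET_DbgMiniDumpType"] = "4"    # 4 = full dump
    # %p is expanded by the runtime to the id of the crashing process
    env["DOTNET_DbgMiniDumpName"] = str(dump_dir / "crash.%p.dmp")
    return env


def run_with_dumps(command, dump_dir):
    """Run a build command and return (exit_code, list of dump files found)."""
    dump_dir = Path(dump_dir)
    dump_dir.mkdir(parents=True, exist_ok=True)
    result = subprocess.run(command, env=crash_dump_env(dump_dir))
    dumps = sorted(dump_dir.glob("*.dmp"))
    return result.returncode, dumps
```

In a pipeline, `dump_dir` could point at a folder that a later step publishes as a build artifact only when the build fails, for example something under the agent's artifact staging directory. That would avoid holding the machine after the build, at the cost of uploading potentially large dump files.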