dotnet / arcade

Tools that provide common build infrastructure for multiple .NET Foundation projects.
MIT License

Enable crash dumps in AzDo Build pipelines #14440

Open JulieLeeMSFT opened 10 months ago

JulieLeeMSFT commented 10 months ago

Recently, we had an intermittent crash in the runtime in the VMR build for preview 1, but it took multiple days to reproduce the issue and pinpoint what caused the crash. Since there is currently no infrastructure to collect crash dumps from AzDO build pipelines, it was an extremely painful process.

With a complex build such as the VMR, it is essential to make the system diagnosable and to have crash dump capability in AzDO builds.

It was especially painful to identify the exact VMR commit that introduced the regression. The VMR doesn't have a single commit that corresponds to a single commit from the runtime repo; a commit in the VMR represents a batch of commits, one for each repo flowing into installer. So it was not possible to simply check out commits in the VMR to identify the specific offending commit in runtime.

cc @markwilkie @agocke @jkotas @mthalman @MichaelSimons @hoyosjs @tommcdon

garath commented 10 months ago

Is the crash happening during the build or during tests? Which pipeline in particular are you interested in?

Helix does collect dumps, so I'd like to get some details to understand why that wasn't working here. Some docs on how it works can be found here: https://github.com/dotnet/arcade/blob/b4e9225c6c2f9da42fbb611a5e8942a08476fe89/Documentation/Dumps/Dumps.md

agocke commented 10 months ago

This is about AzDO builds, so Helix doesn't help.

jkotas commented 10 months ago

Related / partial duplicate: https://github.com/dotnet/dnceng/issues/1290

riarenas commented 9 months ago

Would it help at all if we looked into this feature from 1ES? https://eng.ms/docs/cloud-ai-platform/devdiv/one-engineering-system-1es/1es-docs/1es-hosted-azure-devops-pools/hold-machine-for-debugging?

Extracting dumps is one of the scenarios that is specifically called out.

markwilkie commented 9 months ago

Another one to make sure we consider in triage, @ilyas1974 and @garath.

dougbu commented 8 months ago

> Would it help at all if we looked into this feature from 1ES? https://eng.ms/docs/cloud-ai-platform/devdiv/one-engineering-system-1es/1es-docs/1es-hosted-azure-devops-pools/hold-machine-for-debugging?
>
> Extracting dumps is one of the scenarios that is specifically called out.

Note this option sounds costly because it means all build machines get held for a while after a build completes. It sounds like holding a machine only after a failure may eventually be implemented, though using the feature may remain expensive even then.

missymessa commented 6 months ago

This is a feature that would fit well in the Arcade SDK.

ericstj commented 4 months ago

Adding a ref-count to this. It would have been super useful for the recent bug we were chasing in 9.0, where the crash reproduced only in the build and not in any of the tests (because it involved R2R and a long-lived process / stress).

With the latest built-in crash support in .NET, I think we could make this work for everyone by having Arcade set it up at the build entrypoint and having the Arcade templates make sure the dumps get pulled into the published artifacts. @hoyosjs @ellahathaway
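Roughly, the build-entrypoint piece could be as small as setting the runtime's documented crash-dump knobs at job scope. A minimal sketch, assuming a dumps directory under Agent.TempDirectory (the variable names are the documented runtime ones; everything else here is illustrative, not what Arcade does today):

```yaml
# Sketch only: opt every .NET process in the job into the runtime's
# built-in createdump-on-crash support. The dumps directory is a
# made-up example location.
variables:
  DOTNET_DbgEnableMiniDump: 1      # write a dump when a .NET process crashes
  DOTNET_DbgMiniDumpType: 2        # 2 = dump with heap (4 = full dump)
  DOTNET_DbgMiniDumpName: $(Agent.TempDirectory)/dumps/%e-%p.dmp  # %e = exe name, %p = PID
```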

Here's the way @ellahathaway was doing this for VMR: https://github.com/dotnet/sdk/pull/42320

I think that approach could be generalized and the rough edges (like the crossgen error) fixed.

hoyosjs commented 4 months ago

I have been thinking about this, but I am not sure uploading dumps as artifacts is a good idea for internal builds. Those machines are filled with secrets, and the dumps would need to go through a compliant pipeline. Perhaps for PRs + testing this is OK.
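For the "PRs + testing" case, the upload could be gated on the build reason so internal/official builds never publish dumps. A sketch, not Arcade's actual template (the artifact name and dump directory are assumptions, matching the sketch above):

```yaml
# Sketch only: publish dumps as a pipeline artifact, but only when the
# job failed and the build was triggered by a PR.
- task: PublishPipelineArtifact@1
  displayName: Upload crash dumps (failed PR builds only)
  condition: and(failed(), eq(variables['Build.Reason'], 'PullRequest'))
  inputs:
    targetPath: $(Agent.TempDirectory)/dumps
    artifact: CrashDumps
```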

ericstj commented 4 months ago

I think that's already a concern for logs from the official build machines. I'm not sure we can say that build outputs from official builds will never contain secrets; we just need to make sure they land somewhere secure. I wonder if AzDO has a way to classify the outputs of a build so that some things are more sensitive than others. Maybe some outputs could require some sort of JIT elevation to access.

I agree that we get good coverage with CI and PR validation, but I don't want us to shy away from tackling the log problem for official builds. There will always be problems unique to official builds.

hoyosjs commented 4 months ago

Secrets are usually env settings that get loaded into the msbuild processes that crash. Binlogs now get scrubbed; I am hesitant given the history of dumps and secrets.