dotnet / arcade

Tools that provide common build infrastructure for multiple .NET Foundation projects.

Cannot triage test failure due to large crash dump not getting uploaded #14201

Open RikkiGibson opened 11 months ago

RikkiGibson commented 11 months ago

We've been seeing consistent failures in this Roslyn Mac test job: https://dev.azure.com/dnceng-public/public/_build/results?buildId=461246&view=ms.vss-test-web.build-test-results-tab&runId=10423536&resultId=103188&paneView=dotnet-dnceng.dnceng-build-release-tasks.helix-test-information-tab

Unfortunately, we're not able to reproduce the failure locally, and aren't sure which test to blame. It would really help to have crash dumps here, but they're not being uploaded due to being too large (~4.6gb, larger than the ~1.6gb limit).

Is there anything we can do to further investigate the failure?

cc @jaredpar

garath commented 11 months ago

Some options were discussed in this thread. Did those have any impact?

jaredpar commented 11 months ago

@garath unfortunately no, been bogged down in a number of issues. The main thread I see that discusses the dump size is this one. There is no conclusion I see in that thread on how to make ulimit workable. There is a runtime thread where they recommend setting a number of environment variables to produce different dump types. That is something we can look into.

At the same time ... it seems wrong that we're doing this vs. Helix doing this. For whatever reason Mac dumps are huge to the point of being incompatible with our Helix queues. Our solution was to cap uploads (which seems net good), but it does not appear that we've done anything to make Mac dumps usable in Helix. Why have we kept ulimit -c unlimited when it seems it's just not going to produce an actionable artifact? If the runtime recommendations work, why didn't Helix just move to doing that vs. having every repo do it individually?
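For reference, a minimal sketch of the environment-variable approach mentioned above, assuming the runtime's standard DOTNET_DbgEnableMiniDump / DOTNET_DbgMiniDumpType knobs and an illustrative "dotnet test" invocation (the actual Roslyn/Helix work item command differs):

```python
import os
import subprocess

# Sketch only: values are illustrative, not the settings Roslyn or Helix actually use.
# DOTNET_DbgEnableMiniDump / DOTNET_DbgMiniDumpType are the runtime's crash-dump
# knobs (use the COMPlus_ prefix on older runtimes); type 2 ("heap") dumps are
# typically much smaller than the full core dumps produced via ulimit -c unlimited.
env = dict(os.environ)
env["DOTNET_DbgEnableMiniDump"] = "1"   # write a dump on an unhandled crash
env["DOTNET_DbgMiniDumpType"] = "2"     # 1=Mini, 2=Heap, 3=Triage, 4=Full

# "dotnet test" stands in for however the work item actually runs the tests.
subprocess.run(["dotnet", "test"], env=env, check=False)
```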

garath commented 11 months ago

unfortunately no, been bogged down in a number of issues

Sorry, just so I'm clear, does this mean Roslyn has had a chance to try these suggestions but they have not been effective in getting a dump for this particular problem, or does this mean that Roslyn has been busy with other (unrelated) issues and so has not had a chance to see if these suggestions help?

garath commented 11 months ago

At the same time ... it seems wrong that we're doing this vs. Helix doing this.

It's good that you opened this issue then to revisit macOS dumps now that some time has passed. Partner teams can hopefully add their learnings. The runtime team's process appears to be working for them; is it something that would work for everyone? (My understanding is that it has trade-offs that might not be appropriate for everyone.)

I'll venture to say that the reason Helix isn't doing this (yet) is simply because no one has yet identified a production-worthy solution. The last time this was visited, ideas were presented and trade-offs were discussed, but nothing emerged that clearly worked for everyone.

So, what implementations should be considered?

markwilkie commented 11 months ago

Adding @agocke too as he might have insight/opinions.

jaredpar commented 11 months ago

Sorry, just so I'm clear, does this mean Roslyn has had a chance to try these suggestions but they have not been effective in getting a dump for this particular problem, or does this mean that Roslyn has been busy with other (unrelated) issues and so has not had a chance to see if these suggestions help?

Been fighting a few other fires so haven't had a chance to try out the suggestions. Our plan at the moment is to try the environment variable approach and see if that works.

missymessa commented 8 months ago

Closing in favor of current Runtime efforts to solve dump management.

markwilkie commented 8 months ago

From Jared:

Roslyn and runtime have now spent a week of effort trying to get around this limitation in Helix. Nothing we do will work because the moment we touch the system the bug stops reproing. So we are at a point where:

There needs to be a relief valve here. Some flag, setting, etc. where we can break the limits and get valuable production crashes into the hands of the runtime team. The other outcome is we are effectively content with shipping this known crash because we've exhausted our other avenues of getting the repro.

markwilkie commented 8 months ago

Triage: who remembers what actually broke when dump sizes were over 1.6 GB? cc/ @ilyas1974 @garath @riarenas

riarenas commented 8 months ago

https://github.com/dotnet/core-eng/issues/15331 and https://github.com/dotnet/core-eng/issues/12275 have details on why we ended up limiting these.

markwilkie commented 8 months ago

Perfect, thanks! There's a discussion going on with @jaredpar, @steveisok, and @agocke about how to handle this. No consensus yet.

The crux of it seems to be that with Roslyn+Mac (mostly) there are times when a huge dump upload is necessary. How to handle this as an exception, however, is still the open question.

riarenas commented 8 months ago

There's additional dump info here: https://github.com/dotnet/core-eng/issues/15333

markwilkie commented 8 months ago

The current suggestion from @jaredpar is to have an ENV var to disable limit when needed. e.g. HELIX_OVERRIDE_DUMP_LIMIT=true.

Thoughts? I think we're still in the brainstorming phase of trying to figure out how to get visibility into these gigantic dumps from time to time (especially on Macs).

jaredpar commented 8 months ago

I do understand why the limits exist as a default. Particularly on OSX a "no limit" approach makes it easy for a single bad PR, or a bad test across many PRs, to overwhelm our systems. Essentially uploading so much data that we can't ignore the size anymore.

However, to me the limits should exist to protect us from accidental uploads of this nature. There needs to be a "break glass" option for the case where we have real dumps for real product issues that we need to get off of the machines and into the hands of the product team to look at.

Given we use environment variables to control Helix today (where to find items, where to put dumps, etc.), an environment variable seems like a natural approach: say HELIX_OVERRIDE_DUMP_LIMIT=true that we could set, though we can understand if we need a different tweak given the nature of Helix.
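A minimal sketch of what that break-glass check might look like in the upload path, assuming the ~1.6 GB limit and the HELIX_OVERRIDE_DUMP_LIMIT name from this thread; this is not the actual Helix executor code, just an illustration:

```python
import os

DUMP_SIZE_LIMIT_BYTES = int(1.6 * 1024 ** 3)  # the ~1.6 GB cap discussed above

def should_upload_dump(dump_path: str) -> bool:
    """Illustrative guard: upload small dumps, skip oversized ones unless overridden."""
    size = os.path.getsize(dump_path)
    if size <= DUMP_SIZE_LIMIT_BYTES:
        return True
    # Break-glass override: a repo opts in explicitly when it needs an oversized
    # dump for a real product investigation.
    if os.environ.get("HELIX_OVERRIDE_DUMP_LIMIT", "").lower() == "true":
        print(f"Uploading oversized dump ({size} bytes) because the override is set")
        return True
    print(f"Skipping dump upload: {size} bytes exceeds the limit")
    return False
```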

riarenas commented 8 months ago

An escape mechanism seems like a good idea.

One concern I'd have, and that we'd have to think through, is how to make sure a PR that enables this env var or setting, or whatever, doesn't get accidentally checked in and end up breaking everything really quickly. The previous issues showed that uploading these large Mac dumps very easily broke the queues.

Another naive idea. Since these are on-prem machines, is there something we can do to instead keep dumps on disk for a certain amount of time, and allow folks to connect to the machines to extract them after removing the machine from the queue?
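If dumps were parked on disk instead of uploaded, a simple retention sweep could keep machines from filling up; a sketch assuming an illustrative /cores directory and a 7-day window (neither is a real Helix setting):

```python
import os
import time

DUMP_DIR = "/cores"                   # hypothetical dump location on the machine
RETENTION_SECONDS = 7 * 24 * 60 * 60  # hypothetical 7-day retention window

def prune_old_dumps() -> None:
    """Delete dumps older than the retention window."""
    cutoff = time.time() - RETENTION_SECONDS
    for name in os.listdir(DUMP_DIR):
        path = os.path.join(DUMP_DIR, name)
        if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
            os.remove(path)
```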

riarenas commented 8 months ago

What if we wrote a supplementary script or tool that would:

  • remove the machine from Helix
  • get the crash dumps present in the machine for a given job
  • Put the machine back in Helix

(assuming we keep the dumps for a certain amount of time)

garath commented 8 months ago

What if we wrote a supplementary script or tool that would:

  • remove the machine from Helix
  • get the crash dumps present in the machine for a given job
  • Put the machine back in Helix

(assuming we keep the dumps for a certain amount of time)

This seems very similar to what the client code is doing already. Why not just have it ignore its own size limits if it sees the break-glass environment variable?

jaredpar commented 8 months ago

One concern I'd have and that we'd have to think through is how to make sure a PR that enables this env var or setting, or whatever, doesn't get accidentally checked in and ends up breaking everything really quickly.

The case I had in mind was specifically us checking in the variable. OSX is a CI only job for us (does not run on PR). There are a few reasons for that:

  1. Linux is a good enough proxy for PR level changes
  2. Helps reduce contention on the OSX pool of machines

So in this case we'd actually need to check in the override. If it became a contention point we could temporarily move this to run in PRs and just jam the retry button down until it hit though.

What if we wrote a supplementary script or tool that would

That would work on our end. It's very easy for us to see after the fact which runs failed. If it's easy enough for you all to yank machines out afterwards, that works.

garath commented 8 months ago

The Helix client code can add telemetry that measures how often dumps occur, how big they are, and how much time is spent handling them. That can be exposed in a dashboard in Grafana that should make it easy to see if something negative is happening.
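For illustration, the measurement itself is cheap; a sketch using stdlib logging as a stand-in for whatever telemetry sink actually feeds the Grafana dashboard:

```python
import logging
import os
import time

def handle_dump_with_telemetry(dump_path: str, handle_dump) -> None:
    """Wrap dump handling and record how big the dump was and how long it took."""
    size_bytes = os.path.getsize(dump_path)
    start = time.monotonic()
    try:
        handle_dump(dump_path)
    finally:
        elapsed = time.monotonic() - start
        logging.info("dump handled: path=%s size_bytes=%d seconds=%.1f",
                     dump_path, size_bytes, elapsed)
```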

riarenas commented 8 months ago

What if we wrote a supplementary script or tool that would:

  • remove the machine from Helix
  • get the crash dumps present in the machine for a given job
  • Put the machine back in Helix

(assuming we keep the dumps for a certain amount of time)

This seems very similar to what the client code is doing already. Why not just have it ignore its own size limits if it sees the break-glass environment variable?

I'm probably too paranoid, but I can see a dev checking in the env variable for PRs (for repos that do run Mac tests during PR validation), so I thought having a separate tool that can't be checked in by accident would be helpful.

No qualms about trying the env var approach until the first time we have to deal with a queue breakage in case I'm just being overly careful.

garath commented 8 months ago

In the original FR thread I mentioned this idea:

Here's the best solution I've got that you can implement right now: (1) Compress the dumps, (2) put them in HELIX_WORKITEM_UPLOAD_ROOT. From what I see, files there have no size limit but are subject to a total maximum time and a rate-of-upload minimum.

I'm curious if this was tried and what the result was?

Specifically, one of my concerns is that the problem may not be the upload speed so much as the risk that the dumps fail to upload at all. IIRC this was an issue in early iterations. The former problem is probably manageable using only the Helix client code. The latter problem seems more hardware-related and thus more complicated to make reliable.
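For completeness, the compress-and-stage suggestion quoted above is small in code terms; a sketch assuming a gzip'd copy and the HELIX_WORKITEM_UPLOAD_ROOT variable mentioned earlier (the dump path argument is illustrative):

```python
import gzip
import os
import shutil

def stage_dump_for_upload(dump_path: str) -> str:
    """Compress a dump into HELIX_WORKITEM_UPLOAD_ROOT so the normal artifact upload picks it up."""
    upload_root = os.environ["HELIX_WORKITEM_UPLOAD_ROOT"]  # set by Helix on the work item
    compressed = os.path.join(upload_root, os.path.basename(dump_path) + ".gz")
    with open(dump_path, "rb") as src, gzip.open(compressed, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return compressed
```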

jaredpar commented 8 months ago

I'm curious if this was tried and what the result was?

We've tried a number of tricks to reduce the dump size, and the dumps all still ended up above the Helix limits.

markwilkie commented 8 months ago

The safest course of action here seems to be the ability to break glass by sequestering the machine with the HUGE dump on it. (notice I didn't write pillow)

Is that work something that would reasonably fit into our ops rotation?

garath commented 8 months ago

Is that work something that would reasonably fit into our ops rotation?

By default, "yes". If it is discovered to be too big, then we evaluate alternative ways to get the job done (longer rotation, mini-epic, etc.).

In this case, my memory from when I looked at the helix client code for this issue before is that it's a pretty straightforward if/then/else guard. So probably reasonable.

How should we view the priority? I'll check it against the backlog and try to give an idea of when it can get done.

markwilkie commented 8 months ago

I'll check it against the backlog and try to give an idea of when it can get done.

Perfect. And, perhaps the plan will change, but at least we've got an idea to try.

riarenas commented 8 months ago

In this case, my memory from when I looked at the helix client code for this issue before is that it's a pretty straightforward if/then/else guard. So probably reasonable.

I was there because that's how I found the older dump size issues: https://dnceng.visualstudio.com/internal/_git/dotnet-helix-machines?path=/resources/helix-scripts/helix/executor.py&version=GBmain&line=746&lineEnd=747&lineStartColumn=1&lineEndColumn=1&lineStyle=plain&_a=contents

markwilkie commented 7 months ago

Can you confirm @garath and @ilyas1974 that this is up next for triage?

garath commented 7 months ago

I do so confirm.