Open RikkiGibson opened 11 months ago
Some options were discussed in this thread. Did those have any impact?
@garath unfortunately no, been bogged down in a number of issues. The main thread I see that discusses the dump size is this one. There is no conclusion I see in that thread on how to make ulimit workable. There is a runtime thread where they recommend setting a number of environment variables to produce different dump types. That is something we can look into.
At the same time ... it seems wrong that we're doing this vs. Helix doing this. For whatever reason Mac dumps are huge to the point of being incompatible with our Helix queues. Our solution was to cap uploads (which seems net good), but it does not appear that we've done anything to make Mac dumps usable in Helix. Why have we kept ulimit -c unlimited when it seems it's just not going to produce an actionable artifact? If the runtime recommendations work, why didn't Helix just move to doing that instead of having every repo do it individually?
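For reference, the ulimit -c setting under discussion is the POSIX RLIMIT_CORE resource limit, which caps the size of core files the kernel will write. A minimal sketch of inspecting and raising it from Python (Unix only; children inherit the setting):

```python
import resource

# "ulimit -c unlimited" is the shell spelling of raising RLIMIT_CORE,
# the kernel cap on core-file size, for a process and its children.
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))  # raise the soft limit to the hard cap
```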
unfortunately no, been bogged down in a number of issues
Sorry, just so I'm clear, does this mean Roslyn has had a chance to try these suggestions but they have not been effective in getting a dump for this particular problem, or does this mean that Roslyn has been busy with other (unrelated) issues and so has not had a chance to see if these suggestions help?
At the same time ... it seems wrong that we're doing this vs. Helix doing this.
It's good that you opened this issue then to revisit macOS dumps now that some time has passed. Partner teams can hopefully add their learnings. The runtime team's process appears to be working for them; is it something that would work for everyone? (My understanding is that it has trade-offs that might not be appropriate for everyone.)
I'll venture to say that the reason Helix isn't doing this (yet) is simply because no one has yet identified a production-worthy solution. The last time this was visited, ideas were presented and trade-offs were weighed, but nothing emerged that clearly worked for everyone.
So, what implementations should be considered?
Adding @agocke too as he might have insight/opinions.
Sorry, just so I'm clear, does this mean Roslyn has had a chance to try these suggestions but they have not been effective in getting a dump for this particular problem, or does this mean that Roslyn has been busy with other (unrelated) issues and so has not had a chance to see if these suggestions help?
Been fighting a few other fires so haven't had a chance to try out the suggestions. Our plan at the moment is to try the environment variable approach and see if that works.
Closing in light of current Runtime efforts to solve dump management.
From Jared:
Roslyn and runtime have now spent a week of effort trying to get around this limitation in Helix. Nothing we do will work because the moment we touch the system the bug stops reproing. So we are at a point where there needs to be a relief valve here: some flag, setting, etc. where we can break the limits and get valuable production crashes into the hands of the runtime team. The other outcome is that we are effectively content shipping this known crash because we've exhausted our other avenues of getting the repro.
triage - who remembers what actually broke when dump sizes were over 1.6gb? cc/ @ilyas1974 @garath @riarenas
https://github.com/dotnet/core-eng/issues/15331 and https://github.com/dotnet/core-eng/issues/12275 have details on why we ended up limiting these.
Perfect - thanks! There's a discussion going on with @jaredpar, @steveisok, and @agocke about how to handle this. No consensus yet.
The crux of it seems to be that with Roslyn+Mac (mostly) there are times when a huge dump upload is necessary. How to handle this as an exception is the open question however....
There's additional dump info here: https://github.com/dotnet/core-eng/issues/15333
The current suggestion from @jaredpar is to have an ENV var to disable limit when needed. e.g. HELIX_OVERRIDE_DUMP_LIMIT=true.
Thoughts? I think we're still in the brainstorming phase of trying to figure out how to get visibility into these gigantic dumps from time to time (especially on Macs).
I do understand why the limits exist as a default. Particularly on OSX, a "no limit" approach makes it easy for a single bad PR, or a bad test across many PRs, to overwhelm our systems by uploading so much data that we can't ignore the size anymore.
However, to me the limits should exist to protect us from accidental uploads of this nature. There needs to be a "break glass" option for the case where we have real dumps for real product issues that we need to get off of the machines and into the hands of the product team to look at.
Given we use environment variables to control helix today (where to find items, where to put dumps, etc ...) an environment variable seems like a natural approach. Say HELIX_OVERRIDE_DUMP_LIMIT=true
that we could set but can understand if we need a different tweak given nature of helix.
An escape mechanism seems like a good idea.
One concern I'd have and that we'd have to think through is how to make sure a PR that enables this env var or setting, or whatever, doesn't get accidentally checked in and ends up breaking everything really quickly. The previous issues showed that uploading these large mac dumps very easily broke the queues.
Another naive idea. Since these are on-prem machines, is there something we can do to instead keep dumps on disk for a certain amount of time, and allow folks to connect to the machines to extract them after removing the machine from the queue?
What if we wrote a supplementary script or tool that would:
- remove the machine from Helix
- get the crash dumps present in the machine for a given job
- Put the machine back in Helix
(assuming we keep the dumps for a certain amount of time)
One concern I'd have and that we'd have to think through is how to make sure a PR that enables this env var or setting, or whatever, doesn't get accidentally checked in and ends up breaking everything really quickly.
The case I had in mind was specifically us checking in the variable. OSX is a CI only job for us (does not run on PR). There are a few reasons for that:
So in this case we'd actually need to check in the override. If it became a contention point we could temporarily move this to run in PRs and just jam the retry button down until it hit though.
What if we wrote a supplementary script or tool that would
That would work on our end. It's very easy for us to see after the fact which runs failed. If it's easy enough for you all to yank machines out afterwards, that works.
The helix client code can add telemetry that measures how often, how big and how much time is spent handling dumps. That can be exposed in a dashboard in Grafana that should make it easy to see if something negative is happening.
What if we wrote a supplementary script or tool that would:
- remove the machine from Helix
- get the crash dumps present in the machine for a given job
- Put the machine back in Helix
(assuming we keep the dumps for a certain amount of time)
This seems very similar to what the client code is doing already. Why not just have it ignore its own size limits if it sees the break-glass environment variable?
I'm probably too paranoid, but I can see a dev checking in the env variable for PRs (for repos that do run mac tests during PR validation) so I thought having a separate tool that can't be checked in by accident would be helpful.
No qualms about trying the env var approach until the first time we have to deal with a queue breakage in case I'm just being overly careful.
In the original FR thread I mentioned this idea:
Here's the best solution I've got that you can implement right now: (1) Compress the dumps, (2) put them in HELIX_WORKITEM_UPLOAD_ROOT. From what I see, files there have no size limit but are subject to a total maximum time and a rate-of-upload minimum.
I'm curious if this was tried and what the result was?
Specifically, one of my concerns is that the problem may not be the upload speed so much as the risk that they fail to upload at all. IIRC this was an issue in early iterations. The former problem is probably manageable using only the Helix client code. The latter problem seems more hardware-related and thus more complicated to make reliable.
I'm curious if this was tried and what the result was?
We've tried a number of tricks to reduce the dump size, and the results all still came in above the Helix limits.
The safest course of action here seems to be the ability to break glass by sequestering the machine with the HUGE dump on it. (notice I didn't write pillow)
Is that work something that would reasonably fit into our ops rotation?
Is that work something that would reasonably fit into our ops rotation?
By default, "yes". If it is discovered to be too big, then we evaluate alternative ways to get the job done (longer rotation, mini-epic, etc.).
In this case, my memory from when I looked at the helix client code for this issue before is that it's a pretty straightforward if/then/else guard. So probably reasonable.
How should we view the priority? I'll check it against the backlog and try to give an idea of when it can get done.
I'll check it against the backlog and try to give an idea of when it can get done.
Perfect. And, perhaps the plan will change, but at least we've got an idea to try.
In this case, my memory from when I looked at the helix client code for this issue before is that it's a pretty straightforward if/then/else guard. So probably reasonable.
I was there because that's how I found the older dump size issues: https://dnceng.visualstudio.com/internal/_git/dotnet-helix-machines?path=/resources/helix-scripts/helix/executor.py&version=GBmain&line=746&lineEnd=747&lineStartColumn=1&lineEndColumn=1&lineStyle=plain&_a=contents
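For illustration, that if/then/else guard might look something like this (a hypothetical sketch, not the actual executor.py code; HELIX_OVERRIDE_DUMP_LIMIT is the variable name proposed earlier in this thread):

```python
import os

DUMP_SIZE_LIMIT_BYTES = int(1.6 * 1024 ** 3)  # the ~1.6gb cap discussed above

def should_upload_dump(size_bytes, env=None):
    """Hypothetical break-glass check in the Helix client's dump upload path."""
    env = os.environ if env is None else env
    if env.get("HELIX_OVERRIDE_DUMP_LIMIT", "").lower() == "true":
        return True  # break glass: upload regardless of size
    return size_bytes <= DUMP_SIZE_LIMIT_BYTES
```

Pairing this with the telemetry idea above (count, size, and time spent per oversized upload) would make it visible in Grafana if the override starts getting abused.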
Can you confirm @garath and @ilyas1974 that this is up next for triage?
I do so confirm.
We've been seeing consistent failures in this Roslyn Mac test job: https://dev.azure.com/dnceng-public/public/_build/results?buildId=461246&view=ms.vss-test-web.build-test-results-tab&runId=10423536&resultId=103188&paneView=dotnet-dnceng.dnceng-build-release-tasks.helix-test-information-tab
Unfortunately, we're not able to reproduce the failure locally, and aren't sure which test to blame. It would really help to have crash dumps here, but they're not being uploaded due to being too large (~4.6gb, larger than the ~1.6gb limit).
Is there anything we can do to further investigate the failure?
cc @jaredpar