Open fanyang-mono opened 10 months ago
Hello @fanyang-mono, could you please update the "ErrorMessage" : "" by following the step-by-step documentation on how to create a known issue?
Updated.
It's likely your process is using too much memory. Check to see when this started and if there were code changes around that time that could have caused this to occur.
@fanyang-mono, is this an infra issue? It looks like the errors are isolated to Runtime.
@lewing Could you please confirm that this is a wasm build issue? This is the direct link to the build log https://dev.azure.com/dnceng-public/public/_build/results?buildId=351450&view=logs&j=d4e38924-13a0-58bd-9074-6a4810543e7c&t=102a6595-1420-53fc-8f17-b0a3f4b1242a&l=5722
https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_apis/build/builds/352553/logs/541 is definitely not a wasm build issue
Exit code 137 typically means the process was sent SIGKILL (128 + 9 = 137). Given that this is happening inside Docker containers, it is likely because they are hitting resource limits.
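For anyone following along, the 128 + 9 = 137 arithmetic comes from the common shell convention that an exit code above 128 means "terminated by signal (code - 128)". Here is a minimal sketch of that decoding; `describe_exit_code` is a hypothetical helper name, not part of any Helix or Azure DevOps tooling, and the signal names are whatever the local platform defines (Linux assumed).

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret a shell-style exit code.

    By convention, codes above 128 mean the process was terminated
    by signal number (code - 128), and 127 means "command not found".
    """
    if code > 128:
        signum = code - 128
        try:
            # Look up the symbolic name, e.g. 9 -> SIGKILL on Linux.
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not defined on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    if code == 127:
        return "command not found"
    return f"exited with status {code}"


if __name__ == "__main__":
    print(describe_exit_code(137))
```

A SIGKILL on its own does not prove an out-of-memory kill, since anything can send signal 9, but inside a memory-constrained container it is the usual suspect.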
what are the limits on the cloudtest containers?
Based on the tracking we're seeing failures across multiple unrelated lanes (although they tend to be llvm related lanes). This is going to continue to cause pain unless we can get some idea of which processes are using memory at the point that the container is killed.
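One way to get a rough picture of which processes are holding memory is to snapshot resident set sizes from `/proc` periodically during the build and keep the last snapshot taken before the kill. The sketch below assumes a Linux container with `/proc` mounted; `parse_status` and `top_memory_processes` are hypothetical helper names, and nothing here is an existing part of the runtime or Helix infrastructure.

```python
import os
from typing import List, Tuple


def parse_status(text: str) -> Tuple[str, int]:
    """Return (process name, resident set size in kB) from /proc/<pid>/status text.

    Kernel threads have no VmRSS line, so their size is reported as 0.
    """
    name, rss_kb = "?", 0
    for line in text.splitlines():
        if line.startswith("Name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("VmRSS:"):
            # Line looks like "VmRSS:     123456 kB".
            rss_kb = int(line.split()[1])
    return name, rss_kb


def top_memory_processes(limit: int = 10, proc_root: str = "/proc") -> List[Tuple[int, str, int]]:
    """Return up to `limit` (pid, name, rss_kb) rows, largest resident set first."""
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "status")) as handle:
                name, rss_kb = parse_status(handle.read())
        except OSError:
            continue  # process exited between listing and reading
        rows.append((int(entry), name, rss_kb))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:limit]


if __name__ == "__main__":
    for pid, name, rss_kb in top_memory_processes(5):
        print(f"{pid:>8} {rss_kb:>12} kB  {name}")
```

Running this in a background loop that appends to a log file would leave the last snapshot behind even when the container itself is killed, provided the log is written to a mounted volume.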
@missymessa It would be very helpful to know what the limits are on the container. We might be running too close to the limits, in which case it would be helpful to have those bumped up.
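Until someone can confirm the configured values, the effective memory limit can also be read from inside the container through the cgroup filesystem. This is a sketch only: it assumes the standard cgroup v2 file `memory.max` or the cgroup v1 file `memory/memory.limit_in_bytes` under `/sys/fs/cgroup`, and `parse_limit` and `read_memory_limit` are hypothetical helper names.

```python
from pathlib import Path
from typing import Optional


def parse_limit(raw: str) -> Optional[int]:
    """Parse a cgroup memory limit value; return None when there is no limit."""
    value = raw.strip()
    if value == "max":
        # cgroup v2 spelling of "unlimited".
        return None
    limit = int(value)
    # cgroup v1 reports "unlimited" as a huge number close to 2**63.
    return None if limit >= 2**60 else limit


def read_memory_limit(root: str = "/sys/fs/cgroup") -> Optional[int]:
    """Return the memory limit in bytes, or None if unlimited or not found."""
    candidates = [
        Path(root) / "memory.max",                       # cgroup v2
        Path(root) / "memory" / "memory.limit_in_bytes",  # cgroup v1
    ]
    for path in candidates:
        if path.is_file():
            return parse_limit(path.read_text())
    return None


if __name__ == "__main__":
    limit = read_memory_limit()
    print("no limit detected" if limit is None else f"{limit / 2**30:.2f} GiB")
```

Printing this at the start of a build step would show whether the container actually has a cap and how close the build is running to it.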
@dotnet/dnceng this is causing considerable pain. How should we escalate it? We can't diagnose the failures across multiple lanes and different runtimes without more detail.
previous teams dealing w/ exit code 137 have worked w/ people on the runtime team to collect crash dumps and determine the root cause. it's also likely something changed in the runtime repo about a month ago that led to this issue.
@dougbu the failures here are fairly random and span very different runtimes so a crash dump isn't likely to be deterministic. I would love to see the state of the container at shutdown time.
cc @agocke for the nativeAOT failures
@dougbu or edit the core information to retry, I can't
@lewing we don't have much to go on here. for one thing, we don't mess w/ "limits" in the Helix queues other than the file count maximum.
suggest you use the helix-repro-vms DevTest Labs to create a VM matching the queue used in your tests. then, do whatever you can to run the tests on that VM in a way that captures a dump. the dump should at least indicate what is causing the exit code. note the core dump should be created in the main process, not w/in the Docker container. I believe @agocke has experience using dumps to debug occasional build and test strangeness.
we can increase whatever limit appears to be the problem, within limits.
on test retries, please consider changing your eng/test-configuration.json file. that's documented in https://github.com/dotnet/arcade/blob/d3b8861e20aaf0179034c6076d156e2442b26f9b/src/Microsoft.DotNet.Helix/Sdk/Readme.md#test-retry and dotnet/runtime's file already automatically retries based on a handful of error messages
oh, btw, if it's a true memory restriction as dotnet/runtime#89402 was, we might be able to bump things up. however there might not be budget and the problem certainly isn't related to a decrease in anything on our side. more likely the test count or memory footprint went up before this issue was observed. if that's the case, the most straightforward fix would be to split a large test project in two
According to the table, the linux-x64 Mono LLVMFullAot RuntimeTests lane also very often ran out of memory in the Docker container during AOT.
Should this be moved to the runtime repo since it only affects that repo, especially since we're waiting for additional information while they check a repro vm?
@lewing please move this to the runtime repo (and, perhaps, work using the helix-repro-vms to narrow the issue down). when you've found a specific action to take, please describe it in the First Responders channel. we may have a way to bump limits but it's more likely the runtime team will need to reduce or simplify something to resolve this issue.
ping @lewing. we're still hitting this problem occasionally but I'm not seeing anything outside runtime builds. there might be some change we could make but we don't have any information on our side. if you have a suggestion…
Build
https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=351450
Build leg reported
Build / browser-wasm linux Release LibraryTests / Build product
Pull Request
https://github.com/dotnet/runtime/pull/89217
Known issue core information
Fill out the known issue JSON section by following the step by step documentation on how to create a known issue
@dotnet/dnceng
Release Note Category
Release Note Description
Additional information about the issue reported
No response
Report
Summary
Known issue validation
Build: :mag_right: https://dev.azure.com/dnceng-public/public/_build/results?buildId=351450
Error message validated:
[error]Exit code 137 returned from process: file name '/usr/bin/docker'
Result validation: :white_check_mark: Known issue matched with the provided build.
Validation performed at: 7/26/2023 2:43:39 PM UTC