dotnet / dnceng

.NET Engineering Services

Insufficient memory of docker containers on CI #450

Open fanyang-mono opened 10 months ago

fanyang-mono commented 10 months ago

Build

https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=351450

Build leg reported

Build / browser-wasm linux Release LibraryTests / Build product

Pull Request

https://github.com/dotnet/runtime/pull/89217

Known issue core information

Fill out the known issue JSON section by following the step by step documentation on how to create a known issue

 {
    "ErrorMessage" : "[error]Exit code 137 returned from process: file name '/usr/bin/docker'",
    "BuildRetry": false,
    "ErrorPattern": "",
    "ExcludeConsoleLog": false
 }

@dotnet/dnceng

Release Note Category

Additional information about the issue reported

No response

Report

| Build | Definition | Step Name | Console log | Pull Request |
| --- | --- | --- | --- | --- |
| 694253 | dotnet/runtime | Build product | Log | |
| 690754 | dotnet/runtime | Build product | Log | |
| 686902 | dotnet/runtime | Build product | Log | |
| 686411 | dotnet/runtime | Build product | Log | |
| 683686 | dotnet/runtime | Build Tests | Log | dotnet/runtime#102505 |
| 682021 | dotnet/runtime | LLVM AOT compile CoreCLR tests | Log | |
| 681751 | dotnet/runtime | LLVM AOT compile CoreCLR tests | Log | |
| 680841 | dotnet/runtime | Build Tests | Log | dotnet/runtime#102432 |
| 679966 | dotnet/runtime | LLVM AOT compile CoreCLR tests | Log | |
| 679692 | dotnet/runtime | LLVM AOT compile CoreCLR tests | Log | |
| 679746 | dotnet/runtime | Build Tests | Log | dotnet/runtime#102400 |
| 679285 | dotnet/runtime | Build product | Log | |
| 678653 | dotnet/runtime | LLVM AOT compile CoreCLR tests | Log | |
| 678386 | dotnet/runtime | LLVM AOT compile CoreCLR tests | Log | |
| 677394 | dotnet/runtime | LLVM AOT compile CoreCLR tests | Log | |
| 675712 | dotnet/runtime | LLVM AOT compile CoreCLR tests | Log | |
| 675458 | dotnet/runtime | LLVM AOT compile CoreCLR tests | Log | |
| 674146 | dotnet/runtime | LLVM AOT compile CoreCLR tests | Log | |
| 671841 | dotnet/runtime | LLVM AOT compile CoreCLR tests | Log | |
| 664347 | dotnet/runtime | LLVM AOT compile CoreCLR tests | Log | |

Summary

| 24-Hour Hit Count | 7-Day Hit Count | 1-Month Count |
| --- | --- | --- |
| 0 | 2 | 20 |

Known issue validation

Build: :mag_right: https://dev.azure.com/dnceng-public/public/_build/results?buildId=351450

Error message validated: [error]Exit code 137 returned from process: file name '/usr/bin/docker'

Result validation: :white_check_mark: Known issue matched with the provided build.

Validation performed at: 7/26/2023 2:43:39 PM UTC

andriipatsula commented 10 months ago

Hello @fanyang-mono, could you please update the "ErrorMessage" : "" field by following the step by step documentation on how to create a known issue?

fanyang-mono commented 10 months ago

Updated.

missymessa commented 10 months ago

It's likely your process is using too much memory. Check to see when this started and if there were code changes around that time that could have caused this to occur.

https://www.airplane.dev/blog/exit-code-137

missymessa commented 10 months ago

@fanyang-mono, is this an infra issue? It looks like the errors are isolated to Runtime.

fanyang-mono commented 10 months ago

@lewing Could you please confirm that this is a wasm build issue? This is the direct link to the build log https://dev.azure.com/dnceng-public/public/_build/results?buildId=351450&view=logs&j=d4e38924-13a0-58bd-9074-6a4810543e7c&t=102a6595-1420-53fc-8f17-b0a3f4b1242a&l=5722

lewing commented 10 months ago

https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_apis/build/builds/352553/logs/541 is definitely not a wasm build issue

lewing commented 10 months ago

exit code 137 typically means the process was sent a SIGKILL: 128 + 9 = 137. Given that this is happening inside docker containers, it is likely because they are hitting resource limits.
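The 128 + 9 arithmetic above can be checked with a short Python sketch. The helper name `describe_exit_code` is made up for illustration; the only thing it relies on is the shell convention that a status above 128 means "terminated by signal (status - 128)", and the standard `signal` module for the signal names.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a shell-style exit status (hypothetical helper, not part of any CI tooling)."""
    if code > 128:
        # By shell convention, 128 + N means the process was terminated by signal N.
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # The number does not map to a named signal on this platform.
            name = f"signal {signum}"
        return f"killed by {name} ({signum})"
    return f"exited with status {code}"


if __name__ == "__main__":
    print(describe_exit_code(137))
```

On Linux, signal 9 is SIGKILL, which is what the kernel's out-of-memory killer sends. The exit code alone does not say who sent the signal, so it is consistent with a memory limit being hit but does not prove it.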

lewing commented 10 months ago

what are the limits on the cloudtest containers?

lewing commented 9 months ago

Based on the tracking we're seeing failures across multiple unrelated lanes (although they tend to be llvm related lanes). This is going to continue to cause pain unless we can get some idea of which processes are using memory at the point that the container is killed.
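One way to get at "which processes are using memory" is to sample resident memory from `/proc` periodically during the build step, since a process killed with SIGKILL gets no chance to report anything itself. The following is only a sketch under that assumption: the function names are hypothetical, it must run inside the same container as the build, and it reads the standard Linux `/proc/<pid>/status` and `/proc/<pid>/comm` files.

```python
import os
from typing import List, Optional, Tuple


def parse_rss_kib(status_text: str) -> Optional[int]:
    """Extract VmRSS (in KiB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:      2048 kB".
            return int(line.split()[1])
    return None  # kernel threads have no VmRSS line


def top_memory_processes(limit: int = 10) -> List[Tuple[int, int, str]]:
    """Return (rss_kib, pid, name) for the processes with the largest resident set."""
    rows = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/status") as f:
                status = f.read()
            with open(f"/proc/{entry}/comm") as f:
                name = f.read().strip()
        except OSError:
            continue  # the process exited while we were reading it
        rss = parse_rss_kib(status)
        if rss is not None:
            rows.append((rss, int(entry), name))
    return sorted(rows, reverse=True)[:limit]


if __name__ == "__main__":
    for rss_kib, pid, name in top_memory_processes():
        print(f"{rss_kib:>12} KiB  pid={pid:<8} {name}")
```

Logging this output every few seconds to the build log would leave a record of the last known memory state before the container is killed. The sampling interval and where the output goes are left open here.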

radical commented 9 months ago

@missymessa It would be very helpful to know what the limits are on the container. We might be running too close to the limits, in which case it would be helpful to have those bumped up.
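If a memory limit is applied to the container, it is visible from inside the container through the cgroup filesystem. A minimal Python sketch for reading it, assuming the usual mount points for cgroup v2 and v1 (the function names are hypothetical, and whether these containers have any limit set at all is not known from this thread):

```python
from pathlib import Path
from typing import Optional

# Usual locations of the memory limit as seen from inside a container.
LIMIT_FILES = (
    Path("/sys/fs/cgroup/memory.max"),                    # cgroup v2
    Path("/sys/fs/cgroup/memory/memory.limit_in_bytes"),  # cgroup v1
)


def parse_limit(raw: str) -> Optional[int]:
    """Return the limit in bytes, or None when cgroup v2 reports 'max' (no limit)."""
    raw = raw.strip()
    if raw == "max":
        return None
    # Note: cgroup v1 reports "no limit" as a very large number rather than 'max'.
    return int(raw)


def container_memory_limit() -> Optional[int]:
    """Read the first limit file that exists; None if no limit or no file is found."""
    for path in LIMIT_FILES:
        try:
            return parse_limit(path.read_text())
        except OSError:
            continue  # file missing or unreadable, try the next layout
    return None


if __name__ == "__main__":
    limit = container_memory_limit()
    if limit is None:
        print("no cgroup memory limit found")
    else:
        print(f"cgroup memory limit: {limit / 2**30:.2f} GiB")
```

If this reports no limit, the kill would more likely come from the host running out of memory than from a per-container cap.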

lewing commented 9 months ago

@dotnet/dnceng this is causing considerable pain. How should we escalate it? We can't diagnose the failures across multiple lanes and different runtimes without more detail.

dougbu commented 9 months ago

previous teams dealing w/ exit code 137 have worked w/ people on the runtime team to collect crash dumps and determine the root cause. it's also likely something changed in the runtime repo about a month ago that led to this issue.

lewing commented 9 months ago

@dougbu the failures here are fairly random and span very different runtimes so a crash dump isn't likely to be deterministic. I would love to see the state of the container at shutdown time.

lewing commented 9 months ago

cc @agocke for the nativeAOT failures

lewing commented 9 months ago

@dougbu or edit the core information to retry; I can't edit it myself.

lewing commented 9 months ago

also https://github.com/dotnet/runtime/issues/89402

dougbu commented 9 months ago

@lewing we don't have much to go on here. for one thing, we don't mess w/ "limits" in the Helix queues other than the file count maximum.

suggest you use the helix-repro-vms DevTest Labs to create a VM matching the queue used in your tests. then, do whatever you can to run the tests on that VM in a way that captures a dump. the dump should at least indicate what is causing the exit code. note the core dump should be created in the main process, not w/in the Docker container. I believe @agocke has experience using dumps to debug occasional build and test strangeness.

we can increase whatever limit appears to be the problem, within limits.

dougbu commented 9 months ago

on test retries, please consider changing your eng/test-configuration.json file. that's documented in https://github.com/dotnet/arcade/blob/d3b8861e20aaf0179034c6076d156e2442b26f9b/src/Microsoft.DotNet.Helix/Sdk/Readme.md#test-retry and dotnet/runtime's file already automatically retries based on a handful of error messages

dougbu commented 9 months ago

oh, btw, if it's a true memory restriction as dotnet/runtime#89402 was, we might be able to bump things up. however there might not be budget and the problem certainly isn't related to a decrease in anything on our side. more likely the test count or memory footprint went up before this issue was observed. if that's the case, the most straightforward fix would be to split a large test project in two

fanyang-mono commented 9 months ago

According to the table, the linux-x64 Mono LLVMFullAot RuntimeTests lane also very often ran out of memory in the docker container during AOT compilation.

riarenas commented 9 months ago

Should this be moved to the runtime repo since it only affects that repo, especially since we're waiting for additional information while they check a repro vm?

dougbu commented 6 months ago

@lewing please move this to the runtime repo (and, perhaps, work using the helix-repro-vms to narrow the issue down). when you've found a specific action to take, please describe it in the First Responders channel. we may have a way to bump limits but it's more likely the runtime team will need to reduce or simplify something to resolve this issue.

dougbu commented 4 months ago

ping @lewing. we're still hitting this problem occasionally but I'm not seeing anything outside runtime builds. there might be some change we could make but we don't have any information on our side. if you have a suggestion…