dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.29k stars 4.74k forks source link

[wasm][aot] Link step failing for `Microsoft.Extensions.Logging.Generators.Roslyn3.11.Tests` - `wasm-opt` getting killed #51961

Closed radical closed 2 years ago

radical commented 3 years ago

~System.Text.Json.Tests~

Microsoft.Extensions.Logging.Generators.Roslyn3.11.Tests, log

emcc : error : '/datadisks/disk1/work/B7A309CE/p/build/emsdk/upstream/bin/wasm-opt --post-emscripten --no-exit-runtime -O2 --low-memory-unused --zero-filled-memory --strip-producers /datadisks/disk1/work/B7A309CE/w/B09909A9/e/wasm_build/obj/wasm/for-build/dotnet.wasm -o /datadisks/disk1/work/B7A309CE/w/B09909A9/e/wasm_build/obj/wasm/for-build/dotnet.wasm -g --mvp-features --enable-mutable-globals'
   failed (received SIGKILL (-9)) [/datadisks/disk1/work/B7A309CE/w/B09909A9/e/publish/ProxyProjectForAOTOnHelix.proj]
[ 2582.708542] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/helix.service,task=wasm-opt,pid=19587,uid=1000
[ 2582.708558] Out of memory: Killed process 19587 (wasm-opt) total-vm:7377176kB, anon-rss:7254096kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:14340kB oom_score_adj:0
[ 2582.925546] oom_reaper: reaped process 19587 (wasm-opt), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

The binlog should have details of why this failed. The binlog is included in the helix data.

ghost commented 3 years ago

Tagging subscribers to 'arch-wasm': @lewing See info in area-owners.md if you want to be subscribed.

Issue Details
[System.Text.Json.Tests](https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-51697-merge-72ac59d12bb14b39b3/System.Text.Json.Tests/console.83352166.log?sv=2019-07-07&se=2021-05-17T21%3A14%3A13Z&sr=c&sp=rl&sig=7JbQysPBNScc45cDhvWvUsLBKY1%2F93p9OPVqSmVokwQ%3D) ``` emcc : error : '/home/helixbot/work/AA8B0906/p/build/emsdk/upstream/bin/wasm-opt --post-emscripten --no-exit-runtime -Oz --low-memory-unused --zero-filled-memory --strip-producers /home/helixbot/work/AA8B0906/w/A4FB08F1/e/wasm_build/obj/wasm/dotnet.wasm -o /home/helixbot/work/AA8B0906/w/A4FB08F1/e/wasm_build/obj/wasm/dotnet.wasm -g --mvp-features --enable-mutable-globals' failed (-9) [/home/helixbot/work/AA8B0906/w/A4FB08F1/e/publish/AOTTestProjectForHelix.proj] /home/helixbot/work/AA8B0906/p/build/wasm/WasmApp.targets(443,5): error MSB3073: The command "emcc "@/home/helixbot/work/AA8B0906/w/A4FB08F1/e/wasm_build/obj/wasm/emcc-link.rsp"" exited with code 1. [/home/helixbot/work/AA8B0906/w/A4FB08F1/e/publish/AOTTestProjectForHelix.proj] ``` The binlog should have details of why this failed. The binlog is included in the helix data.
Author: radical
Assignees: -
Labels: `arch-wasm`, `area-Codegen-AOT-mono`, `disabled-test`, `wasm-aot-test`
Milestone: -
vargaz commented 3 years ago

This probably happens because the code is just too big.

radical commented 3 years ago

The current failure:

         cache:INFO:  - ok
          "/datadisks/disk1/work/A6A60972/p/build/emsdk/upstream/bin/wasm-ld" @/datadisks/disk1/work/A6A60972/t/emscripten_fqv4w7nk.rsp
     1>emcc : error : '/datadisks/disk1/work/A6A60972/p/build/emsdk/upstream/bin/wasm-ld @/datadisks/disk1/work/A6A60972/t/emscripten_fqv4w7nk.rsp' failed (-9) [/datadisks/disk1/work/A6A60972/w/A34508DD/e/publish/AOTTestProjectForHelix.proj]
     1>/datadisks/disk1/work/A6A60972/p/build/wasm/WasmApp.Native.targets(278,5): error MSB3073: The command "emcc "@/datadisks/disk1/work/A6A60972/p/build/microsoft.netcore.app.runtime.browser-wasm/runtimes/browser-wasm/native/src/emcc-default.rsp" "@/datadisks/disk1/work/A6A60972/w/A34508DD/e/wasm_build/obj/wasm/emcc-link.rsp"" exited with code 1. [/datadisks/disk1/work/A6A60972/w/A34508DD/e/publish/AOTTestProjectForHelix.proj]
     1>Done Building Project "/datadisks/disk1/work/A6A60972/w/A34508DD/e/publish/AOTTestProjectForHelix.proj" (default targets) -- FAILED.
lewing commented 3 years ago

This should be fixed by https://github.com/dotnet/runtime/pull/53603 and the flto changes

ericstj commented 3 years ago

I think this is still a problem. Hit this in release/6.0-rc2:


  Linking with emcc. This may take a while ...
   "/datadisks/disk1/work/B1890944/p/build/emsdk/upstream/bin/wasm-ld" @/datadisks/disk1/work/B1890944/t/emscripten_hxn4qmrx.rsp
   "/datadisks/disk1/work/B1890944/p/build/emsdk/upstream/bin/wasm-emscripten-finalize" --minimize-wasm-changes -g --dyncalls-i64 --dwarf /datadisks/disk1/work/B1890944/w/A44809CF/e/wasm_build/obj/wasm/for-build/dotnet.wasm -o /datadisks/disk1/work/B1890944/w/A44809CF/e/wasm_build/obj/wasm/for-build/dotnet.wasm --detect-features
   "/datadisks/disk1/work/B1890944/p/build/emsdk/node/14.15.5_64bit/bin/node" /datadisks/disk1/work/B1890944/p/build/emsdk/upstream/emscripten/src/compiler.js /datadisks/disk1/work/B1890944/t/tmpw7z52_k9.txt
   "/datadisks/disk1/work/B1890944/p/build/emsdk/upstream/bin/wasm-opt" --post-emscripten --no-exit-runtime -Oz --low-memory-unused --zero-filled-memory --strip-producers /datadisks/disk1/work/B1890944/w/A44809CF/e/wasm_build/obj/wasm/for-build/dotnet.wasm -o /datadisks/disk1/work/B1890944/w/A44809CF/e/wasm_build/obj/wasm/for-build/dotnet.wasm -g --mvp-features --enable-mutable-globals
emcc : error : '/datadisks/disk1/work/B1890944/p/build/emsdk/upstream/bin/wasm-opt --post-emscripten --no-exit-runtime -Oz --low-memory-unused --zero-filled-memory --strip-producers /datadisks/disk1/work/B1890944/w/A44809CF/e/wasm_build/obj/wasm/for-build/dotnet.wasm -o /datadisks/disk1/work/B1890944/w/A44809CF/e/wasm_build/obj/wasm/for-build/dotnet.wasm -g --mvp-features --enable-mutable-globals' failed (received SIGKILL (-9)) [/datadisks/disk1/work/B1890944/w/A44809CF/e/publish/ProxyProjectForAOTOnHelix.proj]
/datadisks/disk1/work/B1890944/p/build/wasm/WasmApp.Native.targets(369,5): error MSB3073: The command "emcc "@/datadisks/disk1/work/B1890944/p/build/microsoft.netcore.app.runtime.browser-wasm/runtimes/browser-wasm/native/src/emcc-default.rsp" "@/datadisks/disk1/work/B1890944/w/A44809CF/e/wasm_build/obj/wasm/for-build/emcc-link.rsp"" exited with code 1. [/datadisks/disk1/work/B1890944/w/A44809CF/e/publish/ProxyProjectForAOTOnHelix.proj]
XHarness artifacts: /datadisks/disk1/work/B1890944/w/A44809CF/uploads/xharness-output

https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-59295-merge-2cb4a6848b884da6ae/System.Text.Json.Tests/1/console.380b5c81.log?sv=2019-07-07&se=2021-10-08T18%3A00%3A28Z&sr=c&sp=rl&sig=jvWokQtPb4dRituhcL%2FAy82QK5Rz6S7goic8ukzXI5E%3D

CC @layomia

steveharter commented 3 years ago

Also likely causing issue for new logging generator tests. See https://github.com/dotnet/runtime/pull/59092#issuecomment-923219244

radical commented 3 years ago

Also likely causing issue for new logging generator tests. See #59092 (comment)

That is failing when compiling Microsoft.CodeAnalysis.CSharp.dll.bc with emcc.

steveharter commented 3 years ago

That is failing when compiling Microsoft.CodeAnalysis.CSharp.dll.bc with emcc.

Yes to help determine what is causing the failure, there were a few temporary commits where the the new functional code and tests were no longer called, but the CI issue remained.

radekdoulik commented 3 years ago

It might be the linker running out of memory again? Would disabling optimizations in linker -Wl,-lto-O0 help here?

radical commented 3 years ago

This will get fixed by https://github.com/dotnet/runtime/pull/59397 , same as it did on main.

lewing commented 3 years ago

This will get fixed by #59397 , same as it did on main.

https://github.com/dotnet/runtime/pull/59092 is not fixed by that

danmoseley commented 3 years ago

I see this

emcc : error : '/datadisks/disk1/work/B1D309C1/p/build/emsdk/upstream/bin/wasm-opt --post-emscripten --no-exit-runtime -Oz --low-memory-unused --zero-filled-memory --strip-producers /datadisks/disk1/work/B1D309C1/w/99A208F6/e/wasm_build/obj/wasm/for-build/dotnet.wasm -o /datadisks/disk1/work/B1D309C1/w/99A208F6/e/wasm_build/obj/wasm/for-build/dotnet.wasm -g --mvp-features --enable-mutable-globals' failed (received SIGKILL (-9)) [/datadisks/disk1/work/B1D309C1/w/99A208F6/e/publish/ProxyProjectForAOTOnHelix.proj]
/datadisks/disk1/work/B1D309C1/p/build/wasm/WasmApp.Native.targets(374,5): error MSB3073: The command "emcc "@/datadisks/disk1/work/B1D309C1/p/build/microsoft.netcore.app.runtime.browser-wasm/runtimes/browser-wasm/native/src/emcc-default.rsp" "@/datadisks/disk1/work/B1D309C1/w/99A208F6/e/wasm_build/obj/wasm/for-build/emcc-link.rsp"" exited with code 1. [/datadisks/disk1/work/B1D309C1/w/99A208F6/e/publish/ProxyProjectForAOTOnHelix.proj]
XHarness artifacts: /datadisks/disk1/work/B1D309C1/w/99A208F6/uploads/xharness-output

https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-60112-merge-1c763c950edf44f2b0/System.Text.Json.Tests/1/console.0ff5206b.log?sv=2019-07-07&se=2021-10-27T13%3A48%3A50Z&sr=c&sp=rl&sig=Vlj0U9Lm4w1sYqrLDYGVsnvTmUldx6%2B0sr2PWre4tyI%3D Console log: 'System.Text.Json.Tests' from job 1c763c95-0edf-44f2-b01f-ab2dd0c22a82 workitem 64f04465-1537-4154-b8a9-f39b9be56ef2 (ubuntu.1804.amd64.open.svc) executed on machine a001G8K

this ran on 10/7/2021, 8:32:46 AM. from the comments above I'm not sure whether this was expected to be fixed (this is 6.0)

radical commented 3 years ago

wasm-opt getting killed due to OOM: failed (received SIGKILL (-9)). We are seeing more of oom kills recently with STJ, and test projects that use source generators.

In some cases they fail when compiling an assembly's bitcode to wasm .o, and in some it fails when linking (including wasm-opt).

I plan to:

  1. try to detect this case (very roughly), and dump some logs to help with debugging
  2. maybe skip some of the big assemblies that fail when compiling
  3. Explore using different optimization flags to reduce the memory footprint (we did try this a bit, and it didn't help)
krwq commented 3 years ago

reopenning since we've re-disabled tests

lewing commented 3 years ago

@ilyas1974 we are running into this over and over, what can we do to get more memory on the VM.

directhex commented 2 years ago

Struggling to repro this locally. 😰

directhex commented 2 years ago

Oh no, I can reproduce it locally, and learned something very cursed about Docker

directhex commented 2 years ago

Okay. Findings:

vargaz commented 2 years ago

The problem is AOTing roslyn, a linked Microsoft.CodeAnalysis.CSharp.dll still contains 30k methods, AOT creates 80k methods out of this because roslyn contains a lot of valuetypes causing a lot of generic instances to be added, running the aot compiler on this assembly alone takes 40 seconds, the .bc -> .o step takes 10 minutes (!).

vargaz commented 2 years ago

We might want to add some functionality to the build system to exclude certain assemblies like roslyn from AOTing.

radical commented 2 years ago

We might want to add some functionality to the build system to exclude certain assemblies like roslyn from AOTing.

We do have that, as a hack (https://github.com/dotnet/runtime/blob/main/src/tasks/AotCompilerTask/MonoAOTCompiler.cs#L488-L489).

directhex commented 2 years ago

So what's the play, try to shoehorn the test as-is onto Helix, or try to wire up the AOT skip?

vargaz commented 2 years ago

I'd suggest skipping AOTing for these assemblies, they take forever to AOT which also impacts CI performance even if it works.

directhex commented 2 years ago

I skipped AOT, and I have no idea how to parse this failure https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-63958-merge-c3e46af019234e5a83/normal-Microsoft.Extensions.Logging.Generators.Roslyn3.11.Tests/1/console.0b1eb581.log?sv=2019-07-07&se=2022-02-10T15%3A15%3A41Z&sr=c&sp=rl&sig=uuwgEA0aSi%2BkNV%2BWkslo%2FBLzkvgx4guydgtVeRkyidQ%3D

directhex commented 2 years ago

Failure is at https://github.com/dotnet/runtime/blob/9c8b804099ce79f44733c23d4fc59801ab03f86d/src/mono/mono/mini/aot-runtime.c#L2300-L2303 which makes sense

vargaz commented 2 years ago

Is there a PR which skips the AOT step for these assemblies ?

vargaz commented 2 years ago

An alternative would be compiling with O1, i.e.:

    <EmccLinkOptimizationFlag>-O1</EmccLinkOptimizationFlag>

Running with this reduced the total runtime from 16 minutes to 11 minutes on my machine, it might solve the OOM problem as well.

directhex commented 2 years ago

https://github.com/dotnet/runtime/pull/63958

vargaz commented 2 years ago

https://github.com/dotnet/runtime/pull/64294

elinor-fung commented 2 years ago

@directhex @radical - the latest rolling build (which has #63958) still hit this.

Still looks to be an OOM, but I don't know how to get more information about the exact cause.

emcc : error : '/datadisks/disk1/work/A50F091E/p/build/emsdk/upstream/bin/wasm-emscripten-finalize --minimize-wasm-changes -g --dyncalls-i64 --dwarf /datadisks/disk1/work/A50F091E/w/B308094D/e/wasm_build/obj/wasm/for-build/dotnet.wasm -o /datadisks/disk1/work/A50F091E/w/B308094D/e/wasm_build/obj/wasm/for-build/dotnet.wasm --detect-features' failed (received SIGKILL (-9)) [/datadisks/disk1/work/A50F091E/w/B308094D/e/publish/ProxyProjectForAOTOnHelix.proj]
/datadisks/disk1/work/A50F091E/p/build/wasm/WasmApp.Native.targets(404,5): error MSB3073: The command "emcc "@/datadisks/disk1/work/A50F091E/p/build/microsoft.netcore.app.runtime.browser-wasm/runtimes/browser-wasm/native/src/emcc-default.rsp" "@/datadisks/disk1/work/A50F091E/p/build/microsoft.netcore.app.runtime.browser-wasm/runtimes/browser-wasm/native/src/emcc-link.rsp" "@/datadisks/disk1/work/A50F091E/w/B308094D/e/wasm_build/obj/wasm/for-build/emcc-link.rsp"" exited with code 1. [/datadisks/disk1/work/A50F091E/w/B308094D/e/publish/ProxyProjectForAOTOnHelix.proj]
[ 5962.061349] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/helix.service,task=wasm-emscripten,pid=23875,uid=1000
[ 5962.061364] Out of memory: Killed process 23875 (wasm-emscripten) total-vm:5798616kB, anon-rss:5590440kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:11100kB oom_score_adj:0
[ 5962.258803] oom_reaper: reaped process 23875 (wasm-emscripten), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB