Closed radical closed 2 years ago
Tagging subscribers to 'arch-wasm': @lewing See info in area-owners.md if you want to be subscribed.
Author: | radical |
---|---|
Assignees: | - |
Labels: | `arch-wasm`, `area-Codegen-AOT-mono`, `disabled-test`, `wasm-aot-test` |
Milestone: | - |
This probably happens because the code is just too big.
The current failure:
cache:INFO: - ok
"/datadisks/disk1/work/A6A60972/p/build/emsdk/upstream/bin/wasm-ld" @/datadisks/disk1/work/A6A60972/t/emscripten_fqv4w7nk.rsp
1>emcc : error : '/datadisks/disk1/work/A6A60972/p/build/emsdk/upstream/bin/wasm-ld @/datadisks/disk1/work/A6A60972/t/emscripten_fqv4w7nk.rsp' failed (-9) [/datadisks/disk1/work/A6A60972/w/A34508DD/e/publish/AOTTestProjectForHelix.proj]
1>/datadisks/disk1/work/A6A60972/p/build/wasm/WasmApp.Native.targets(278,5): error MSB3073: The command "emcc "@/datadisks/disk1/work/A6A60972/p/build/microsoft.netcore.app.runtime.browser-wasm/runtimes/browser-wasm/native/src/emcc-default.rsp" "@/datadisks/disk1/work/A6A60972/w/A34508DD/e/wasm_build/obj/wasm/emcc-link.rsp"" exited with code 1. [/datadisks/disk1/work/A6A60972/w/A34508DD/e/publish/AOTTestProjectForHelix.proj]
1>Done Building Project "/datadisks/disk1/work/A6A60972/w/A34508DD/e/publish/AOTTestProjectForHelix.proj" (default targets) -- FAILED.
This should be fixed by https://github.com/dotnet/runtime/pull/53603 and the flto changes
I think this is still a problem. Hit this in release/6.0-rc2:
Linking with emcc. This may take a while ...
"/datadisks/disk1/work/B1890944/p/build/emsdk/upstream/bin/wasm-ld" @/datadisks/disk1/work/B1890944/t/emscripten_hxn4qmrx.rsp
"/datadisks/disk1/work/B1890944/p/build/emsdk/upstream/bin/wasm-emscripten-finalize" --minimize-wasm-changes -g --dyncalls-i64 --dwarf /datadisks/disk1/work/B1890944/w/A44809CF/e/wasm_build/obj/wasm/for-build/dotnet.wasm -o /datadisks/disk1/work/B1890944/w/A44809CF/e/wasm_build/obj/wasm/for-build/dotnet.wasm --detect-features
"/datadisks/disk1/work/B1890944/p/build/emsdk/node/14.15.5_64bit/bin/node" /datadisks/disk1/work/B1890944/p/build/emsdk/upstream/emscripten/src/compiler.js /datadisks/disk1/work/B1890944/t/tmpw7z52_k9.txt
"/datadisks/disk1/work/B1890944/p/build/emsdk/upstream/bin/wasm-opt" --post-emscripten --no-exit-runtime -Oz --low-memory-unused --zero-filled-memory --strip-producers /datadisks/disk1/work/B1890944/w/A44809CF/e/wasm_build/obj/wasm/for-build/dotnet.wasm -o /datadisks/disk1/work/B1890944/w/A44809CF/e/wasm_build/obj/wasm/for-build/dotnet.wasm -g --mvp-features --enable-mutable-globals
emcc : error : '/datadisks/disk1/work/B1890944/p/build/emsdk/upstream/bin/wasm-opt --post-emscripten --no-exit-runtime -Oz --low-memory-unused --zero-filled-memory --strip-producers /datadisks/disk1/work/B1890944/w/A44809CF/e/wasm_build/obj/wasm/for-build/dotnet.wasm -o /datadisks/disk1/work/B1890944/w/A44809CF/e/wasm_build/obj/wasm/for-build/dotnet.wasm -g --mvp-features --enable-mutable-globals' failed (received SIGKILL (-9)) [/datadisks/disk1/work/B1890944/w/A44809CF/e/publish/ProxyProjectForAOTOnHelix.proj]
/datadisks/disk1/work/B1890944/p/build/wasm/WasmApp.Native.targets(369,5): error MSB3073: The command "emcc "@/datadisks/disk1/work/B1890944/p/build/microsoft.netcore.app.runtime.browser-wasm/runtimes/browser-wasm/native/src/emcc-default.rsp" "@/datadisks/disk1/work/B1890944/w/A44809CF/e/wasm_build/obj/wasm/for-build/emcc-link.rsp"" exited with code 1. [/datadisks/disk1/work/B1890944/w/A44809CF/e/publish/ProxyProjectForAOTOnHelix.proj]
XHarness artifacts: /datadisks/disk1/work/B1890944/w/A44809CF/uploads/xharness-output
CC @layomia
Also likely causing issue for new logging generator tests. See https://github.com/dotnet/runtime/pull/59092#issuecomment-923219244
Also likely causing issue for new logging generator tests. See #59092 (comment)
That is failing when compiling Microsoft.CodeAnalysis.CSharp.dll.bc
with emcc
.
That is failing when compiling Microsoft.CodeAnalysis.CSharp.dll.bc with emcc.
Yes to help determine what is causing the failure, there were a few temporary commits where the the new functional code and tests were no longer called, but the CI issue remained.
It might be the linker running out of memory again? Would disabling optimizations in linker -Wl,-lto-O0
help here?
This will get fixed by https://github.com/dotnet/runtime/pull/59397 , same as it did on main
.
This will get fixed by #59397 , same as it did on
main
.
https://github.com/dotnet/runtime/pull/59092 is not fixed by that
I see this
emcc : error : '/datadisks/disk1/work/B1D309C1/p/build/emsdk/upstream/bin/wasm-opt --post-emscripten --no-exit-runtime -Oz --low-memory-unused --zero-filled-memory --strip-producers /datadisks/disk1/work/B1D309C1/w/99A208F6/e/wasm_build/obj/wasm/for-build/dotnet.wasm -o /datadisks/disk1/work/B1D309C1/w/99A208F6/e/wasm_build/obj/wasm/for-build/dotnet.wasm -g --mvp-features --enable-mutable-globals' failed (received SIGKILL (-9)) [/datadisks/disk1/work/B1D309C1/w/99A208F6/e/publish/ProxyProjectForAOTOnHelix.proj]
/datadisks/disk1/work/B1D309C1/p/build/wasm/WasmApp.Native.targets(374,5): error MSB3073: The command "emcc "@/datadisks/disk1/work/B1D309C1/p/build/microsoft.netcore.app.runtime.browser-wasm/runtimes/browser-wasm/native/src/emcc-default.rsp" "@/datadisks/disk1/work/B1D309C1/w/99A208F6/e/wasm_build/obj/wasm/for-build/emcc-link.rsp"" exited with code 1. [/datadisks/disk1/work/B1D309C1/w/99A208F6/e/publish/ProxyProjectForAOTOnHelix.proj]
XHarness artifacts: /datadisks/disk1/work/B1D309C1/w/99A208F6/uploads/xharness-output
https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-60112-merge-1c763c950edf44f2b0/System.Text.Json.Tests/1/console.0ff5206b.log?sv=2019-07-07&se=2021-10-27T13%3A48%3A50Z&sr=c&sp=rl&sig=Vlj0U9Lm4w1sYqrLDYGVsnvTmUldx6%2B0sr2PWre4tyI%3D
Console log: 'System.Text.Json.Tests' from job 1c763c95-0edf-44f2-b01f-ab2dd0c22a82 workitem 64f04465-1537-4154-b8a9-f39b9be56ef2 (ubuntu.1804.amd64.open.svc) executed on machine a001G8K
this ran on 10/7/2021, 8:32:46 AM
. from the comments above I'm not sure whether this was expected to be fixed (this is 6.0)
wasm-opt
getting killed due to OOM: failed (received SIGKILL (-9))
. We are seeing more of oom kills recently with STJ, and test projects that use source generators.
In some cases they fail when compiling an assembly's bitcode to wasm .o
, and in some it fails when linking (including wasm-opt).
I plan to:
reopenning since we've re-disabled tests
@ilyas1974 we are running into this over and over, what can we do to get more memory on the VM.
Struggling to repro this locally. 😰
Oh no, I can reproduce it locally, and learned something very cursed about Docker
Okay. Findings:
wasm-opt
, requires more memory than Clang - 8.5 gig.The problem is AOTing roslyn, a linked Microsoft.CodeAnalysis.CSharp.dll still contains 30k methods, AOT creates 80k methods out of this because roslyn contains a lot of valuetypes causing a lot of generic instances to be added, running the aot compiler on this assembly alone takes 40 seconds, the .bc -> .o step takes 10 minutes (!).
We might want to add some functionality to the build system to exclude certain assemblies like roslyn from AOTing.
We might want to add some functionality to the build system to exclude certain assemblies like roslyn from AOTing.
We do have that, as a hack (https://github.com/dotnet/runtime/blob/main/src/tasks/AotCompilerTask/MonoAOTCompiler.cs#L488-L489).
So what's the play, try to shoehorn the test as-is onto Helix, or try to wire up the AOT skip?
I'd suggest skipping AOTing for these assemblies, they take forever to AOT which also impacts CI performance even if it works.
I skipped AOT, and I have no idea how to parse this failure https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-63958-merge-c3e46af019234e5a83/normal-Microsoft.Extensions.Logging.Generators.Roslyn3.11.Tests/1/console.0b1eb581.log?sv=2019-07-07&se=2022-02-10T15%3A15%3A41Z&sr=c&sp=rl&sig=uuwgEA0aSi%2BkNV%2BWkslo%2FBLzkvgx4guydgtVeRkyidQ%3D
Failure is at https://github.com/dotnet/runtime/blob/9c8b804099ce79f44733c23d4fc59801ab03f86d/src/mono/mono/mini/aot-runtime.c#L2300-L2303 which makes sense
Is there a PR which skips the AOT step for these assemblies ?
An alternative would be compiling with O1, i.e.:
<EmccLinkOptimizationFlag>-O1</EmccLinkOptimizationFlag>
Running with this reduced the total runtime from 16 minutes to 11 minutes on my machine, it might solve the OOM problem as well.
@directhex @radical - the latest rolling build (which has #63958) still hit this.
Still looks to be an OOM, but I don't know how to get more information about the exact cause.
emcc : error : '/datadisks/disk1/work/A50F091E/p/build/emsdk/upstream/bin/wasm-emscripten-finalize --minimize-wasm-changes -g --dyncalls-i64 --dwarf /datadisks/disk1/work/A50F091E/w/B308094D/e/wasm_build/obj/wasm/for-build/dotnet.wasm -o /datadisks/disk1/work/A50F091E/w/B308094D/e/wasm_build/obj/wasm/for-build/dotnet.wasm --detect-features' failed (received SIGKILL (-9)) [/datadisks/disk1/work/A50F091E/w/B308094D/e/publish/ProxyProjectForAOTOnHelix.proj]
/datadisks/disk1/work/A50F091E/p/build/wasm/WasmApp.Native.targets(404,5): error MSB3073: The command "emcc "@/datadisks/disk1/work/A50F091E/p/build/microsoft.netcore.app.runtime.browser-wasm/runtimes/browser-wasm/native/src/emcc-default.rsp" "@/datadisks/disk1/work/A50F091E/p/build/microsoft.netcore.app.runtime.browser-wasm/runtimes/browser-wasm/native/src/emcc-link.rsp" "@/datadisks/disk1/work/A50F091E/w/B308094D/e/wasm_build/obj/wasm/for-build/emcc-link.rsp"" exited with code 1. [/datadisks/disk1/work/A50F091E/w/B308094D/e/publish/ProxyProjectForAOTOnHelix.proj]
[ 5962.061349] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/helix.service,task=wasm-emscripten,pid=23875,uid=1000
[ 5962.061364] Out of memory: Killed process 23875 (wasm-emscripten) total-vm:5798616kB, anon-rss:5590440kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:11100kB oom_score_adj:0
[ 5962.258803] oom_reaper: reaped process 23875 (wasm-emscripten), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
~System.Text.Json.Tests~
Microsoft.Extensions.Logging.Generators.Roslyn3.11.Tests, log
The binlog should have details of why this failed. The binlog is included in the helix data.