dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Wasm-opt can take way too long #100424

Open sbomer opened 7 months ago

sbomer commented 7 months ago

I'm seeing this job time out during the "Build product" step, in a way that doesn't get reported to GitHub. On GitHub it looks like the job is still running forever.

Hit in multiple PRs, for example:

Build Information

Build: https://dev.azure.com/dnceng-public/public/_build/results?buildId=622511&view=logs&jobId=63c2d0c8-fec2-5788-81c8-f3ac95e8841f
Build error leg or test failing: browser-wasm linux Release LibraryTests Build product

Error Message

Fill the error message using step by step known issues guidance.

{
  "ErrorMessage": "Agent was purged, cancelling the pipeline",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}

Known issue validation

Build: 🔎 https://dev.azure.com/dnceng-public/public/_build/results?buildId=622511
Error message validated: [Agent was purged, cancelling the pipeline]
Result validation: ❌ Known issue did not match with the provided build.
Validation performed at: 4/1/2024 10:15:39 PM UTC

Report

Summary

| 24-Hour Hit Count | 7-Day Hit Count | 1-Month Count |
| --- | --- | --- |
| 0 | 0 | 0 |
agocke commented 7 months ago

@sbomer Looks like this didn't work: ❌ Known issue did not match with the provided build.

I can't find the string in the test log.

vcsjones commented 7 months ago

Based on the logs in one of my PRs: https://helixre107v0xdcypoyl9e7f.blob.core.windows.net/dotnet-runtime-refs-pull-100419-merge-91eb0ec8650642f7b5/WasmTestOnBrowser-System.Runtime.InteropServices.JavaScript.Tests/1/console.cb20659c.log?helixlogtype=result

Maybe "fail: Tests timed out. Killing driver service pid" is a better ErrorMessage.
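
For reference, a known-issue entry using that string would presumably look like the sketch below, mirroring the JSON schema already shown in the issue body; whether the known-issues infrastructure would actually match it against the Helix console log is an open question.

```json
{
  "ErrorMessage": "fail: Tests timed out. Killing driver service pid",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}
```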

vcsjones commented 7 months ago

Actually I think this is a duplicate of https://github.com/dotnet/runtime/issues/99888. Build analysis correctly picked that one up for my PR.

sbomer commented 7 months ago

This is not about the test failure in "browser-wasm linux Release LibraryTests_Threading" (build analysis identified that as https://github.com/dotnet/runtime/issues/99888), but about the product build failure in "browser-wasm linux Release LibraryTests":

[Screenshot 2024-03-28 at 15:10:32]
lewing commented 7 months ago

Given the way it is failing, this is probably an infrastructure issue.

riarenas commented 7 months ago

https://helix.dot.net/BuildAnalysis/SearchTimeline?error=Agent%20was%20purged,%20cancelling%20the%20pipeline&dateType=Day(s)&dateValue=7&pageNumber=3

This seems to show that it is a runtime-specific issue. Instances are consistently happening in the wasm legs (along with a few other stray disconnects).

(Maybe you can try changing the error message to "Agent was purged, cancelling the pipeline"? But I'm not sure whether the known-issues infra will pick that up, since it's only surfaced as a warning.)

sbomer commented 7 months ago

Thanks for the suggestion; it didn't seem to work. :(

kg commented 7 months ago

I'll note that when I watched the stdout of the build step for one of my PRs, it was still building even after 2 hours, so the builds for this lane seem to be extremely slow. For comparison, the equivalent Windows lane seems to build in around 40 minutes.
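
If wasm-opt itself is suspected as the long pole, one rough way to check is to time it directly against the produced module. This is only a sketch, assuming the Binaryen wasm-opt CLI is on PATH; the file path and optimization levels below are hypothetical, not necessarily what the runtime build invokes.

```bash
# Hypothetical input path; the actual artifact name/location in the runtime
# build may differ.
WASM=artifacts/bin/dotnet.native.wasm

# Compare a typical optimization level against the no-optimization baseline
# to see how much wall-clock time wasm-opt itself accounts for.
time wasm-opt -O2 "$WASM" -o /tmp/out-O2.wasm
time wasm-opt -O0 "$WASM" -o /tmp/out-O0.wasm
```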

pavelsavara commented 7 months ago

> Actually I think this is a duplicate of #99888. Build analysis correctly picked that one up for my PR.

#99888 is about a test run on Helix; it's not related to this.

This is a build/agent issue, probably infra. AzDO kills the agent and deletes the log file!

Like this: [screenshot]

This is before the agent was killed; something is terribly slow: [screenshot]

The running step is CPU- and memory-heavy; maybe the agent is also swapping to disk.

This problem is blocking me.
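
A minimal sketch of what one could run on the agent, if shell access were available, to confirm the swapping hypothesis; these are standard Linux tools rather than anything the pipeline currently collects.

```bash
# Snapshot of memory and swap usage on the agent.
free -h

# Sample system activity every 5 seconds for 5 samples; sustained non-zero
# si/so (swap-in/swap-out) columns would indicate the build is hitting swap.
vmstat 5 5
```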

pavelsavara commented 7 months ago

What would it look like if the agent failed with OOM? We had similar issues on Helix last year, but Ankit got valid log files about it; see https://github.com/dotnet/runtime/issues/51961.

ilonatommy commented 7 months ago

If we are unlucky, there are no logs at all: [screenshot] https://dev.azure.com/dnceng-public/public/_build/results?buildId=626743&view=logs&jobId=d4e38924-13a0-58bd-9074-6a4810543e7c&j=d4e38924-13a0-58bd-9074-6a4810543e7c&t=102a6595-1420-53fc-8f17-b0a3f4b1242a

lewing commented 7 months ago

The cancellations are very strange; Steve and I watched it happen in a recent run.

cc @steveisok

rzikm commented 7 months ago

I have seen the same issue on other CI legs as well:

[screenshot]

riarenas commented 7 months ago

From the engineering services side, we don't have a lot more details on what is going on. While I try to track down more information on what is happening to these agents and why these workloads break them in such a way, I've reverted the build.ubuntu.2204.amd64.open image to the previous version.

Whether that helps or not will be a useful data point.

kg commented 7 months ago

Possibly related: https://dev.azure.com/dnceng-public/public/_build/results?buildId=627899&view=logs&jobId=f4616eca-4cb2-53ea-b86a-d7a1235a32a0 https://helixre107v0xdcypoyl9e7f.blob.core.windows.net/dotnet-runtime-refs-pull-100386-merge-2eeaae49b1db47e284/LibraryImportGenerator.Tests/1/console.ef4e1575.log?helixlogtype=result

/root/helix/work/workitem/e /root/helix/work/workitem/e
  Discovering: LibraryImportGenerator.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  LibraryImportGenerator.Tests (found 124 of 128 test cases)
  Starting:    LibraryImportGenerator.Tests (parallel test collections = on [2 threads], stop on fail = off)
/root/helix/work/workitem/e
./RunTests.sh: line 179:    25 Killed                  "$RUNTIME_PATH/dotnet" exec --runtimeconfig LibraryImportGenerator.Tests.runtimeconfig.json --depsfile LibraryImportGenerator.Tests.deps.json xunit.console.dll LibraryImportGenerator.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing $RSP_FILE
----- end Tue Apr 2 23:28:13 UTC 2024 ----- exit code 137 ----------------------------------------------------------
exit code 137 means SIGKILL Killed either due to out of memory/resources (see /var/log/messages) or by explicit kill.

Does this indicate OOM?
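
Exit code 137 is 128 + 9, i.e. the process was terminated with SIGKILL. A quick way to check whether the kernel OOM killer was responsible, assuming access to the machine's kernel log (or the /var/log/messages file the script mentions), is something like:

```bash
# OOM-killer activity shows up in the kernel log with the name of the killed
# process and the memory state at the time.
dmesg -T | grep -iE 'out of memory|oom-killer|killed process'

# On systemd-based images the same information is available from the journal.
journalctl -k | grep -i oom
```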

pavelsavara commented 7 months ago

I merged https://github.com/dotnet/runtime/pull/100517 to reduce/avoid the problem.