Open · sbomer opened 7 months ago
@sbomer Looks like this didn't work: ❌ Known issue did not match with the provided build.
I can't find the string in the test log.
Based on the logs in one of my PRs: https://helixre107v0xdcypoyl9e7f.blob.core.windows.net/dotnet-runtime-refs-pull-100419-merge-91eb0ec8650642f7b5/WasmTestOnBrowser-System.Runtime.InteropServices.JavaScript.Tests/1/console.cb20659c.log?helixlogtype=result
Maybe `fail: Tests timed out. Killing driver service pid` is a better ErrorMessage.
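For what it's worth, one way to sanity-check whether a candidate ErrorMessage string actually appears in a log is to grep the raw Helix console log directly (a minimal sketch; the URL and string are the ones quoted above, and this only checks the console log, not every log the known-issues matching may look at):

```bash
# Fetch the raw Helix console log and count occurrences of the candidate string.
# -F treats the pattern as a fixed string rather than a regex.
LOG_URL='https://helixre107v0xdcypoyl9e7f.blob.core.windows.net/dotnet-runtime-refs-pull-100419-merge-91eb0ec8650642f7b5/WasmTestOnBrowser-System.Runtime.InteropServices.JavaScript.Tests/1/console.cb20659c.log?helixlogtype=result'
curl -s "$LOG_URL" | grep -c -F 'fail: Tests timed out. Killing driver service pid'
```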
Actually I think this is a duplicate of https://github.com/dotnet/runtime/issues/99888. Build analysis correctly picked that one up for my PR.
This is not about the test failure in "browser-wasm linux Release LibraryTests_Threading" (build analysis identified that as https://github.com/dotnet/runtime/issues/99888), but about the product build failure in "browser-wasm linux Release LibraryTests":
Given the way it is failing, this is probably an infrastructure issue
This seems to show that it is a runtime-specific issue; the instances are consistently happening in the wasm legs (along with catching a few other stray disconnects).
(Maybe you can try to change the error message to `Agent was purged, cancelling the pipeline`? I'm not sure whether that will be picked up by the known-issues infra, though, since it's only being surfaced as a warning.)
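If it helps, one could also check whether that text shows up in any of the build's log files at all, using the public Azure DevOps REST API (a sketch; assumes `curl` and `jq` are available and that anonymous GET access works for the dnceng-public project; if the text only appears as a timeline warning and not in a log file, this finds nothing, which would itself be a useful data point):

```bash
BUILD_ID=622511
BASE="https://dev.azure.com/dnceng-public/public/_apis/build/builds/$BUILD_ID/logs"

# Enumerate the build's log files, then grep each one for the warning text.
for id in $(curl -s "$BASE?api-version=7.0" | jq -r '.value[].id'); do
  if curl -s "$BASE/$id?api-version=7.0" | grep -q -F 'Agent was purged, cancelling the pipeline'; then
    echo "found in log $id"
  fi
done
```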
Thanks for the suggestion, didn't seem to work. :(
I'll note that when I watched the stdout of the build step for one of my PRs, it was still building after 2 hours, so the builds for this lane seem to be extremely slow. For comparison, the equivalent Windows lane seems to build in around 40 minutes.
This is a build/agent issue, probably infra. AzDO kills the agent and deletes the log file!
Like this:
This is before the agent was killed; something is terribly slow.
The running step is CPU- and memory-heavy; maybe the agent is also swapping to disk.
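If someone could get a shell on one of these agents while the step is running, a few standard commands would show whether it is under memory pressure or actively swapping (a sketch; assumes a typical Linux agent with these utilities available):

```bash
# Memory and swap usage at a glance.
free -h

# Which swap devices are configured and how full they are.
swapon --show

# Sample paging activity: non-zero si/so columns mean pages are being
# swapped in/out, i.e. the machine is actively thrashing.
vmstat 5 3
```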
This problem is blocking me
What would it look like if the agent failed with OOM? We had similar issues on Helix last year, but Ankit got valid log files about it; see https://github.com/dotnet/runtime/issues/51961.
The cancellations are very strange. Steve and I watched it happen in a recent run.
cc @steveisok
I have seen the same issue on other CI legs as well
From the engineering services side, we don't have a lot more details on what is going on. While I try to track down some information on what is happening to these agents and why these workloads would break them in such a way, I've reverted the `build.ubuntu.2204.amd64.open` image to the previous version.
Whether that helps or not will be a useful data point.
Possibly related:
https://dev.azure.com/dnceng-public/public/_build/results?buildId=627899&view=logs&jobId=f4616eca-4cb2-53ea-b86a-d7a1235a32a0
https://helixre107v0xdcypoyl9e7f.blob.core.windows.net/dotnet-runtime-refs-pull-100386-merge-2eeaae49b1db47e284/LibraryImportGenerator.Tests/1/console.ef4e1575.log?helixlogtype=result
/root/helix/work/workitem/e /root/helix/work/workitem/e
Discovering: LibraryImportGenerator.Tests (method display = ClassAndMethod, method display options = None)
Discovered: LibraryImportGenerator.Tests (found 124 of 128 test cases)
Starting: LibraryImportGenerator.Tests (parallel test collections = on [2 threads], stop on fail = off)
/root/helix/work/workitem/e
./RunTests.sh: line 179: 25 Killed "$RUNTIME_PATH/dotnet" exec --runtimeconfig LibraryImportGenerator.Tests.runtimeconfig.json --depsfile LibraryImportGenerator.Tests.deps.json xunit.console.dll LibraryImportGenerator.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing $RSP_FILE
----- end Tue Apr 2 23:28:13 UTC 2024 ----- exit code 137 ----------------------------------------------------------
exit code 137 means SIGKILL Killed either due to out of memory/resources (see /var/log/messages) or by explicit kill.
Does this indicate OOM?
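Exit code 137 is 128 + 9, i.e. the process was killed with SIGKILL, which is consistent with the OOM killer but doesn't prove it. If we had shell access to the machine, the kernel log would settle it (a sketch; assumes a Linux machine we can actually log into):

```bash
# Signal 9 is SIGKILL; exit codes above 128 encode 128 + signal number.
kill -l 9    # prints KILL

# Look for OOM-killer activity in the kernel log.
dmesg | grep -iE 'out of memory|killed process'
# On systemd-based images the same information is in the journal:
journalctl -k | grep -i oom
```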
I merged https://github.com/dotnet/runtime/pull/100517 to reduce/avoid the problem
I'm seeing this job time out during the "Build product" step, in a way that doesn't get reported to GitHub. On GitHub it looks like the job is still running forever.
Hit in multiple PRs, for example:
Build Information
Build: https://dev.azure.com/dnceng-public/public/_build/results?buildId=622511&view=logs&jobId=63c2d0c8-fec2-5788-81c8-f3ac95e8841f
Build error leg or test failing: browser-wasm linux Release LibraryTests Build product
Error Message
Fill the error message using step by step known issues guidance.
Known issue validation
Build: :mag_right: https://dev.azure.com/dnceng-public/public/_build/results?buildId=622511
Error message validated: [Agent was purged, cancelling the pipeline]
Result validation: :x: Known issue did not match with the provided build.
Validation performed at: 4/1/2024 10:15:39 PM UTC
Report
Summary