dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Slow "Mono llvmfullaot Pri0 Runtime Tests Run Linux arm64 release" #65626

Open EgorBo opened 2 years ago

EgorBo commented 2 years ago

Mono llvmfullaot Pri0 Runtime Tests Run Linux arm64 release takes around 2.5 hours to finish.

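There are some interesting anomalies in the logs (I checked various runs), e.g. this screenshot: https://user-images.githubusercontent.com/523221/154822948-ea049c01-fc25-4694-a412-5612fe30354e.png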

It says that prejitting of a single managed assembly Microsoft.Win32.SystemEvents.dll takes almost 10 minutes 😮 (mostly in LLVM's opt+llc)

I parsed the output into an Excel table: https://user-images.githubusercontent.com/523221/154823135-00d9df01-b948-436b-aaf6-3a13ff4b4f94.png

Is it possible to move some libs/tests to the outerloop, e.g. the JIT/Methodical/MDArray/GaussJordan/classarr_cs_do/classarr_cs_do test? I also think we need to figure out what exactly makes Microsoft.Win32.SystemEvents.dll take so long to prejit, since there is not much in it.

cc @akoeplinger @vargaz @steveisok

dotnet-issue-labeler[bot] commented 2 years ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost commented 2 years ago

Tagging subscribers to this area: @directhex See info in area-owners.md if you want to be subscribed.

Author: EgorBo
Assignees: -
Labels: `untriaged`, `area-Infrastructure-mono`
Milestone: -

steveisok commented 2 years ago

Adding @SamMonoRT

vargaz commented 2 years ago

Yes, these are very slow because they run opt+llc on unlinked assemblies.

agocke commented 2 years ago

This test run is pretty regularly timing out -- can we get someone to investigate if the slowness is a bug, or if we need to adjust the timeout?

SamMonoRT commented 2 years ago

> This test run is pretty regularly timing out -- can we get someone to investigate if the slowness is a bug, or if we need to adjust the timeout?

This PR (https://github.com/dotnet/runtime/pull/66157) should help ease the timeouts seen in the last couple of weeks. Even with that fix, the lane takes 2.5+ hours. We are still discussing this, but we might want to 1. exclude certain long-running tests from PR runs in this lane, 2. extend the timeout to stabilize CI in the short term.

agocke commented 2 years ago

@SamMonoRT which PR?

SamMonoRT commented 2 years ago

https://github.com/dotnet/runtime/pull/66157

agocke commented 2 years ago

Looks like that resolved the problem. I'm going to close this out for now.

EgorBo commented 1 year ago

It doesn't look fixed to me. Every time this job is triggered it takes 4-5 hours, e.g. https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_apis/build/builds/146722/logs/1353 (from https://github.com/dotnet/runtime/pull/81094)

Since it's not an optional pipeline, I think either it has to be made optional, or not all of the tests should be precompiled with AOT.

EgorBo commented 1 year ago

I wrote a quick parser for the output (for today's PR ^) and sorted the assemblies by the time it takes to run LLVM (opt and llc) on them:

[image: assemblies sorted by LLVM (opt and llc) time]
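
For anyone who wants to repeat this kind of analysis, here is a minimal sketch of such a parser. It is not the script used above: the log line format matched by `LINE_RE` and the function name `llvm_time_per_assembly` are assumptions made for illustration, and the pattern would need to be adapted to what the AOT compiler actually prints.

```python
import re
from collections import defaultdict

# ASSUMPTION: a hypothetical log line format, one line per tool invocation, e.g.
#   "[Some.Assembly.dll] opt: 312.4 s"
# The real AOT compiler output may look different; adjust the pattern to match it.
LINE_RE = re.compile(
    r"^\[(?P<assembly>[^\]]+)\]\s+(?P<tool>opt|llc):\s+(?P<seconds>\d+(?:\.\d+)?)\s*s$"
)


def llvm_time_per_assembly(lines):
    """Sum opt and llc seconds per assembly and return the slowest first."""
    totals = defaultdict(float)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:  # lines that do not match the pattern are ignored
            totals[match["assembly"]] += float(match["seconds"])
    # Sort by accumulated time, largest first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Feeding it the lines of a downloaded build log (for example `llvm_time_per_assembly(open(path))`, where `path` is a local copy of the log) would give a list of `(assembly, seconds)` pairs that can be pasted into a spreadsheet.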

EgorBo commented 1 year ago

E.g. just by moving AdvSimd tests alone to an outerloop pipeline we can save ~30 minutes (4 dlls)
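
An estimate like this can be computed from per-assembly timings along the following lines. This is only a sketch: the helper name `estimated_savings`, the glob pattern, and the shape of the input dictionary are assumptions for illustration, and it assumes the assemblies are compiled one after another so that their times simply add up.

```python
from fnmatch import fnmatchcase


def estimated_savings(seconds_by_assembly, pattern):
    """Return (matching assembly names, total seconds) for a glob pattern.

    ASSUMPTION: seconds_by_assembly maps an assembly file name to the LLVM
    (opt + llc) time in seconds, and compilation is sequential, so removing
    an assembly saves exactly its own time.
    """
    matched = {
        name: secs
        for name, secs in seconds_by_assembly.items()
        if fnmatchcase(name, pattern)  # case-sensitive glob match
    }
    return sorted(matched), sum(matched.values())
```

Calling it with a pattern such as `"AdvSimd*"` on real measurements would list the assemblies that would move to the outerloop and the total time they currently cost per run.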

steveisok commented 1 year ago

> E.g. just by moving AdvSimd tests alone to an outerloop pipeline we can save ~30 minutes (4 dlls)

I think I'd rather move the whole thing out and then analyze what we can run per PR.

steveisok commented 1 year ago

@EgorBo thanks for putting together the updated list!

SingleAccretion commented 1 year ago

Wanted to mention that we should be careful to leave enough testing on PRs to reliably catch failures introduced by adding new Jit tests. In my experience these are not uncommon.

SamMonoRT commented 9 months ago

@kotlarmilos @vitek-karas - not sure whether your team owns this now, or what remains to be done here. Could you please re-assign as appropriate?

steveisok commented 9 months ago

I'll take this, as it likely has to do with the AOT compiler performance itself.

agocke commented 9 months ago

My general philosophy is "PR is for fast, reliable tests", so I agree with the idea of moving everything out, then moving back in the things that meet those criteria. Ideally we can find the sweet spot of fast + high confidence in finding bugs.