P.S. I tried to check whether the failures are confined to a single machine, but I found at least two: DDARM64-056 and DDARM64-110.
Build | Pull Request | Test Failure Count |
---|---|---|
#521184 | #32265 | 1 |
#521210 | Rolling | 2 |
#521211 | #32227 | 2 |
#521254 | Rolling | 2 |
#521323 | #32292 | 3 |
#521405 | #32265 | 2 |
#521426 | #32227 | 2 |
#521515 | #32299 | 2 |
#521519 | Rolling | 2 |
#521556 | #32302 | 2 |
Build | Pull Request | Console | Core | Test Results |
---|---|---|---|---|
#521184 | #32265 | | | |
#521210 | Rolling | console.7f2a7422.log | | |
#521210 | Rolling | console.694e6413.log | | |
#521211 | #32227 | console.653c707e.log | | |
#521211 | #32227 | | | |
#521254 | Rolling | console.e360eac4.log | | |
#521254 | Rolling | console.f1862260.log | | |
#521323 | #32292 | console.7ded7a33.log | | |
#521323 | #32292 | | | |
#521323 | #32292 | console.88e0d943.log | | |
#521405 | #32265 | console.5c11eead.log | | |
#521405 | #32265 | | | |
#521426 | #32227 | | | |
#521426 | #32227 | | | |
#521515 | #32299 | console.9b5c908e.log | | |
#521515 | #32299 | console.123bd29d.log | | |
#521519 | Rolling | console.677db533.log | | |
#521519 | Rolling | | | |
#521556 | #32302 | console.1fc286e8.log | | |
#521556 | #32302 | console.30a0b3cc.log | | |
This is affecting a large fraction of the PRs. I am trying to disable these tests in https://github.com/dotnet/runtime/pull/32372 until this is fixed.
Disabling the mcc tests moved the failure to the next work item. This means that the failure is not specific to the mcc tests; they just happen to be the victim due to ordering.
@trylek Can we disable the ARM runs until this is fixed?
Submitted a PR to disable coreclr's test execution on ARM: https://github.com/dotnet/runtime/pull/32404
Tomas already disabled them: https://github.com/dotnet/runtime/commit/d9bc547b2eaa31ae9fc7db470bdaeec321676458
FWIW, one thing occurred to me during my chat with Viktor yesterday: when I was standing up the queue of Galaxy Book laptops for .NET Native testing about 1 1/2 years ago, I was hitting weird reliability issues that I later found were caused by the Windows installation on these laptops continually spewing internal crash dumps onto the relatively small HDD, which soon overflowed.
I ended up talking to some Watson folks who recommended setting a magic environment variable, and that ultimately fixed it. I'm not saying this is necessarily the cause here, but I can easily imagine that some of the weird symptoms, like the absence of any relationship to a particular workload or the non-deterministic absence of logs, could be explained by a lack of disk space.
Adding link to the related older item for reference: https://github.com/dotnet/runtime/issues/1097
Sorry for joining the party late, I am taking a look now.
No worries @MattGal, I have launched a fresh new run re-enabling the Windows ARM32 job so I should have a new set of results available shortly:
#32819
I don't think I actually need the new run, the most recent one died from running out of disk space. Investigation continues.
I spent some time pondering JIT.jit64.mcc and the logs. It's clear that the problem is that we never really expected 3+ GB work item payloads, but we can make it work. Note that if it's slow to unzip on my computer, it's slow to unzip on the Helix laptops. Sample log.
Work item payload zip: 693 MB (709,428 KB). Correlation payload zips: the only big one is 248 MB (254,382 KB).
That's ~941 MB zipped (963,810 KB), and these zips have to stay on disk until the work is finished.
Once unpacked, the work item zip goes from 693 MB to 3.382 GB (3,571,486,720 bytes).
Because a work item a) might get rerun and b) might munge its own directory, we keep two copies of this and re-copy from the "unzip" folder to the "exec" folder every time.
The correlation payload zip goes from 248 MB to 848 MB unpacked.
That means just having this work item unpacked eats 3.382 + 3.382 + 0.848 + 0.963 GB = 8.575 GB for just the work item, before counting logs, dumps, etc.
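For anyone who wants to re-run this arithmetic with different payload sizes, here is a tiny back-of-envelope sketch in Python using the numbers quoted above (the doubling models the separate "unzip" and "exec" copies described in this comment; it is an illustration, not Helix code):

```python
# Rough disk footprint of one unpacked JIT.jit64.mcc work item on a Helix machine.
# All sizes are the ones quoted in this thread; adjust to model other work items.

workitem_unpacked_gb = 3.382     # work item payload after unzip
correlation_unpacked_gb = 0.848  # correlation payload after unzip
zipped_payloads_gb = 0.963       # ~941 MB of zips that must stay on disk until the run finishes

# The unpacked work item exists twice: once in the "unzip" folder and once in
# the "exec" folder it gets re-copied to for every (re)run.
footprint_gb = 2 * workitem_unpacked_gb + correlation_unpacked_gb + zipped_payloads_gb
print(f"approx. footprint, excluding logs and dumps: {footprint_gb:.3f} GB")  # -> 8.575 GB
```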
Things we should pursue:
I believe that @echesakovMSFT was working on partitioning CoreCLR tests into chunks. I remember from my .NET Native test migration to Helix that we ended up with vastly different characteristics of Intel vs. ARM work items in terms of size. During our chat in Redmond @jashook mentioned that the current design is very inflexible in terms of adjustable work item sizes. If this turns out to be a crucial factor for ARM testing, we might want to rethink some of the infra logic with new goals in mind like clean Mono support or tagging tests for OS independence.
@trylek If we need to solve this issue now, we can specify a finer partitioning of the JIT.jit64.mcc work item in src\coreclr\tests\testgrouping.proj. Since it's an MSBuild file, you can also put conditions on $(BuildOS) and $(BuildArch) and limit this partitioning to win-arm targets.
It's also doable to have a separate partitioning scheme for each combination of $(BuildOS) x $(BuildArch) if needed, so I am not sure what @jashook means by "very inflexible in terms of adjustable work item sizes". On the contrary, it's quite flexible: you can have a work item consisting of a single test as well as a work item consisting of tests from multiple directories. However, figuring out the right partitioning scheme is hard, especially when you want to minimize not only the time each work item takes to run but also the time it takes to upload/download a work item payload and unpack it on a Helix machine. This is why we have done this tuning only for x64 testing.
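To illustrate why finding a good scheme is non-trivial, here is a minimal sketch (hypothetical; the real grouping is declared in testgrouping.proj as MSBuild items, not computed like this) of a greedy size-balancing pass that spreads test groups across a fixed number of work items:

```python
import heapq

def partition_by_size(tests, n_workitems):
    """Greedily assign (name, size_mb) pairs to n_workitems buckets,
    always placing the next-largest group into the currently smallest bucket."""
    buckets = [(0, i, []) for i in range(n_workitems)]  # (total_mb, bucket_index, names)
    heapq.heapify(buckets)
    for name, size in sorted(tests, key=lambda t: t[1], reverse=True):
        total, idx, names = heapq.heappop(buckets)
        names.append(name)
        heapq.heappush(buckets, (total + size, idx, names))
    return sorted(buckets, key=lambda b: b[1])

# Hypothetical per-group sizes in MB; not real measurements.
tests = [("mcc_group_a", 120), ("mcc_group_b", 95), ("mcc_group_c", 140),
         ("mcc_group_d", 60), ("mcc_group_e", 75), ("mcc_group_f", 110)]

for total, idx, names in partition_by_size(tests, n_workitems=3):
    print(f"work item {idx}: ~{total} MB -> {names}")
```

This balances only on-disk size; balancing run time and upload/unpack time at the same time is exactly the multi-objective problem described above.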
@MattGal By the way, if I remember right, the work item size is why JIT/jit64 was split into multiple work items in the first place. All the test artifacts in the jit64 directory on Windows take roughly 6 GB, and when we were bringing up Helix testing in coreclr this was too much even for x64.
@echesakovMSFT thanks for clarifying. Once we fix up the machines they should have a lot more space, so you may not have to change anything, but unless it results in duplicating content across work item payloads, more, smaller work items will generally make it through Helix faster.
I've fixed up these machines so their work directory is on the disk with 60 GB free, so you now have about 50 more GB to play with in the work directory. Do note that with payloads this big, downloading and unzipping them is going to be a non-trivial part of their execution time; there's not much we can do about that.
@trylek can you kick off a fresh run?
Thanks @MattGal. Closed & reopened the PR; the results are kind of weird: the summary in the PR indicates that the Windows legs are still running, but in Azure it shows they failed. For the Windows ARM run, if I read the log correctly, it claims that it lost connection to the pool.
CoreCLR Pri0 Test Run Windows_NT arm checked:
##[error]We stopped hearing from agent NetCorePublic-Pool 8. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error. For more information, see: https://go.microsoft.com/fwlink/?linkid=846610 Pool: NetCorePublic-Pool Agent: NetCorePublic-Pool 8 Started: Yesterday at 10:16 PM Duration: 1h 31m 13s Job preparation parameters 7 queue time variables used
This is super weird. For now, let's just retry them and if it comes back I can investigate.
@MattGal, is there any way to double-check whether this is a one-off issue or a problem with a particular machine? This is the first occurrence in about 5 days, so I'm not that scared yet, but if this starts reproing on a more regular basis I'll be strongly pushed to disable the ARM runs again. Thanks a lot!
Yes, actually; it's not terribly hard to use Kusto queries to see if a particular machine is an outlier for your work, given enough work items. I'll take a peek.
My understanding here is that most of the "fix" was basically a refactoring of payloads so they're not 700+ MB per work item; if that regressed on your side it could be relevant.
The work item Viktor linked failed due to downloading and unpacking its payload filling the disk... so the "good" part here is that network speed isn't the problem (i.e., in all cases, as far as I can see, you were able to download the work, just not unpack it).
2020-03-25T12:25:14.834Z ERROR executor(112) run Unhandled exception attempting to download payloads
Traceback (most recent call last):
File "C:\h\scripts\helix-scripts\helix\executor.py", line 108, in run
self._download_workitem_payload(workitem_payload_archive_uri)
File "C:\h\scripts\helix-scripts\helix\executor.py", line 170, in _download_workitem_payload
copy_tree_to(self.workitem_payload_dir, self.workitem_root)
File "C:\h\scripts\helix-scripts\helix\io.py", line 28, in copy_tree_to
shutil.copy2(path, file_target_path)
File "C:\python3.7.0\lib\shutil.py", line 257, in copy2
copyfile(src, dst, follow_symlinks=follow_symlinks)
File "C:\python3.7.0\lib\shutil.py", line 122, in copyfile
copyfileobj(fsrc, fdst)
File "C:\python3.7.0\lib\shutil.py", line 82, in copyfileobj
fdst.write(buf)
OSError: [Errno 28] No space left on device
2020-03-25T12:25:14.850Z INFO executor(115) run Exception downloading. Closing file handle to D:\Users\runner\AppData\Local\Temp\helix_active_download_a0a5ebb5-c046-43b1-ae74-4d98ec602202.sem
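One cheap client-side mitigation (a hedged sketch under the assumption that the expected unpacked size is known up front; this is not the actual Helix executor code) would be to check free space on the work drive before copying, so the run fails fast with a clear message instead of an OSError halfway through the copy:

```python
import shutil

def ensure_free_space(path, required_bytes, safety_margin=1.2):
    """Raise early if the drive hosting `path` cannot hold `required_bytes`
    plus a safety margin for logs, dumps, and re-run copies."""
    free = shutil.disk_usage(path).free
    needed = int(required_bytes * safety_margin)
    if free < needed:
        raise RuntimeError(
            f"Only {free / 2**30:.1f} GiB free at {path}, "
            f"need ~{needed / 2**30:.1f} GiB for this payload")

# Example: ~3.4 GB unpacked work item, doubled for the "unzip" + "exec" copies.
ensure_free_space(r"D:\Users\runner", required_bytes=2 * 3_400_000_000)
```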
Looking at the zip file for the work item it's still 694 MB zipped. @jashook was working on reducing this, are these runs perhaps missing his changes?
Querying general failures like this over the past week, there's no trend of any specific machine hitting this more often than others. Rather, your single work item's payload (ignoring all correlation payloads) is still well over 3 GB (unpacking the one above shows it as 3.32 GB on my local computer). As we discussed before, since your "single" work item payloads are really just lots of tests, the simplest and best fix is to split them up. I see something like 56 distinct tests (split across 798 DLLs) in this same work item. If you can figure out how to send that as two bursts of 28, your payload size will drop by approximately half. If you can send it as four bursts of 14, it will drop by 75%. If you make each test a distinct work item, payload size drops by a whopping factor of ~56 and you maximize usage of the available machines.
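The scaling described here is easy to sanity-check with a few lines of Python, assuming each test contributes a roughly equal share of the 3.32 GB unpacked payload (which is only approximately true):

```python
unpacked_gb = 3.32   # measured size of the unpacked JIT.jit64.mcc payload
test_count = 56      # distinct tests in the work item

for bursts in (1, 2, 4, 56):
    tests_per_item = test_count / bursts
    per_workitem_gb = unpacked_gb * tests_per_item / test_count
    print(f"{bursts:2d} work item(s): ~{per_workitem_gb:.2f} GB each")
# -> 1: ~3.32 GB, 2: ~1.66 GB, 4: ~0.83 GB, 56: ~0.06 GB per work item
```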
Thanks @MattGal for following up so quickly, that really sounds promising. Jared started looking into our artifact sizes in general as some of them seem ridiculously big; I guess that, once we're done with some initial inventory as to what artifacts we really need and what can be thrown away, we should reassess this and discuss the work item partitioning policy.
Jarrett also reminded me of what I did earlier in this thread: something put the variable back to C:\ here. I can resurrect the sneaky trick I used to undo this while reaching out to DDFUN to understand why it may have regressed. Will update this thread once done.
I met with Jarett and he reminded me we'd already done this for other machines, just evidently not for this queue. I've updated the machines again and discussed it with DDFUN, so you should be unblocked. (Edit: evidently some machines from the queue got re-imaged with old scripts and the manual fixup steps were not followed; this is the fallout.)
Awesome, thank you!
Presumably this isn't happening anymore. Closing. Feel free to reopen.
After I enabled Windows arm32 runs using the new Galaxy Book laptop queue (Windows.10.Arm64v8.Open), we're starting to see the first errors on that queue 😊. We now see a weirdly systematic error in the "JIT.jit64.mcc" work item, for instance in this run:
https://dev.azure.com/dnceng/public/_build/results?buildId=521323&view=logs&jobId=6c46bee0-e095-5eff-8d48-d352951d0d7b
It has two different manifestations: either the Helix log is not available at all (in the quoted run this is the case for the “no_tiered_compilation” flavor of the Windows arm32 job), or it’s present (like in the other Windows arm32 job in the same run) and complains about the missing XUnit wrapper for the test:
There are about 20 xUnit wrappers getting generated in the Pri0 runs and all the others apparently succeeded; I also see in the step "Copy native components to test output folder" of the job
CoreCLR Pri0 Test Run Windows_NT arm checked
that the JIT.jit64.XUnitWrapper.dll is generated fine just like all the other wrappers.
@MattGal, is there any magic you might be able to pull off to help us better understand what's going on, whether it's a reliability issue of the newly brought-up machines or perhaps of a particular machine, and / or how to investigate this further?