P.S. I tried to check whether the failures are confined to a single machine, but I found at least two: DDARM64-056 and DDARM64-110.
Build | Pull Request | Test Failure Count |
---|---|---|
#521184 | #32265 | 1 |
#521210 | Rolling | 2 |
#521211 | #32227 | 2 |
#521254 | Rolling | 2 |
#521323 | #32292 | 3 |
#521405 | #32265 | 2 |
#521426 | #32227 | 2 |
#521515 | #32299 | 2 |
#521519 | Rolling | 2 |
#521556 | #32302 | 2 |
Build | Pull Request | Console | Core | Test Results |
---|---|---|---|---|
#521184 | #32265 | | | |
#521210 | Rolling | console.7f2a7422.log | | |
#521210 | Rolling | console.694e6413.log | | |
#521211 | #32227 | console.653c707e.log | | |
#521211 | #32227 | | | |
#521254 | Rolling | console.e360eac4.log | | |
#521254 | Rolling | console.f1862260.log | | |
#521323 | #32292 | console.7ded7a33.log | | |
#521323 | #32292 | | | |
#521323 | #32292 | console.88e0d943.log | | |
#521405 | #32265 | console.5c11eead.log | | |
#521405 | #32265 | | | |
#521426 | #32227 | | | |
#521426 | #32227 | | | |
#521515 | #32299 | console.9b5c908e.log | | |
#521515 | #32299 | console.123bd29d.log | | |
#521519 | Rolling | console.677db533.log | | |
#521519 | Rolling | | | |
#521556 | #32302 | console.1fc286e8.log | | |
#521556 | #32302 | console.30a0b3cc.log | | |
This is affecting a large fraction of the PRs. I am trying to disable these tests in https://github.com/dotnet/runtime/pull/32372 until this is fixed.
Disabling the mcc tests moved the failure to the next work item. This means that the failure is not specific to the mcc tests; they just happen to be the victim due to ordering.
@trylek Can we disable the ARM runs until this is fixed?
Submitted a PR to disable coreclr's test execution on ARM: https://github.com/dotnet/runtime/pull/32404
Tomas already disabled them: https://github.com/dotnet/runtime/commit/d9bc547b2eaa31ae9fc7db470bdaeec321676458
FWIW, one thing occurred to me during my chat with Viktor yesterday: when I was standing up the queue of Galaxy Book laptops for .NET Native testing about 1 1/2 years ago, I was hitting weird reliability issues that I later found were caused by the Windows installation on these laptops continually spewing internal crash dumps onto the relatively small HDD, which soon overflowed.
I ended up talking to some Watson folks who recommended setting a magic environment variable, and that ultimately fixed it. I'm not saying this is necessarily the cause here, but I can easily imagine that some of the weird symptoms, like the absence of any relationship to a particular workload or the non-deterministic absence of logs, could be explained by a lack of disk space.
Adding link to the related older item for reference: https://github.com/dotnet/runtime/issues/1097
Sorry for joining the party late, I am taking a look now.
No worries @MattGal, I have launched a fresh new run re-enabling the Windows ARM32 job so I should have a new set of results available shortly:
#32819
I don't think I actually need the new run, the most recent one died from running out of disk space. Investigation continues.
I spent some time pondering JIT.jit64.mcc and the logs. It's clear that the problem is that we never really expected 3+ GB work item payloads, but we can make it work. Note that if it's slow to unzip on my computer, it's slow to unzip on the Helix laptops. Sample log.
Work item payload zip: 693 MB (709,428 KB). Correlation payload zips: the only big one is 248 MB (254,382 KB).
That's ~941 MB zipped (963,810 KB), and these zips have to stay on disk until the work is finished.
Once unpacked, the work item zip goes from 693 MB to 3.382 GB (3,571,486,720 bytes).
Because a work item a) might get rerun and b) might munge its own directory, we keep two copies of this and re-copy from the "unzip" folder to the "exec" folder every time.
The correlation payload zip goes from 248 MB to 848 MB unpacked.
That means just having this work item unpacked eats 3.382 + 3.382 + 0.848 + 0.963 GB = 8.575 GB for just the work item, before counting logs, dumps, etc.
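For anyone who wants to re-run this arithmetic with different payload sizes, here is a tiny back-of-envelope sketch in Python using the numbers quoted above (the doubling models the separate "unzip" and "exec" copies described in this comment; it is an illustration, not Helix code):

```python
# Rough disk footprint of one unpacked JIT.jit64.mcc work item on a Helix machine.
# All sizes are the ones quoted in this thread; adjust to model other work items.

workitem_unpacked_gb = 3.382     # work item payload after unzip
correlation_unpacked_gb = 0.848  # correlation payload after unzip
zipped_payloads_gb = 0.963       # ~941 MB of zips that must stay on disk until the run finishes

# The unpacked work item exists twice: once in the "unzip" folder and once in
# the "exec" folder it gets re-copied to for every (re)run.
footprint_gb = 2 * workitem_unpacked_gb + correlation_unpacked_gb + zipped_payloads_gb
print(f"approx. footprint, excluding logs and dumps: {footprint_gb:.3f} GB")  # -> 8.575 GB
```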
Things we should pursue:
I believe that @echesakovMSFT was working on partitioning CoreCLR tests into chunks. I remember from my .NET Native test migration to Helix that we ended up with vastly different characteristics of Intel vs. ARM work items in terms of size. During our chat in Redmond @jashook mentioned that the current design is very inflexible in terms of adjustable work item sizes. If this turns out to be a crucial factor for ARM testing, we might want to rethink some of the infra logic with new goals in mind like clean Mono support or tagging tests for OS independence.
@trylek If we need to solve this issue now, we can specify a finer partitioning of the JIT.jit64.mcc work item in src\coreclr\tests\testgrouping.proj. Since it's an MSBuild file, you can also put conditions on $(BuildOS) and $(BuildArch) and limit this partitioning to win-arm targets.
It's also doable to have a separate partitioning scheme for each combination of $(BuildOS) x $(BuildArch) if needed, so I am not sure what @jashook means by "very inflexible in terms of adjustable work item sizes". On the contrary, it's quite flexible: you can have a work item consisting of a single test as well as a work item consisting of tests from multiple directories. However, figuring out the right partitioning scheme is hard, especially when you want to minimize not only the time each work item takes to run but also the time it takes to upload/download a work item payload and unpack it on a Helix machine. This is why we have done this tuning only for x64 testing.
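To illustrate why finding a good scheme is non-trivial, here is a minimal sketch (hypothetical; the real grouping is declared in testgrouping.proj as MSBuild items, not computed like this) of a greedy size-balancing pass that spreads test groups across a fixed number of work items:

```python
import heapq

def partition_by_size(tests, n_workitems):
    """Greedily assign (name, size_mb) pairs to n_workitems buckets,
    always placing the next-largest group into the currently smallest bucket."""
    buckets = [(0, i, []) for i in range(n_workitems)]  # (total_mb, bucket_index, names)
    heapq.heapify(buckets)
    for name, size in sorted(tests, key=lambda t: t[1], reverse=True):
        total, idx, names = heapq.heappop(buckets)
        names.append(name)
        heapq.heappush(buckets, (total + size, idx, names))
    return sorted(buckets, key=lambda b: b[1])

# Hypothetical per-group sizes in MB; not real measurements.
tests = [("mcc_group_a", 120), ("mcc_group_b", 95), ("mcc_group_c", 140),
         ("mcc_group_d", 60), ("mcc_group_e", 75), ("mcc_group_f", 110)]

for total, idx, names in partition_by_size(tests, n_workitems=3):
    print(f"work item {idx}: ~{total} MB -> {names}")
```

This balances only on-disk size; balancing run time and upload/unpack time at the same time is exactly the multi-objective problem described above.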
@MattGal By the way, if I remember right, the work item size is why JIT/jit64 was split into multiple work items in the first place. All the test artifacts in the jit64 directory on Windows take roughly 6 GB, and when we were bringing up Helix testing in coreclr this was too much even for x64.
@echesakovMSFT thanks for clarifying. Once we fix up the machines they should have a lot more space, so you may not have to change anything, but unless it results in duplicating content across work item payloads, more, smaller work items will generally make it through Helix faster.
I've fixed up these machines so their work directory is on the disk with 60 GB free, so you now have about 50 more GB to play with in the work directory. Do note that with payloads this big, downloading and unzipping them is going to be a non-trivial part of their execution time; there's not much we can do about that.
@trylek can you kick off a fresh run?
Thanks @MattGal. Closed & reopened the PR; the results are kind of weird: the summary in the PR indicates that the Windows legs are still running, but in Azure it shows they failed. For the Windows ARM run, if I read the log correctly, it claims that it lost connection to the pool.
CoreCLR Pri0 Test Run Windows_NT arm checked:
##[error]We stopped hearing from agent NetCorePublic-Pool 8. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error. For more information, see: https://go.microsoft.com/fwlink/?linkid=846610 Pool: NetCorePublic-Pool Agent: NetCorePublic-Pool 8 Started: Yesterday at 10:16 PM Duration: 1h 31m 13s Job preparation parameters 7 queue time variables used
This is super weird. For now, let's just retry them and if it comes back I can investigate.
@MattGal, is there any way to double-check whether this is a one-off issue or a problem with a particular machine? This is the first occurrence in about 5 days, so I'm not that scared yet, but if this starts reproing on a more regular basis I'll be strongly pushed to disable the ARM runs again. Thanks a lot!
Yes, actually; it's not terribly hard to use Kusto queries to see if a particular machine is an outlier for your work, given enough work items. I'll take a peek.
My understanding here is that most of the "fix" was basically a refactoring of payloads so they're not 700+ MB per work item; if that regressed on your side it could be relevant.
The work item Viktor linked failed due to downloading and unpacking its payload filling the disk... so the "good" part here is that network speed isn't the problem (i.e., in all cases, as far as I can see, you were able to download the work, just not unpack it).
2020-03-25T12:25:14.834Z ERROR executor(112) run Unhandled exception attempting to download payloads
Traceback (most recent call last):
File "C:\h\scripts\helix-scripts\helix\executor.py", line 108, in run
self._download_workitem_payload(workitem_payload_archive_uri)
File "C:\h\scripts\helix-scripts\helix\executor.py", line 170, in _download_workitem_payload
copy_tree_to(self.workitem_payload_dir, self.workitem_root)
File "C:\h\scripts\helix-scripts\helix\io.py", line 28, in copy_tree_to
shutil.copy2(path, file_target_path)
File "C:\python3.7.0\lib\shutil.py", line 257, in copy2
copyfile(src, dst, follow_symlinks=follow_symlinks)
File "C:\python3.7.0\lib\shutil.py", line 122, in copyfile
copyfileobj(fsrc, fdst)
File "C:\python3.7.0\lib\shutil.py", line 82, in copyfileobj
fdst.write(buf)
OSError: [Errno 28] No space left on device
2020-03-25T12:25:14.850Z INFO executor(115) run Exception downloading. Closing file handle to D:\Users\runner\AppData\Local\Temp\helix_active_download_a0a5ebb5-c046-43b1-ae74-4d98ec602202.sem
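One cheap client-side mitigation (a hedged sketch under the assumption that the expected unpacked size is known up front; this is not the actual Helix executor code) would be to check free space on the work drive before copying, so the run fails fast with a clear message instead of an OSError halfway through the copy:

```python
import shutil

def ensure_free_space(path, required_bytes, safety_margin=1.2):
    """Raise early if the drive hosting `path` cannot hold `required_bytes`
    plus a safety margin for logs, dumps, and re-run copies."""
    free = shutil.disk_usage(path).free
    needed = int(required_bytes * safety_margin)
    if free < needed:
        raise RuntimeError(
            f"Only {free / 2**30:.1f} GiB free at {path}, "
            f"need ~{needed / 2**30:.1f} GiB for this payload")

# Example: ~3.4 GB unpacked work item, doubled for the "unzip" + "exec" copies.
ensure_free_space(r"D:\Users\runner", required_bytes=2 * 3_400_000_000)
```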
Looking at the zip file for the work item it's still 694 MB zipped. @jashook was working on reducing this, are these runs perhaps missing his changes?
Querying general failures like this over the past week, there's no trend of any specific machine hitting this more often than others. Rather, your single work item's payload (ignoring all correlation payloads) is still well over 3 GB (unpacking the one above shows it as 3.32 GB on my local computer). As we discussed before, since your "single" work item payloads are really just lots of tests, the simplest and best fix is to split them up. I see something like 56 distinct tests (split across 798 DLLs) in this same work item. If you can figure out how to send that as two bursts of 28, your payload size will drop by approximately half. If you can send it as four bursts of 14, it will drop by 75%. If you make each test a distinct work item, payload size drops by a whopping factor of ~56 and you maximize usage of the available machines.
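The scaling described here is easy to sanity-check with a few lines of Python, assuming each test contributes a roughly equal share of the 3.32 GB unpacked payload (which is only approximately true):

```python
unpacked_gb = 3.32   # measured size of the unpacked JIT.jit64.mcc payload
test_count = 56      # distinct tests in the work item

for bursts in (1, 2, 4, 56):
    tests_per_item = test_count / bursts
    per_workitem_gb = unpacked_gb * tests_per_item / test_count
    print(f"{bursts:2d} work item(s): ~{per_workitem_gb:.2f} GB each")
# -> 1: ~3.32 GB, 2: ~1.66 GB, 4: ~0.83 GB, 56: ~0.06 GB per work item
```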
Thanks @MattGal for following up so quickly, that really sounds promising. Jared started looking into our artifact sizes in general as some of them seem ridiculously big; I guess that, once we're done with some initial inventory as to what artifacts we really need and what can be thrown away, we should reassess this and discuss the work item partitioning policy.
Jarrett also reminded me of what I did earlier in this thread: something put the variable back to C:\ here. I can resurrect the sneaky trick I used to undo this while reaching out to DDFUN to understand why it may have regressed. Will update this thread once done.
I met with Jarett and he reminded me we'd already done this for other machines, just evidently not for this queue. I've updated the machines again and discussed it with DDFUN, so you should be unblocked. (Edit: evidently some machines from the queue got re-imaged with old scripts and the manual fixup steps were not followed; this is the fallout.)
Awesome, thank you!
Presumably this isn't happening anymore. Closing. Feel free to reopen.
After I enabled Windows arm32 runs using the new Galaxy Book laptop queue (Windows.10.Arm64v8.Open), we're starting to see the first errors on that queue 😊. We now see a weirdly systematic error in the "JIT.jit64.mcc" work item, for instance in this run:
https://dev.azure.com/dnceng/public/_build/results?buildId=521323&view=logs&jobId=6c46bee0-e095-5eff-8d48-d352951d0d7b
It has two different manifestations: either the Helix log is not available at all (in the quoted run this is the case for the “no_tiered_compilation” flavor of the Windows arm32 job), or it’s present (like in the other Windows arm32 job in the same run) and complains about the missing XUnit wrapper for the test:
There are about 20 xUnit wrappers getting generated in the Pri0 runs and all the others apparently succeeded; I also see in the step "Copy native components to test output folder" of the job
CoreCLR Pri0 Test Run Windows_NT arm checked
that the JIT.jit64.XUnitWrapper.dll is generated fine just like all the other wrappers.
@MattGal, is there any magic you might be able to pull off to help us better understand what's going on, whether it's a reliability issue of the newly brought-up machines or perhaps of a particular machine, and / or how to investigate this further?