actions / runner-images

GitHub Actions runner images
MIT License
9.17k stars 2.84k forks source link

Windows build changes without a release log #691

Closed ThadHouse closed 4 years ago

ThadHouse commented 4 years ago

Did the windows build agents get updated on 4/6/20 without a changelog being posted? Our PR builds were working until about 1 AM on 4/7/20, and now they all fail with library load errors. Trying to find what caused this, as we have not committed anything new in days, and after 8 hours am still stuck. Would gladly appreciate some help in figuring out if anything changed on Azures end.

FrancescElies commented 4 years ago

I believe we are seeing the same problem here.

We started getting some random failures, the code did not change on our side but something seems to have changed on the image, comparing logs we noticed the following.

When we call the dir command we get slightly different outputs

At least drive D seems to be slightly different, rootdir is now also lowercase, Windows path does not differentiate between lowercase and uppercase right?

We consistently get dll loading problems when drive D is Temporary Storage.

Could someone shed a bit of light on this?

FrancescElies commented 4 years ago

Just in case it helps, the problems we are getting are when loading a dll via python, not sure if this matters.

OSError: cannot load library 'myfile.dll': error 0x7e.  Additionally, ctypes.util.find_library() did not manage to locate a library called 
ThadHouse commented 4 years ago

I was able to get a full loader log of my issue. The library being loaded from Java is wpiutiljni.dll, which is being loaded via a full path. That library depends on wpiutild.dll, and when loading the loader seems to be entirely skipping the current working directory, which the top of the log says is the directory where both dlls are located at.

FullLog.txt

ThadHouse commented 4 years ago

The most interesting load log I think is attempting to load from C:\Program F\wpiutild.dll. There are 65 paths it checks before that one. I wonder if whats happening is that the search path is getting too long, and overfilling a buffer.

ThadHouse commented 4 years ago

Sorry for the multiple posts. I was able to completely separate the failing parts into their own repo.

Here is the azure build project. When working correctly, this works just fine https://dev.azure.com/thadhouse/AzureTest/_build?definitionId=19&_a=summary

Here is the project for that build. https://github.com/ThadHouse/AzureMFTest

Nothing fancy, just a dll that depends on another dll in the same folder. Works on local machines no problems, just not on the current azure builds.

FrancescElies commented 4 years ago

image

Ibelieve the issue is not only Java specific, we have problems loading dlls from python too, is it maybe a more generic problem?

miketimofeev commented 4 years ago

@FrancescElies could you provide repro steps, please?

FrancescElies commented 4 years ago

More Info:

Using dumpbin we get following infos from our dll, as you can see we do not depend on any other dlls than microsoft.

Dump of file ourfile.dll

File Type: DLL

  Image has the following dependencies:

    VCRUNTIME140.dll
    api-ms-win-crt-stdio-l1-1-0.dll
    api-ms-win-crt-runtime-l1-1-0.dll
    api-ms-win-crt-string-l1-1-0.dll
    api-ms-win-crt-heap-l1-1-0.dll
    api-ms-win-crt-environment-l1-1-0.dll
    api-ms-win-crt-math-l1-1-0.dll
    KERNEL32.dll
miketimofeev commented 4 years ago

@FrancescElies I wonder if it's related https://github.com/actions/virtual-environments/issues/682

FrancescElies commented 4 years ago

@FrancescElies could you provide repro steps, please?

Well the repo we are working on is private, I can't paste this kind of info here.

But what I can say is, triggering same pipeline for the same commit in two different moments in time have different outputs, sometimes failed, sometimes pass.

We could see small differences in the logs as mentioned in https://github.com/actions/virtual-environments/issues/691#issuecomment-610775130

Something is different, but our code is the same, did something change in image vs2017-win2016?

682 It could be related to that one, but we don't get any compiling errors, one bit of info that might be relevant is that we get things compiled in a different machine, though this does not explain why it sometimes works and others doesn't.

FrancescElies commented 4 years ago

@miketimofeev ist there any private channel where one can get support where reproduction steps could be provided?

miketimofeev commented 4 years ago

@FrancescElies First of all could you please file another issue for vs2017-win2016? It seems @ThadHouse one related to vs2019 only.

FrancescElies commented 4 years ago

@ThadHouse which image are you using? I could not see it in this post. We started having dll load problems yesterday so I thought this was the right place to post it.

@miketimofeev Do we still want a separate issue?

miketimofeev commented 4 years ago

@FrancescElies the builds provided by @ThadHouse run on vs2019, but both VM images were updated yesterday. By the way, can you try to use disk C and check is the issue still persist?

FrancescElies commented 4 years ago

No where in our pipeline we are hard-coding something like D:\a\1\s\src this is where our code is being checked out by the checkout,

In many places we use $(Build.SourcesDirectory) which is a provided predefined variable, where that is is decided by Azure.

In the docs Azure also explicitly say not to hard-code such things. image

By the way the builds where it worked it was also disk D.

FrancescElies commented 4 years ago

@FrancescElies First of all could you please file another issue for vs2017-win2016? It seems @ThadHouse one related to vs2019 only.

New issue filled (#692)

ThadHouse commented 4 years ago

@miketimofeev Moving to the C drive does not seem to fix anything.

miketimofeev commented 4 years ago

@ThadHouse looks like your case is not related to the temporary drive because you haven't seen any successful builds since 1 AM on 4/7/20, right?

ThadHouse commented 4 years ago

Our main build is actually here,

https://dev.azure.com/wpilib/wpilib/_build?definitionId=26&_a=summary

and the last build that was successful (in relation to this issue, it was a failure for different reasons) was #20200406.8, which was Monday 4/6 at 10:30 AM PDT. #20200406.10 at 12:21 PM PDT that same day was our first failure related to this issue, and continuous failures ever since.

miketimofeev commented 4 years ago

@ThadHouse We've forked your test project https://github.com/ThadHouse/AzureMFTest, but can't repro the issue - builds are fine. Could you make another run here, please? https://dev.azure.com/thadhouse/AzureTest/_build?definitionId=19&_a=summary

ThadHouse commented 4 years ago

@miketimofeev Just ran a build from master, and it failed.

https://dev.azure.com/thadhouse/AzureTest/_build/results?buildId=671&view=logs&j=12f1170f-54f2-53f3-20dd-22fc7dff55f9&t=9c939e41-62c2-5605-5e05-fc3554afc9f5

ThadHouse commented 4 years ago

@miketimofeev After spending some more time, I was able to completely remove Java from the equation, and can reproduce it entirely with C++. How to reproduce is have the library in the current directory, but not the exe. Run the exe, and it won't search the current directory for the library like is supposed to happen.

What my guess is is that the PATH for the systems is too long. The last directory that attempts to get loaded is C:\Program F\wpiutild.dll. I think the path is getting cut off by the loader, causing the failure.

miketimofeev commented 4 years ago

@ThadHouse thanks for such deep diving! We have slightly different settings across environments and that might cause different path behavior. We'll investigate a way to fix it.

miketimofeev commented 4 years ago

Could you print your %PATH% variable, please?

ThadHouse commented 4 years ago

https://dev.azure.com/thadhouse/AzureTest/_build/results?buildId=677&view=logs&j=12f1170f-54f2-53f3-20dd-22fc7dff55f9&t=f8ed7bd8-2a7f-56f6-9385-7fc29a8b5b7b

Done. Even there it looks like its overflowing.

ThadHouse commented 4 years ago

Apparently all environmental variables combined must be less then 32767 characters. I wonder if with how many variables azure is adding its getting cut off.

https://devblogs.microsoft.com/oldnewthing/20100203-00/?p=15083

al-cheb commented 4 years ago

Hello,@ThadHouse Could you please test your build with PATH that doesn't contain trailing space? For example powershell task: $env:Path = $env:Path.Split(";").Trim() -join ";"

ThadHouse commented 4 years ago

@al-cheb Doesn't help. Build link below where I set the PATH to just the windows directory. And still get the same failure. It does get rid of all the extra path reads though, so maybe a PATH issue is chasing the wrong thing.

https://dev.azure.com/thadhouse/AzureTest/_build/results?buildId=678&view=logs&j=12f1170f-54f2-53f3-20dd-22fc7dff55f9&t=5caf77c8-9b10-50ef-b5c7-ca89c63e1c86

ThadHouse commented 4 years ago

Something interesting about my latest build is

17a8:179c @ 00262437 - LdrpComputeLazyDllPath - INFO: DLL search path computed: D:\a\1\s\NativeApp;C:\Program Files\Microsoft SQL Server\150\Shared\;C:\windows\SYSTEM32;C:\windows\system;C:\windows;C:\Program Files (x86)\Windows Kits\10\Debuggers\x64;C:\windows\system32;C:\windows;

That's the search path by the loader, and its completely missing the current working directory, which is where the DLL is located. At the top of the log, the current directory is the directory the DLL is located in. I see nothing in the windows docs that would cause the current working directory to be completely ignored. Especially something caused by an update. There has to be some sort of change made to the images in the 2 hours between 10:30 AM PDT and 12:21 PM PDT on 4/6. Our builds started failing there, with absolutely no change to the actual build. Maybe a windows update or something?

al-cheb commented 4 years ago

@ThadHouse Just to double check. Could you please copy wpiutild.dll to D:\a\1\s\NativeApp?

ThadHouse commented 4 years ago

@al-cheb Yup, that works. So the DLL and EXE are definitely compatible with each other.

https://dev.azure.com/thadhouse/AzureTest/_build/results?buildId=679&view=logs&j=12f1170f-54f2-53f3-20dd-22fc7dff55f9&t=12f1170f-54f2-53f3-20dd-22fc7dff55f9

al-cheb commented 4 years ago

@ThadHouse, That's odd why the current path doesn't include by default. Does the set PATH=%CD%;%PATH% command work?

ThadHouse commented 4 years ago

I can try tomorrow. But according to the windows library loader documentation, the current directory should definitely be searched, independent of it being in path or not. I tried digging for a few hours today, and could find no way to make that step be skipped.

ThadHouse commented 4 years ago

@al-cheb Setting CD in Path does work correctly, as expected. PATH is getting searched, just not the current directory as expected.

miketimofeev commented 4 years ago

@ThadHouse we've found the root cause - it's SQL server express installation https://github.com/actions/virtual-environments/pull/568. Looks like it adds some libraries dynamically so the DLL search path is being changed.

Without SQL Server: 0148:0420 @ 00129046 - LdrpComputeLazyDllPath - INFO: DLL search path computed: D:\a\1\s\NativeApp;C:\windows\SYSTEM32;C:\windows\system;C:\windows;.;C:\Program Files (x86)\Windows Kits\10\Debuggers\x64;C:\agents\2.165.2\externals\git\cmd;C:\Program Files\Mercurial\;C:\ProgramData\kind;C:\vcpkg;C:\cf-cli;C:\Program Files (x86)\NSIS\;C:\Program Files\Mercurial\;C:\Program Files\dotnet;C:\mysql-5.7.21-winx64\bin;C:\Program Files\Java\zulu-8-azure-jdk_8.40.0.25-8.0.222-win_x64\bin;C:\SeleniumWebDrivers\Gecko

With SQL Server: 2020-04-10T11:37:52.0012145Z 04c4:0294 @ 16905093 - LdrpComputeLazyDllPath - INFO: DLL search path computed: D:\a\1\s\NativeApp;C:\Program Files\Microsoft SQL Server\150\Shared\;C:\windows\SYSTEM32;C:\windows\system;C:\windows;C:\Program Files (x86)\Windows Kits\10\Debuggers\x64;C:\agents\2.165.2\externals\git\cmd;C:\Program Files\Mercurial\;C:\ProgramData\kind;C:\vcpkg;C:\cf-cli;C:\Program Files (x86)\NSIS\;C:\Program Files\Mercurial\;C:\Program Files\dotnet;C:\mysql-5.7.21-winx64\bin;C:\Program Files\Java\zulu-8-azure-jdk_8.40.0.25

We're going to remove it from the image to unblock everyone and continue further investigation

ThadHouse commented 4 years ago

Wow. Thank you for finding that. That's awful behavior for an app to complete change that during the loading process.

ThadHouse commented 4 years ago

Sorry for the double post. Do you know how long this will take to propagate to users?

miketimofeev commented 4 years ago

@ThadHouse I'm afraid not so quick. By the end of the next week or so.

miketimofeev commented 4 years ago

@ThadHouse we've deployed the image with the fix. Could you please check the behavior?

miketimofeev commented 4 years ago

@ThadHouse I'm going to close the issue since there are no new issues after the fix. Feel free to ping me if you have any concerns. Thank you!