Closed ThadHouse closed 4 years ago
I believe we are seeing the same problem here.
We started getting some random failures, the code did not change on our side but something seems to have changed on the image, comparing logs we noticed the following.
When we call the dir
command we get slightly different outputs
2020-04-08T04:51:40.9092142Z Volume in drive D is Temporary Storage
2020-04-08T04:51:40.9093101Z Volume Serial Number is 4099-6268
...
2020-04-08T04:51:42.0603173Z rootdir: d:\a\1\s\src
2020-04-08T04:51:13.6285940Z Volume in drive D is Temp
2020-04-08T04:51:13.6286720Z Volume Serial Number is 7AAE-11A5
...
2020-04-08T04:51:14.7481347Z rootdir: D:\a\1\s\src
At least drive D seems to be slightly different, rootdir is now also lowercase, Windows path does not differentiate between lowercase and uppercase right?
We consistently get dll loading problems when drive D is Temporary Storage
.
Could someone shed a bit of light on this?
Just in case it helps, the problems we are getting are when loading a dll via python, not sure if this matters.
OSError: cannot load library 'myfile.dll': error 0x7e. Additionally, ctypes.util.find_library() did not manage to locate a library called
I was able to get a full loader log of my issue. The library being loaded from Java is wpiutiljni.dll
, which is being loaded via a full path. That library depends on wpiutild.dll
, and when loading the loader seems to be entirely skipping the current working directory, which the top of the log says is the directory where both dlls are located at.
The most interesting load log I think is attempting to load from C:\Program F\wpiutild.dll
. There are 65 paths it checks before that one. I wonder if whats happening is that the search path is getting too long, and overfilling a buffer.
Sorry for the multiple posts. I was able to completely separate the failing parts into their own repo.
Here is the azure build project. When working correctly, this works just fine https://dev.azure.com/thadhouse/AzureTest/_build?definitionId=19&_a=summary
Here is the project for that build. https://github.com/ThadHouse/AzureMFTest
Nothing fancy, just a dll that depends on another dll in the same folder. Works on local machines no problems, just not on the current azure builds.
Ibelieve the issue is not only Java specific, we have problems loading dlls from python too, is it maybe a more generic problem?
@FrancescElies could you provide repro steps, please?
More Info:
Using dumpbin we get following infos from our dll, as you can see we do not depend on any other dlls than microsoft.
Dump of file ourfile.dll
File Type: DLL
Image has the following dependencies:
VCRUNTIME140.dll
api-ms-win-crt-stdio-l1-1-0.dll
api-ms-win-crt-runtime-l1-1-0.dll
api-ms-win-crt-string-l1-1-0.dll
api-ms-win-crt-heap-l1-1-0.dll
api-ms-win-crt-environment-l1-1-0.dll
api-ms-win-crt-math-l1-1-0.dll
KERNEL32.dll
@FrancescElies I wonder if it's related https://github.com/actions/virtual-environments/issues/682
@FrancescElies could you provide repro steps, please?
Well the repo we are working on is private, I can't paste this kind of info here.
But what I can say is, triggering same pipeline for the same commit in two different moments in time have different outputs, sometimes failed, sometimes pass.
We could see small differences in the logs as mentioned in https://github.com/actions/virtual-environments/issues/691#issuecomment-610775130
Something is different, but our code is the same, did something change in image vs2017-win2016
?
@miketimofeev ist there any private channel where one can get support where reproduction steps could be provided?
@FrancescElies First of all could you please file another issue for vs2017-win2016? It seems @ThadHouse one related to vs2019 only.
@ThadHouse which image are you using? I could not see it in this post. We started having dll load problems yesterday so I thought this was the right place to post it.
@miketimofeev Do we still want a separate issue?
@FrancescElies the builds provided by @ThadHouse run on vs2019, but both VM images were updated yesterday. By the way, can you try to use disk C and check is the issue still persist?
No where in our pipeline we are hard-coding something like D:\a\1\s\src
this is where our code is being checked out by the checkout
,
In many places we use $(Build.SourcesDirectory)
which is a provided predefined variable, where that is is decided by Azure.
In the docs Azure also explicitly say not to hard-code such things.
By the way the builds where it worked it was also disk D.
@FrancescElies First of all could you please file another issue for vs2017-win2016? It seems @ThadHouse one related to vs2019 only.
New issue filled (#692)
@miketimofeev Moving to the C drive does not seem to fix anything.
@ThadHouse looks like your case is not related to the temporary drive because you haven't seen any successful builds since 1 AM on 4/7/20, right?
Our main build is actually here,
https://dev.azure.com/wpilib/wpilib/_build?definitionId=26&_a=summary
and the last build that was successful (in relation to this issue, it was a failure for different reasons) was #20200406.8
, which was Monday 4/6 at 10:30 AM PDT. #20200406.10
at 12:21 PM PDT that same day was our first failure related to this issue, and continuous failures ever since.
@ThadHouse We've forked your test project https://github.com/ThadHouse/AzureMFTest, but can't repro the issue - builds are fine. Could you make another run here, please? https://dev.azure.com/thadhouse/AzureTest/_build?definitionId=19&_a=summary
@miketimofeev Just ran a build from master, and it failed.
@miketimofeev After spending some more time, I was able to completely remove Java from the equation, and can reproduce it entirely with C++. How to reproduce is have the library in the current directory, but not the exe. Run the exe, and it won't search the current directory for the library like is supposed to happen.
What my guess is is that the PATH for the systems is too long. The last directory that attempts to get loaded is C:\Program F\wpiutild.dll
. I think the path is getting cut off by the loader, causing the failure.
@ThadHouse thanks for such deep diving! We have slightly different settings across environments and that might cause different path behavior. We'll investigate a way to fix it.
Could you print your %PATH% variable, please?
Done. Even there it looks like its overflowing.
Apparently all environmental variables combined must be less then 32767 characters. I wonder if with how many variables azure is adding its getting cut off.
https://devblogs.microsoft.com/oldnewthing/20100203-00/?p=15083
Hello,@ThadHouse
Could you please test your build with PATH that doesn't contain trailing space? For example powershell task:
$env:Path = $env:Path.Split(";").Trim() -join ";"
@al-cheb Doesn't help. Build link below where I set the PATH to just the windows directory. And still get the same failure. It does get rid of all the extra path reads though, so maybe a PATH issue is chasing the wrong thing.
Something interesting about my latest build is
17a8:179c @ 00262437 - LdrpComputeLazyDllPath - INFO: DLL search path computed: D:\a\1\s\NativeApp;C:\Program Files\Microsoft SQL Server\150\Shared\;C:\windows\SYSTEM32;C:\windows\system;C:\windows;C:\Program Files (x86)\Windows Kits\10\Debuggers\x64;C:\windows\system32;C:\windows;
That's the search path by the loader, and its completely missing the current working directory, which is where the DLL is located. At the top of the log, the current directory is the directory the DLL is located in. I see nothing in the windows docs that would cause the current working directory to be completely ignored. Especially something caused by an update. There has to be some sort of change made to the images in the 2 hours between 10:30 AM PDT and 12:21 PM PDT on 4/6. Our builds started failing there, with absolutely no change to the actual build. Maybe a windows update or something?
@ThadHouse Just to double check. Could you please copy wpiutild.dll to D:\a\1\s\NativeApp?
@al-cheb Yup, that works. So the DLL and EXE are definitely compatible with each other.
@ThadHouse, That's odd why the current path doesn't include by default. Does the set PATH=%CD%;%PATH%
command work?
I can try tomorrow. But according to the windows library loader documentation, the current directory should definitely be searched, independent of it being in path or not. I tried digging for a few hours today, and could find no way to make that step be skipped.
@al-cheb Setting CD in Path does work correctly, as expected. PATH is getting searched, just not the current directory as expected.
@ThadHouse we've found the root cause - it's SQL server express installation https://github.com/actions/virtual-environments/pull/568. Looks like it adds some libraries dynamically so the DLL search path is being changed.
Without SQL Server:
0148:0420 @ 00129046 - LdrpComputeLazyDllPath - INFO: DLL search path computed: D:\a\1\s\NativeApp;C:\windows\SYSTEM32;C:\windows\system;C:\windows;.;C:\Program Files (x86)\Windows Kits\10\Debuggers\x64;C:\agents\2.165.2\externals\git\cmd;C:\Program Files\Mercurial\;C:\ProgramData\kind;C:\vcpkg;C:\cf-cli;C:\Program Files (x86)\NSIS\;C:\Program Files\Mercurial\;C:\Program Files\dotnet;C:\mysql-5.7.21-winx64\bin;C:\Program Files\Java\zulu-8-azure-jdk_8.40.0.25-8.0.222-win_x64\bin;C:\SeleniumWebDrivers\Gecko
With SQL Server:
2020-04-10T11:37:52.0012145Z 04c4:0294 @ 16905093 - LdrpComputeLazyDllPath - INFO: DLL search path computed: D:\a\1\s\NativeApp;C:\Program Files\Microsoft SQL Server\150\Shared\;C:\windows\SYSTEM32;C:\windows\system;C:\windows;C:\Program Files (x86)\Windows Kits\10\Debuggers\x64;C:\agents\2.165.2\externals\git\cmd;C:\Program Files\Mercurial\;C:\ProgramData\kind;C:\vcpkg;C:\cf-cli;C:\Program Files (x86)\NSIS\;C:\Program Files\Mercurial\;C:\Program Files\dotnet;C:\mysql-5.7.21-winx64\bin;C:\Program Files\Java\zulu-8-azure-jdk_8.40.0.25
We're going to remove it from the image to unblock everyone and continue further investigation
Wow. Thank you for finding that. That's awful behavior for an app to complete change that during the loading process.
Sorry for the double post. Do you know how long this will take to propagate to users?
@ThadHouse I'm afraid not so quick. By the end of the next week or so.
@ThadHouse we've deployed the image with the fix. Could you please check the behavior?
@ThadHouse I'm going to close the issue since there are no new issues after the fix. Feel free to ping me if you have any concerns. Thank you!
Did the windows build agents get updated on 4/6/20 without a changelog being posted? Our PR builds were working until about 1 AM on 4/7/20, and now they all fail with library load errors. Trying to find what caused this, as we have not committed anything new in days, and after 8 hours am still stuck. Would gladly appreciate some help in figuring out if anything changed on Azures end.