Closed dsyme closed 4 years ago
Repros on RC1 for me
@dsyme, this is really cool. I will see what I can do.
This repros quite nicely on Windows: I get ...
> #r "nuget: DiffSharp-cpu,1.0.0-preview-258177528";;
[Loading C:\Users\codec\AppData\Local\Temp\nuget\4456--e5f48bb6-ed4f-4c75-ba1c-dba1ef125698\Project.fsproj.fsx]
namespace FSI_0002.Project
>
- open DiffSharp
- dsharp.config(backend=Backend.Torch)
- let t = dsharp.tensor [ 0 .. 10 ];;
Binding session to 'C:\Users\codec\.nuget\packages\diffsharp.core\1.0.0-preview-258177528\lib\netstandard2.1\DiffSharp.Core.dll'...
Binding session to 'C:\Users\codec\.nuget\packages\diffsharp.backends.torch\1.0.0-preview-258177528\lib\netcoreapp3.0\DiffSharp.Backends.Torch.dll'...
Binding session to 'C:\Users\codec\.nuget\packages\torchsharp\0.3.52276\lib\netcoreapp3.0\TorchSharp.dll'...
System.DllNotFoundException: Unable to load DLL 'C:/Users/codec/.nuget/packages/torchsharp/0.3.52276/runtimes\win-x64\native\LibTorchSharp.dll' or one of its dependencies: The specified module could not be found. (0x8007007E)
at System.Runtime.Loader.AssemblyLoadContext.InternalLoadUnmanagedDllFromPath(String unmanagedDllPath)
at System.Runtime.Loader.AssemblyLoadContext.LoadUnmanagedDllFromPath(String unmanagedDllPath)
at Microsoft.DotNet.DependencyManager.NativeAssemblyLoadContext.LoadNativeLibrary(String path) in C:\kevinransom\fsharp\src\fsharp\Microsoft.DotNet.DependencyManager\NativeDllResolveHandler.fs:line 46
at Microsoft.DotNet.DependencyManager.NativeDllResolveHandlerCoreClr._resolveUnmanagedDll(Assembly _arg1, String name) in C:\kevinransom\fsharp\src\fsharp\Microsoft.DotNet.DependencyManager\NativeDllResolveHandler.fs:line 114
at <StartupCode$Microsoft-DotNet-DependencyManager>.$NativeDllResolveHandler.-ctor@120-2.Invoke(Assembly delegateArg0, String delegateArg1) in C:\kevinransom\fsharp\src\fsharp\Microsoft.DotNet.DependencyManager\NativeDllResolveHandler.fs:line 120
at System.Runtime.Loader.AssemblyLoadContext.GetResolvedUnmanagedDll(Assembly assembly, String unmanagedDllName)
at System.Runtime.Loader.AssemblyLoadContext.ResolveUnmanagedDllUsingEvent(String unmanagedDllName, Assembly assembly, IntPtr gchManagedAssemblyLoadContext)
at TorchSharp.Tensor.FloatTensor.THSTensor_newFloatScalar(Single scalar, Boolean requiresGrad)
at TorchSharp.Tensor.FloatTensor.From(Single scalar, Boolean requiresGrad)
at <StartupCode$DiffSharp-Backends-Torch>.$Torch.RawTensor.-ctor@900-1.Invoke(Single v)
at DiffSharp.Backends.Torch.TorchStatics`2.CreateFromFlatArray(Array values, Int32[] shape, Device device)
at DiffSharp.Tensor.create(Object value, FSharpOption`1 dtype, FSharpOption`1 device, FSharpOption`1 backend)
at DiffSharp.dsharp.config(FSharpOption`1 dtype, FSharpOption`1 device, FSharpOption`1 backend)
at <StartupCode$FSI_0003>.$FSI_0003.main@()
Stopped due to error
>
-
So... transitive native dependencies are not resolvable using the native resolution event handler. I am fairly confident that it only works for managed code that has native dependencies. It looks to me like the mechanism was primarily designed to make P/Invoke work.
@jkotas
With the AssemblyLoadContext.ResolvingUnmanagedDll event handler, we are not being notified of native DLL load attempts that are caused by a native dependency of a native library, on either Windows or Linux. Is that "by design", or is there a mechanism we can use that will allow us to detect attempts to transitively load native DLLs?
In our example above we have a managed library that loads a native library, LibTorchSharp.dll, which itself has a native dependency on another library, torch_cpu.dll.
We get notified to locate the LibTorchSharp.dll dependency but not the torch_cpu.dll one.
Not that it will be much help but our handler is here: https://github.com/dotnet/fsharp/blob/main/src/fsharp/Microsoft.DotNet.DependencyManager/NativeDllResolveHandler.fs#L89
It is by design.
The native library loader is part of OS. It does not expose events to resolve dependencies like this.
Different OSes provide assorted OS-specific mechanisms to help with these scenarios. For example, there is SetDllDirectoryW on Windows or RPATH on Unix.
@jkotas , thanks mate, that was what I expected.
@dsyme - if you are okay bundling the libtorch native libs with torchsharp then it will work fine:
It produced this.
c:\kevinransom\fsharp>dotnet artifacts\bin\fsi\Debug\netcoreapp3.1\fsi.exe --langversion:preview
Microsoft (R) F# Interactive version 11.0.0.0 for F# 5.0
Copyright (c) Microsoft Corporation. All Rights Reserved.
For help type #help;;
> #r "nuget: DiffSharp-cpu,1.0.0-preview-258177528";;
[Loading C:\Users\codec\AppData\Local\Temp\nuget\12316--989cf7ca-6ba9-4aab-a922-2bf875d5a299\Project.fsproj.fsx]
namespace FSI_0002.Project
>
- open DiffSharp
- dsharp.config(backend=Backend.Torch)
- let t = dsharp.tensor [ 0 .. 10 ];;
Binding session to 'C:\Users\codec\.nuget\packages\diffsharp.core\1.0.0-preview-258177528\lib\netstandard2.1\DiffSharp.Core.dll'...
Binding session to 'C:\Users\codec\.nuget\packages\diffsharp.backends.torch\1.0.0-preview-258177528\lib\netcoreapp3.0\DiffSharp.Backends.Torch.dll'...
Binding session to 'C:\Users\codec\.nuget\packages\torchsharp\0.3.52276\lib\netcoreapp3.0\TorchSharp.dll'...
val t : DiffSharp.Tensor =
Tensor
[0.000000, 1.000000, 2.000000, 3.000000, 4.000000, 5.000000, 6.000000, 7.000000, 8.000000, 9.000000, 10.000000]
>
I won't submit a PR, as I'm not sure how your build works.
Closing as external
OK, thank you, I'll find some kind of resolution.
@cartermp The packages work OK with applications. From what I see there is nothing wrong with the packages as such - the problem is with our dynamic loader, which doesn't handle transitive native references. (Applications handle this by copying all native DLLs to the one directory on build)
I'm not saying it's easy to fix, but it feels like the problem is with us, and could hit us with any packages that rely on transitive native dependencies, so I'll reopen the bug if that's ok.
That said I will try to find a workaround to arrange the TorchSharp native packages so they are non-transitive.
> Different OSes provide assorted OS-specific mechanisms to help with these scenarios. For example, there is SetDllDirectoryW on Windows or RPATH on Unix.
@KevinRansom Given that in F#/.NET Interactive we are loading DLLs directly from the package directories, it does kind of feel like we should be using these mechanisms to augment the native loader load paths. It's hard to see any other systematic way to solve this.
@jkotas Did you mean AddDllDirectory?
> @dsyme - if you are okay bundling the libtorch native libs with torchsharp then it will work fine:
@KevinRansom Unfortunately this is not a practical solution.
There are multiple different runtime native DLLs that work with the same managed DLL - basically CPU and GPU - the end application selects one
The collected native DLLs are too large to fit in one nuget package - they are about 1.5GB for GPU for example. So they must be delivered in multiple packages, because in practice both nuget.org and Azure CI and other things place limits on nuget package size around 200MB.
Tricky problem
I've documented this from the TorchSharp perspective here: https://github.com/xamarin/TorchSharp/issues/169
(I can see that we're not going to make this a high-priority thing for .NET Interactive and F# Interactive unless we hit other packages that have transitive native references.)
Yes, this feels like a very niche thing that is low severity
> @jkotas Did you mean AddDllDirectory?
You are right. AddDllDirectory would be more appropriate for this.
@dsyme, it shouldn't be too hard to make a change to also use this mechanism. I will put something together, hopefully over the weekend. You can let me know if it works for you.
So ... my linux is not great, however, it seems that rpath is a string embedded into the library that has a dependency. This would require TorchSharp to embed this string for Linux, and to the best of my knowledge the Windows dll loader has no equivalent, so we would still need a windows solution.
The linux equivalent of AddDllDirectory is probably LD_LIBRARY_PATH. Which I can set after package resolution, but before dll load. Because it is an environment variable, if developers use fsi to spawn new processes they are also going to see this variable, which is somewhat dll hellish. Although I suppose I could swap it in before we do the load, and back out afterwards. Given that it is a dll load operation, that is bound to be vastly more expensive than swapping out an environment variable.
@jkotas, @dsyme could I ask you both to check my PR, if and when I implement it, and see if it is not too terrible? Thanks
Kevin
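The "swap it in before the load, and back out afterwards" idea could look roughly like the sketch below. It is only an illustration: `withEnvVar` is a hypothetical helper, not anything in the F# compiler, and it does not by itself guarantee that the OS loader picks up the changed variable.

```fsharp
open System

/// Runs `action` with the environment variable `name` temporarily set to `value`,
/// then restores the previous value (or removes the variable if it was unset).
/// Hypothetical helper for illustration only.
let withEnvVar (name: string) (value: string) (action: unit -> 'T) : 'T =
    let previous = Environment.GetEnvironmentVariable(name)   // null when unset
    Environment.SetEnvironmentVariable(name, value)
    try
        action ()
    finally
        // Passing null removes the variable again.
        Environment.SetEnvironmentVariable(name, previous)
```

Child processes started inside `action` would still inherit the temporary value, so this narrows the window rather than removing the concern.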
> @dsyme, it shouldn't be too hard to make a change to also use this mechanism. I will put something together, hopefully over the weekend. You can let me know if it works for you.
It's ok, don't worry. I've come to the conclusion that all these native DLLs need to be in the same directory anyway. They register the "torch implementation directory" in a common registry in some way, and it looks like they all have to be in the same place otherwise we get whacky errors like "Key already registered with the same priority: GroupSpatialSoftmax"
I'll think about what to do. Awkward but hey
Okay mate, may I close this issue?
> it is an environment variable, if developers use fsi to spawn new processes they are also going to see this variable, which is somewhat dll hellish.
Also, setting process environment variables is not thread-safe on Unix, which comes with its own set of problems...
For really off-the-wall shenanigans, from what I'm reading you could use patchelf to change the rpath for a binary before loading it as well. You'd probably want to do some kind of shadow-copy of the binary so that you could munge it without clobbering, though.
@baronfel, lol. That would be super cool but we would prefer not to copy files in #r, if we copied files, we would shove them in the same directory, and wouldn't have an issue. I am sort of thinking of adding an option that will do that for these really tricky scenarios. However, Don doesn't need it anymore so I'm not going to rush to do something, even real cool nerdy stuff :-)
> Okay mate, may I close this issue?
This is still, I think, in some sense a bug in the F# and .NET Interactive loading experience of packages. I think we can only close the issue if we document the limitation.
Are these docs up-to-date? They look a bit dated at first glance? https://github.com/fsharp/fslang-design/blob/master/tooling/FST-1027-fsi-references.md
They don't discuss native dependency resolution at all. If we think it's a bug, I can prepare a fix. I would rather fix it now than in two years' time, when I've forgotten all of this stuff.
Could you please update the specs to include information on #r for packages with native dependencies? The specs should say what is meant to work and what isn't - right now it's a little hard to tell what the intended spec is.
Here's an approximate spec, maybe you can work from this?
Dynamic loading of packages containing native DLLs is supported by adding an event handler to AssemblyLoadContext.Default.ResolvingUnmanagedDll, which is triggered when resolving an unmanaged assembly in the context of a .NET assembly (e.g. a DllImport).
This handler consults current architecture and platform settings plus resolved package metadata and files across all dynamically referenced packages to look for a matching native DLL and then dynamically loads that DLL using an internal NativeAssemblyLoadContext that implements LoadNativeLibrary via LoadUnmanagedDllFromPath.
This process is not triggered for transitive native-to-native references, which are resolved with respect to the native DLL using standard rules of the operating system. Normally this means any transitive native dependencies must sit next to the native DLL at time of load.
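As a rough illustration of the mechanism this spec describes, a handler might be registered along these lines. This is a minimal sketch, not the actual NativeDllResolveHandler code: `probeDirs` and `resolve` are hypothetical names, the probing is reduced to a flat directory list, and it loads via NativeLibrary.Load rather than the internal NativeAssemblyLoadContext.

```fsharp
open System
open System.IO
open System.Reflection
open System.Runtime.InteropServices
open System.Runtime.Loader

// Hypothetical probing roots; a real handler would derive these from the
// resolved package metadata for the current platform and architecture.
let probeDirs : string list = []

// Called when a managed assembly (e.g. via DllImport) asks for a native library.
// It is NOT called for a native library's own native dependencies.
let resolve (_requestingAssembly: Assembly) (name: string) : nativeint =
    probeDirs
    |> List.map (fun dir -> Path.Combine(dir, name))
    |> List.tryFind File.Exists
    |> function
        | Some path -> NativeLibrary.Load(path)
        | None -> IntPtr.Zero   // not found here; let default resolution continue

AssemblyLoadContext.Default.add_ResolvingUnmanagedDll(
    Func<Assembly, string, nativeint>(resolve))
```

Because the handler only ever sees the name requested by managed code, anything that library itself depends on is resolved by the OS loader, which is the limitation stated in the last paragraph of the spec.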
That spec seems pretty clear to me and I'm pretty sure we shouldn't rush to use any native library loading functionality that .NET doesn't provide (even if that means we can't reasonably support transitive loading of native components with native-to-native references across multiple packages). It's just a total can of worms.
I'm just going to have to work out an approach that works for these horrific Torch native components. Most of the complexity is to do with the vast size of the native components involved. So much AI.
@dsyme, I can add a switch that copies files to a single directory on resolution. Sort of like publish lite. That would take care of your issues. And mean you don't have to deal with the size issue. It wouldn't be the default it would be opt-in so normally scripts wouldn't have to deal with the issues.
It would also mean that we have an approach for transitive package dependencies, for when project build works and scripting fails. What do you think?
Kevin
> @dsyme, I can add a switch that copies files to a single directory on resolution. Sort of like publish lite. That would take care of your issues. And mean you don't have to deal with the size issue. It wouldn't be the default it would be opt-in so normally scripts wouldn't have to deal with the issues.
For my use cases the problem is that this doesn't deal with size - we would end up consuming 1-2GB of copy and storage each .NET/F# interactive invocation (when running on the GPU - less for CPU Torch binaries), which is a significant pause time in itself. These native binaries are just vast (even if they don't all get paged in).
I think this is such a special case that we should just settle on where we are at the moment - with the spec above - and find some workaround for TorchSharp.
OK
So for now this is by design, and Don updated the RFC to note the transitive native dependency limitation.
One workaround I have used in the past is to add the native dlls to the system.environment "Path" variable, dynamically in the script.
Here is a snippet I have used to reference native dlls for ML.Net in the past:
open System
// Append each native DLL path to the process PATH so the OS loader can find it.
// Verbatim strings (@"...") stop sequences such as \n and \r being read as escapes.
let path = Environment.GetEnvironmentVariable("path")
let path' =
    path
    + ";" + @"c:\users\admin\.nuget\packages\microsoft.ml\1.5.2\runtimes\win-x64\native\LdaNative.dll"
    + ";" + @"c:\users\admin\.nuget\packages\microsoft.ml.cpumath\1.5.2\runtimes\win-x64\nativeassets\netstandard2.0\CpuMathNative.dll"
    + ";" + @"c:\users\admin\.nuget\packages\microsoft.ml.fasttree\1.5.2\runtimes\win-x64\native\FastTreeNative.dll"
    + ";" + @"c:\users\admin\.nuget\packages\microsoft.ml.mkl.components\1.5.2\runtimes\win-x64\native\SymSgdNative.dll"
    + ";" + @"c:\users\admin\.nuget\packages\microsoft.ml.recommender\0.17.2\runtimes\win-x64\native\MatrixFactorizationNative.dll"
Environment.SetEnvironmentVariable("path", path')
Note that adding the full path to the dll works fine. Maybe just adding directories would also be ok but I have not tested.
Also now with "nuget: ..." style references I don't have to do the above except for one dll that is under the "nativeassets\netstandard2.0" directory.
@fwaris thanks, nativeassets was a new one on me. I will update probing to support it:
https://github.com/NuGet/Home/issues/2782 https://github.com/NuGet/Home/issues/3027#issuecomment-237645144 https://github.com/NuGet/NuGet.Client/commit/aed1d51b4c1190544d9f95bde48b089740309203
@KevinRansom FYI I just built a recommender model in ML.Net with FSI packaged with vs2019 preview. FSI could not find the MatrixFactorizationNative.dll. I had to add the directory to the "PATH" variable. Here is the script to load the packages and set the environment that worked for me:
#r "nuget: Microsoft.ML.AutoML, Version=0.17.2"
#r "nuget: Microsoft.ML.Recommender"
let userProfile = System.Environment.GetEnvironmentVariable("UserProfile")
let packageRoot = $@"{userProfile}\.nuget\packages"
let nativeLib = $@"{packageRoot}\microsoft.ml.cpumath\1.5.2\runtimes\win-x64\nativeassets\netstandard2.0"//CpuMathNative.dll"
let nativeLib2 = $@"{packageRoot}\microsoft.ml.recommender\0.17.2\runtimes\win-x64\native"//MatrixFactorizationNative.dll"
let path = System.Environment.GetEnvironmentVariable("path")
let path' = path + ";" + nativeLib + ";" + nativeLib2
System.Environment.SetEnvironmentVariable("path",path')
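The PATH manipulation in the two scripts above could be folded into a small helper, sketched below. `addToPath` is a hypothetical name, and the approach is assumed to matter mainly on Windows, where the loader consults PATH when resolving dependent DLLs.

```fsharp
open System
open System.IO

/// Prepends the given directories to the process-level PATH, skipping any that
/// are already listed, and returns the new value. Hypothetical helper.
let addToPath (dirs: string list) : string =
    let current =
        Environment.GetEnvironmentVariable("PATH")
        |> Option.ofObj
        |> Option.defaultValue ""
    let existing =
        current.Split([| Path.PathSeparator |], StringSplitOptions.RemoveEmptyEntries)
    let fresh =
        dirs
        |> List.filter (fun d -> not (Array.contains d existing))
        |> List.toArray
    let updated = String.Join(string Path.PathSeparator, Array.append fresh existing)
    Environment.SetEnvironmentVariable("PATH", updated)
    updated
```

With the names from the script above, the last three lines would become `addToPath [ nativeLib; nativeLib2 ] |> ignore`.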
Thanks for the information.
Native DLLs are not being found in the libtorch-cpu package, which is referenced transitively from TorchSharp and DiffSharp.
Analysis
Possible causes, either:
- There is no managed DLL in the libtorch-cpu package (and hence the native resolution logic decides it doesn't need to probe around in that package).
- There is a problem with transitive native dependency. In this repro, there are two native DLLs ("libLibTorchSharp.so" from TorchSharp and libtorch.so from libtorch-cpu). The first is being found but the load is failing due to the transitive reference on the second.
The relevant resolved transitive package versions for the repro are:
Repro steps
A simpler repro might be this (though I'm not certain NativeLibrary.Load triggers resolution using the handlers)
Expected behavior
This works
Actual behavior
Known workarounds
This is a workaround to force the load of the native DLL that is not being found:
On Linux:
Put together these are:
Related information
It's possible there is something wrong with the packages but this works when referenced from a project.