dotnet / fsharp

The F# compiler, F# core library, F# language service, and F# tooling integration for Visual Studio
https://dotnet.microsoft.com/languages/fsharp
MIT License

Native dependency resolution problem for #r nuget #10136

Closed dsyme closed 4 years ago

dsyme commented 4 years ago

Native DLLs are not being found in libtorch-cpu package which is referenced transitively from TorchSharp and DiffSharp

Analysis

Possible causes are either:

  1. There is no managed DLL in the libtorch-cpu package (and hence the native resolution logic decides it doesn't need to probe around in that package)

  2. There is a problem with the transitive native dependency. In this repro there are two native DLLs ("libLibTorchSharp.so" from TorchSharp and "libtorch.so" from libtorch-cpu). The first is being found, but the load fails because of its transitive reference to the second.

The relevant resolved transitive package versions for the repro are:

DiffSharp-cpu,1.0.0-preview-258177528
TorchSharp,0.3.52276
libtorch-cpu,1.5.6

Repro steps

#r "nuget: DiffSharp-cpu,1.0.0-preview-258177528";;

open DiffSharp
dsharp.config(backend=Backend.Torch)
let t = dsharp.tensor [ 0 .. 10 ];;

A simpler repro might be this (though I'm not certain NativeLibrary.Load triggers resolution using the handlers)

#r "nuget: libtorch-cpu,1.5.6";;
System.Runtime.InteropServices.NativeLibrary.Load("torch_cpu")

Expected behavior

This works

Actual behavior

System.DllNotFoundException: Unable to load DLL 'C:/Users/Administrator/.nuget/packages/torchsharp/0.3.52276/runtimes\win-x64\native\LibTorchSharp.dll' or one of its dependencies: The specified module could not be found. (0x8007007E)
   at System.Runtime.Loader.AssemblyLoadContext.InternalLoadUnmanagedDllFromPath(String unmanagedDllPath)
   at System.Runtime.Loader.AssemblyLoadContext.LoadUnmanagedDllFromPath(String unmanagedDllPath)
   at Microsoft.DotNet.DependencyManager.NativeAssemblyLoadContext.LoadNativeLibrary(String path) in C:\GitHub\dsyme\fsharp\src\fsharp\Microsoft.DotNet.DependencyManager\NativeDllResolveHandler.fs:line 46
   at Microsoft.DotNet.DependencyManager.NativeDllResolveHandlerCoreClr._resolveUnmanagedDll(Assembly _arg1, String name) in C:\GitHub\dsyme\fsharp\src\fsharp\Microsoft.DotNet.DependencyManager\NativeDllResolveHandler.fs:line 114
   at <StartupCode$Microsoft-DotNet-DependencyManager>.$NativeDllResolveHandler.-ctor@120-2.Invoke(Assembly delegateArg0, String delegateArg1) in C:\GitHub\dsyme\fsharp\src\fsharp\Microsoft.DotNet.DependencyManager\NativeDllResolveHandler.fs:line 120
   at System.Runtime.Loader.AssemblyLoadContext.GetResolvedUnmanagedDll(Assembly assembly, String unmanagedDllName)
   at System.Runtime.Loader.AssemblyLoadContext.ResolveUnmanagedDllUsingEvent(String unmanagedDllName, Assembly assembly, IntPtr gchManagedAssemblyLoadContext)
   at TorchSharp.Tensor.FloatTensor.THSTensor_newFloatScalar(Single scalar, Boolean requiresGrad)
   at TorchSharp.Tensor.FloatTensor.From(Single scalar, Boolean requiresGrad)
   at <StartupCode$DiffSharp-Backends-Torch>.$Torch.RawTensor.-ctor@900-1.Invoke(Single v)
   at DiffSharp.Backends.Torch.TorchStatics`2.CreateFromFlatArray(Array values, Int32[] shape, Device device)
   at DiffSharp.Tensor.create(Object value, FSharpOption`1 dtype, FSharpOption`1 device, FSharpOption`1 backend)
   at DiffSharp.dsharp.config(FSharpOption`1 dtype, FSharpOption`1 device, FSharpOption`1 backend)
   at <StartupCode$FSI_0003>.$FSI_0003.main@()

Known workarounds

This is a workaround to force the load of the native DLL that is not being found:

System.Runtime.InteropServices.NativeLibrary.Load(@"C:\Users\Administrator\.nuget\packages\libtorch-cpu\1.5.6\runtimes\win-x64\native\torch_cpu.dll");;

#r "nuget: DiffSharp-cpu,1.0.0-preview-258177528";;

open DiffSharp
dsharp.config(backend=Backend.Torch)
let t = dsharp.tensor [ 0 .. 10 ];;

On Linux:

System.Runtime.InteropServices.NativeLibrary.Load(@"/home/jovyan/.nuget/packages/libtorch-cpu/1.5.6/runtimes/linux-x64/native/libtorch.so")

Put together these are:

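// Directory of the resolved DiffSharp assembly inside the NuGet package cache
let path1 = System.IO.Path.GetDirectoryName(typeof<DiffSharp.dsharp>.Assembly.Location)
// Go up four levels to the packages root, then down into the libtorch-cpu native folder for the current OS
let path2 =
    if System.Runtime.InteropServices.RuntimeInformation.IsOSPlatform(System.Runtime.InteropServices.OSPlatform.Linux) then
       path1 + "/../../../../libtorch-cpu/1.5.6/runtimes/linux-x64/native/libtorch.so"
    else
       path1 + "/../../../../libtorch-cpu/1.5.6/runtimes/win-x64/native/torch_cpu.dll"
// Pre-load the transitive native dependency so it is already in the process when LibTorchSharp needs it
System.Runtime.InteropServices.NativeLibrary.Load(path2)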

Related information

It's possible there is something wrong with the packages but this works when referenced from a project.

cartermp commented 4 years ago

Repros on RC1 for me

KevinRansom commented 4 years ago

@dsyme, this is really cool. I will see what I can do.

KevinRansom commented 4 years ago

This repros quite nicely on Windows: I get ...

> #r "nuget: DiffSharp-cpu,1.0.0-preview-258177528";;
[Loading C:\Users\codec\AppData\Local\Temp\nuget\4456--e5f48bb6-ed4f-4c75-ba1c-dba1ef125698\Project.fsproj.fsx]
namespace FSI_0002.Project

>
- open DiffSharp
- dsharp.config(backend=Backend.Torch)
- let t = dsharp.tensor [ 0 .. 10 ];;
Binding session to 'C:\Users\codec\.nuget\packages\diffsharp.core\1.0.0-preview-258177528\lib\netstandard2.1\DiffSharp.Core.dll'...
Binding session to 'C:\Users\codec\.nuget\packages\diffsharp.backends.torch\1.0.0-preview-258177528\lib\netcoreapp3.0\DiffSharp.Backends.Torch.dll'...
Binding session to 'C:\Users\codec\.nuget\packages\torchsharp\0.3.52276\lib\netcoreapp3.0\TorchSharp.dll'...
System.DllNotFoundException: Unable to load DLL 'C:/Users/codec/.nuget/packages/torchsharp/0.3.52276/runtimes\win-x64\native\LibTorchSharp.dll' or one of its dependencies: The specified module could not be found. (0x8007007E)
   at System.Runtime.Loader.AssemblyLoadContext.InternalLoadUnmanagedDllFromPath(String unmanagedDllPath)
   at System.Runtime.Loader.AssemblyLoadContext.LoadUnmanagedDllFromPath(String unmanagedDllPath)
   at Microsoft.DotNet.DependencyManager.NativeAssemblyLoadContext.LoadNativeLibrary(String path) in C:\kevinransom\fsharp\src\fsharp\Microsoft.DotNet.DependencyManager\NativeDllResolveHandler.fs:line 46
   at Microsoft.DotNet.DependencyManager.NativeDllResolveHandlerCoreClr._resolveUnmanagedDll(Assembly _arg1, String name) in C:\kevinransom\fsharp\src\fsharp\Microsoft.DotNet.DependencyManager\NativeDllResolveHandler.fs:line 114
   at <StartupCode$Microsoft-DotNet-DependencyManager>.$NativeDllResolveHandler.-ctor@120-2.Invoke(Assembly delegateArg0, String delegateArg1) in C:\kevinransom\fsharp\src\fsharp\Microsoft.DotNet.DependencyManager\NativeDllResolveHandler.fs:line 120
   at System.Runtime.Loader.AssemblyLoadContext.GetResolvedUnmanagedDll(Assembly assembly, String unmanagedDllName)
   at System.Runtime.Loader.AssemblyLoadContext.ResolveUnmanagedDllUsingEvent(String unmanagedDllName, Assembly assembly, IntPtr gchManagedAssemblyLoadContext)
   at TorchSharp.Tensor.FloatTensor.THSTensor_newFloatScalar(Single scalar, Boolean requiresGrad)
   at TorchSharp.Tensor.FloatTensor.From(Single scalar, Boolean requiresGrad)
   at <StartupCode$DiffSharp-Backends-Torch>.$Torch.RawTensor.-ctor@900-1.Invoke(Single v)
   at DiffSharp.Backends.Torch.TorchStatics`2.CreateFromFlatArray(Array values, Int32[] shape, Device device)
   at DiffSharp.Tensor.create(Object value, FSharpOption`1 dtype, FSharpOption`1 device, FSharpOption`1 backend)
   at DiffSharp.dsharp.config(FSharpOption`1 dtype, FSharpOption`1 device, FSharpOption`1 backend)
   at <StartupCode$FSI_0003>.$FSI_0003.main@()
Stopped due to error
>
-

KevinRansom commented 4 years ago

So ... transitive native dependencies are not resolvable using the native resolution event handler. I am fairly confident that it only works for managed code that has native dependencies. It looks to me like it was primarily designed to make P/Invoke work.

KevinRansom commented 4 years ago

@jkotas

With the AssemblyLoadContext.ResolvingUnmanagedDll event handler, we are not being notified of native DLL load attempts that are caused by a native dependency of a native library, on either Windows or Linux. Is that "ByDesign", or is there a mechanism we can use that will allow us to detect attempts to transitively load native DLLs?

In our example above we have a managed library that loads a native library, 'LibTorchSharp.dll', which itself has a native dependency on another library: torch_cpu.dll

We get notified to locate the torchsharp dependency but not the torch_cpu.dll one.

Not that it will be much help but our handler is here: https://github.com/dotnet/fsharp/blob/main/src/fsharp/Microsoft.DotNet.DependencyManager/NativeDllResolveHandler.fs#L89

jkotas commented 4 years ago

It is by design.

The native library loader is part of OS. It does not expose events to resolve dependencies like this.

Different OSes provide assorted OS-specific mechanisms to help with these scenarios. For example, there is SetDllDirectoryW on Windows or RPATH on Unix.

KevinRansom commented 4 years ago

@jkotas, thanks mate, that was what I expected.

KevinRansom commented 4 years ago

@dsyme - if you are okay bundling the libtorch native libs with torchsharp then it will work fine:

It produced this.

c:\kevinransom\fsharp>dotnet artifacts\bin\fsi\Debug\netcoreapp3.1\fsi.exe --langversion:preview

Microsoft (R) F# Interactive version 11.0.0.0 for F# 5.0
Copyright (c) Microsoft Corporation. All Rights Reserved.

For help type #help;;

> #r "nuget: DiffSharp-cpu,1.0.0-preview-258177528";;
[Loading C:\Users\codec\AppData\Local\Temp\nuget\12316--989cf7ca-6ba9-4aab-a922-2bf875d5a299\Project.fsproj.fsx]
namespace FSI_0002.Project

>
- open DiffSharp
- dsharp.config(backend=Backend.Torch)
- let t = dsharp.tensor [ 0 .. 10 ];;
Binding session to 'C:\Users\codec\.nuget\packages\diffsharp.core\1.0.0-preview-258177528\lib\netstandard2.1\DiffSharp.Core.dll'...
Binding session to 'C:\Users\codec\.nuget\packages\diffsharp.backends.torch\1.0.0-preview-258177528\lib\netcoreapp3.0\DiffSharp.Backends.Torch.dll'...
Binding session to 'C:\Users\codec\.nuget\packages\torchsharp\0.3.52276\lib\netcoreapp3.0\TorchSharp.dll'...
val t : DiffSharp.Tensor =
  Tensor
    [0.000000, 1.000000, 2.000000, 3.000000, 4.000000, 5.000000, 6.000000, 7.000000, 8.000000, 9.000000, 10.000000]

>

I won't submit a PR; I'm not sure how your build works.

cartermp commented 4 years ago

Closing as external

dsyme commented 4 years ago

OK, thank you, I'll find some kind of resolution.

dsyme commented 4 years ago

@cartermp The packages work OK with applications. From what I see there is nothing wrong with the packages as such - the problem is with our dynamic loader, which doesn't handle transitive native references. (Applications handle this by copying all native DLLs to the one directory on build)

I'm not saying it's easy to fix, but it feels like the problem is with us, and could hit us with any packages that rely on transitive native dependencies, so I'll reopen the bug if that's ok.

That said I will try to find a workaround to arrange the TorchSharp native packages so they are non-transitive.

> Different OSes provide assorted OS-specific mechanisms to help with these scenarios. For example, there is SetDllDirectoryW on Windows or RPATH on Unix.

@KevinRansom Given that in F#/.NET Interactive we are loading DLLs directly from the package directories, it does kind of feel like we should be using these mechanisms to augment the native loader's load paths. It's hard to see any other systematic way to solve this.

@jkotas Did you mean AddDllDirectory?

> @dsyme - if you are okay bundling the libtorch native libs with torchsharp then it will work fine:

@KevinRansom Unfortunately this is not a practical solution.

  1. There are multiple different runtime native DLLs that work with the same managed DLL - basically CPU and GPU - the end application selects one

  2. The collected native DLLs are too large to fit in one nuget package - they are about 1.5GB for GPU for example. So they must be delivered in multiple packages, because in practice both nuget.org and Azure CI and other things place limits on nuget package size around 200MB.

Tricky problem

dsyme commented 4 years ago

I've documented this from the TorchSharp perspective here: https://github.com/xamarin/TorchSharp/issues/169

(I can see that we're not going to make this a high-priority thing for .NET Interactive and F# Interactive unless we hit other packages that have transitive native references.)

cartermp commented 4 years ago

Yes, this feels like a very niche thing that is low severity

jkotas commented 4 years ago

> @jkotas Did you mean AddDllDirectory?

You are right. AddDllDirectory would be more appropriate for this.
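
For reference, a minimal sketch of what calling AddDllDirectory from an F# script could look like. The helper name `tryAddNativeDirectory` is hypothetical and not part of FSI. The Windows documentation states that directories registered this way are searched only when a load uses LOAD_LIBRARY_SEARCH_USER_DIRS (directly, or through SetDefaultDllDirectories), so whether this alone helps depends on how the runtime issues the load. This is untested against the repro above.

```fsharp
open System
open System.IO
open System.Runtime.InteropServices

// kernel32 export: adds a directory to the process's user DLL search list.
// Returns an opaque cookie, or IntPtr.Zero on failure.
[<DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)>]
extern IntPtr AddDllDirectory(string newDirectory)

// Hypothetical helper: register one package's native folder.
// Does nothing (and returns false) off Windows or when the folder is missing.
let tryAddNativeDirectory (dir: string) : bool =
    if RuntimeInformation.IsOSPlatform(OSPlatform.Windows) && Directory.Exists(dir) then
        // AddDllDirectory requires an absolute path.
        let cookie = AddDllDirectory(Path.GetFullPath(dir))
        cookie <> IntPtr.Zero
    else
        false
```

The cookie is discarded here for brevity; a real implementation would probably keep it so the directory can later be removed with RemoveDllDirectory.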

KevinRansom commented 4 years ago

@dsyme, it shouldn't be too hard to make a change to also use this mechanism. I will put something together, hopefully over the weekend. You can let me know if it works for you.

So ... my Linux is not great; however, it seems that rpath is a string embedded into the library that has a dependency. This would require TorchSharp to embed this string for Linux, and to the best of my knowledge the Windows DLL loader has no equivalent, so we would still need a Windows solution.

The Linux equivalent of AddDllDirectory is probably LD_LIBRARY_PATH, which I can set after package resolution but before the DLL load. Because it is an environment variable, if developers use fsi to spawn new processes they are also going to see this variable, which is somewhat DLL-hellish. Although I suppose I could swap it in before we do the load and back out afterwards. Given that it is a DLL load operation, the load itself is bound to be vastly more expensive than swapping out an environment variable.
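
A minimal sketch of the "swap it in, then back out" idea, assuming a hypothetical helper named `withPrependedEnv` (nothing like it exists in FSI). It shows only the swap pattern. On glibc-based Linux the dynamic loader generally reads LD_LIBRARY_PATH when the process starts, so changing it afterwards may not affect loads in the same process.

```fsharp
open System
open System.IO

// Hypothetical helper: run `load` with extra directories prepended to the
// named environment variable, then restore the previous value.
let withPrependedEnv (name: string) (dirs: string list) (load: unit -> 'T) : 'T =
    let original = Environment.GetEnvironmentVariable name
    let parts = dirs @ (if String.IsNullOrEmpty original then [] else [ original ])
    let augmented = String.concat (string Path.PathSeparator) parts
    Environment.SetEnvironmentVariable(name, augmented)
    try
        load ()
    finally
        // Passing null deletes the variable when it was not set before the call.
        Environment.SetEnvironmentVariable(name, original)
```

Usage would look like `withPrependedEnv "LD_LIBRARY_PATH" [ nativeDir ] (fun () -> NativeLibrary.Load libPath)`, where `nativeDir` and `libPath` are placeholders for a resolved package folder and library path.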

@jkotas, @dsyme could I ask you both to check my PR, if and when I implement it, and see if it is not too terrible? Thanks

Kevin

dsyme commented 4 years ago

> @dsyme, it shouldn't be too hard to make a change to also use this mechanism. I will put something together, hopefully over the weekend. You can let me know if it works for you.

It's ok, don't worry. I've come to the conclusion that all these native DLLs need to be in the same directory anyway. They register the "torch implementation directory" in a common registry in some way, and it looks like they all have to be in the same place otherwise we get whacky errors like "Key already registered with the same priority: GroupSpatialSoftmax"

I'll think about what to do. Awkward but hey

KevinRansom commented 4 years ago

Okay mate, may I close this issue?

jkotas commented 4 years ago

> it is an environment variable, if developers use fsi to spawn new processes they are also going to see this variable, which is somewhat dll hellish.

Also, setting process environment variables is not thread-safe on Unix, which comes with its own set of problems...

baronfel commented 4 years ago

For really off-the-wall shenanigans, from what I'm reading you could use patchelf to change the rpath for a binary before loading it as well. You'd probably want to do some kind of shadow-copy of the binary so that you could munge it without clobbering, though.

KevinRansom commented 4 years ago

@baronfel, lol. That would be super cool but we would prefer not to copy files in #r, if we copied files, we would shove them in the same directory, and wouldn't have an issue. I am sort of thinking of adding an option that will do that for these really tricky scenarios. However, Don doesn't need it anymore so I'm not going to rush to do something, even real cool nerdy stuff :-)

dsyme commented 4 years ago

> Okay mate, may I close this issue?

This is still, I think, in some sense a bug in the F# and .NET Interactive loading experience of packages. I think we can only close the issue if we document the limitation.

Are these docs up-to-date? They look a bit dated at first glance? https://github.com/fsharp/fslang-design/blob/master/tooling/FST-1027-fsi-references.md

KevinRansom commented 4 years ago

They don't discuss native dependency resolution at all. If we think it's a bug, I can prepare a fix. I would rather fix it now than in two years' time, when I've forgotten all of this stuff.

dsyme commented 4 years ago

Could you please update the specs to include information on #r for packages with native dependencies? The specs should say what is meant to work and what isn't. Right now it's a little hard to tell what the intended spec is.

Here's an approximate spec, maybe you can work from this?


Spec: Dynamic loading of packages containing native DLLs

Dynamic loading of packages containing native DLLs is supported by adding an event handler to AssemblyLoadContext.Default.ResolvingUnmanagedDll, which is triggered when resolving an unmanaged assembly in the context of a .NET assembly (e.g. a DllImport).

This handler consults current architecture and platform settings plus resolved package metadata and files across all dynamically referenced packages to look for a matching native DLL and then dynamically loads that DLL using an internal NativeAssemblyLoadContext that implements LoadNativeLibrary via LoadUnmanagedDllFromPath.

This process is not triggered for transitive native-to-native references, which are resolved with respect to the native DLL using standard rules of the operating system. Normally this means any transitive native dependencies must sit next to the native DLL at time of load.
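
To illustrate the shape of the mechanism described in this spec, here is a simplified sketch. It is not the actual NativeDllResolveHandler implementation, and `nativeProbeDirs`, `candidateNames` and `resolveNative` are hypothetical names. It assumes the native folders of the resolved packages are already known.

```fsharp
open System
open System.IO
open System.Reflection
open System.Runtime.InteropServices
open System.Runtime.Loader

// Hypothetical: native folders collected from the resolved packages.
let nativeProbeDirs : string list = []

// Candidate file names for a DllImport name on the current platform.
let candidateNames (name: string) : string list =
    if RuntimeInformation.IsOSPlatform(OSPlatform.Windows) then [ name; name + ".dll" ]
    elif RuntimeInformation.IsOSPlatform(OSPlatform.OSX) then [ name; "lib" + name + ".dylib"; name + ".dylib" ]
    else [ name; "lib" + name + ".so"; name + ".so" ]

// Returns a handle for the first candidate file that exists and loads,
// or IntPtr.Zero so the runtime reports its usual failure.
let resolveNative (dirs: string list) (name: string) : IntPtr =
    seq {
        for dir in dirs do
            for file in candidateNames name do
                yield Path.Combine(dir, file) }
    |> Seq.filter (fun path -> File.Exists(path))
    |> Seq.tryPick (fun path ->
        match NativeLibrary.TryLoad(path) with
        | true, handle -> Some handle
        | _ -> None)
    |> Option.defaultValue IntPtr.Zero

// Only lookups made on behalf of a managed assembly (e.g. a DllImport) reach
// this handler; native-to-native dependencies are resolved by the OS loader.
AssemblyLoadContext.Default.add_ResolvingUnmanagedDll(
    Func<Assembly, string, IntPtr>(fun _ name -> resolveNative nativeProbeDirs name))
```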


That spec seems pretty clear to me and I'm pretty sure we shouldn't rush to use any native library loading functionality that .NET doesn't provide (even if that means we can't reasonably support transitive loading of native components with native-to-native references across multiple packages). It's just a total can of worms.

I'm just going to have to work out an approach that works for these horrific Torch native components. Most of the complexity is to do with the vast size of the native components involved. So much AI.

KevinRansom commented 4 years ago

@dsyme, I can add a switch that copies files to a single directory on resolution. Sort of like publish lite. That would take care of your issues. And mean you don't have to deal with the size issue. It wouldn't be the default it would be opt-in so normally scripts wouldn't have to deal with the issues.

It would also mean that we have an approach for transitive package dependencies, for when project build works and scripting fails. What do you think?

Kevin

dsyme commented 4 years ago

> @dsyme, I can add a switch that copies files to a single directory on resolution. Sort of like publish lite. That would take care of your issues. And mean you don't have to deal with the size issue. It wouldn't be the default it would be opt-in so normally scripts wouldn't have to deal with the issues.

For my use cases the problem is that this doesn't deal with size - we would end up consuming 1-2GB of copy and storage each .NET/F# interactive invocation (when running on the GPU - less for CPU Torch binaries), which is a significant pause time in itself. These native binaries are just vast (even if they don't all get paged in).

I think this is such a special case that we should just settle on where we are at the moment - with the spec above - and find some workaround for TorchSharp.

KevinRansom commented 4 years ago

OK

KevinRansom commented 4 years ago

So for now this is by design, and Don updated the RFC to note the transitive native dependency limitation.

fwaris commented 3 years ago

One workaround I have used in the past is to add the native dlls to the system.environment "Path" variable, dynamically in the script.

Here is a snippet I have used to reference native dlls for ML.Net in the past:

open System
let path = Environment.GetEnvironmentVariable("path")

// Verbatim strings (@"...") keep backslashes literal; in a regular string, sequences such as \r and \n are escapes
let path' = 
    path 
    + ";" + @"c:\users\admin\.nuget\packages\microsoft.ml\1.5.2\runtimes\win-x64\native\LdaNative.dll" 
    + ";" + @"c:\users\admin\.nuget\packages\microsoft.ml.cpumath\1.5.2\runtimes\win-x64\nativeassets\netstandard2.0\CpuMathNative.dll" 
    + ";" + @"c:\users\admin\.nuget\packages\microsoft.ml.fasttree\1.5.2\runtimes\win-x64\native\FastTreeNative.dll" 
    + ";" + @"c:\users\admin\.nuget\packages\microsoft.ml.mkl.components\1.5.2\runtimes\win-x64\native\SymSgdNative.dll" 
    + ";" + @"c:\users\admin\.nuget\packages\microsoft.ml.recommender\0.17.2\runtimes\win-x64\native\MatrixFactorizationNative.dll" 

Environment.SetEnvironmentVariable("path",path')

Note that adding the full path to the dll works fine. Maybe just adding directories would also be ok but I have not tested.

Also now with "nuget: ..." style references I don't have to do the above except for one dll that is under the "nativeassets\netstandard2.0" directory.

KevinRansom commented 3 years ago

@fwaris thanks, nativeassets was a new one on me. I will update probing to support it:

https://github.com/NuGet/Home/issues/2782 https://github.com/NuGet/Home/issues/3027#issuecomment-237645144 https://github.com/NuGet/NuGet.Client/commit/aed1d51b4c1190544d9f95bde48b089740309203

fwaris commented 3 years ago

@KevinRansom FYI I just built a recommender model in ML.Net with FSI packaged with vs2019 preview. FSI could not find the MatrixFactorizationNative.dll. I had to add the directory to the "PATH" variable. Here is the script to load the packages and set the environment that worked for me:

#r "nuget: Microsoft.ML.AutoML, Version=0.17.2" 
#r "nuget: Microsoft.ML.Recommender"

let userProfile = System.Environment.GetEnvironmentVariable("UserProfile")
let packageRoot = $@"{userProfile}\.nuget\packages"
let nativeLib =  $@"{packageRoot}\microsoft.ml.cpumath\1.5.2\runtimes\win-x64\nativeassets\netstandard2.0"//CpuMathNative.dll"
let nativeLib2 = $@"{packageRoot}\microsoft.ml.recommender\0.17.2\runtimes\win-x64\native"//MatrixFactorizationNative.dll"
let path = System.Environment.GetEnvironmentVariable("path")
let path' =  path + ";" + nativeLib + ";" + nativeLib2
System.Environment.SetEnvironmentVariable("path",path')
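
The same idea can be sketched as a small reusable helper. `packageRoot` and `addToPath` are hypothetical names for illustration. The sketch assumes the NuGet global packages folder is either given by the NUGET_PACKAGES environment variable or sits at the default location under the user profile, and it only adds directories that exist and are not already listed.

```fsharp
open System
open System.IO

// Global packages folder: NUGET_PACKAGES overrides the default <user profile>/.nuget/packages.
let packageRoot =
    match Environment.GetEnvironmentVariable "NUGET_PACKAGES" with
    | null | "" ->
        Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.UserProfile), ".nuget", "packages")
    | custom -> custom

// Hypothetical helper: append existing, not-yet-listed directories to PATH
// and return the directories that were actually added.
let addToPath (dirs: string list) : string list =
    let current =
        match Environment.GetEnvironmentVariable "PATH" with
        | null -> ""
        | value -> value
    let existing = current.Split(Path.PathSeparator) |> Set.ofArray
    let extra = dirs |> List.filter (fun d -> Directory.Exists(d) && not (existing.Contains d))
    if not extra.IsEmpty then
        let parts = (if current = "" then [] else [ current ]) @ extra
        Environment.SetEnvironmentVariable("PATH", String.concat (string Path.PathSeparator) parts)
    extra
```

For the recommender case above this would be called as `addToPath [ Path.Combine(packageRoot, "microsoft.ml.recommender", "0.17.2", "runtimes", "win-x64", "native") ]`.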

KevinRansom commented 3 years ago

Thanks for the information.