Closed by dsyme 3 years ago
Using this to debug a local build
#i @"nuget: E:\GitHub\dsyme\TorchSharp\bin/packages/Debug";;
#r @"nuget: TorchSharp,0.3.0-local-Debug-20200918";;
TorchSharp.Torch.IsCudaAvailable();;
dotnet artifacts\bin\fsi\Debug\netcoreapp3.1\fsi.exe /langversion:preview < a.fs
FYI: When using the 'lite' version, the following seems to work for GPU.
#r "nuget: DiffSharp-lite,1.0.0-preview-485581354"
System.Runtime.InteropServices.NativeLibrary.Load(@"D:\libtorch\lib\torch_cuda.dll")
Here the libtorch binaries were downloaded and installed separately from https://pytorch.org/
Using a colab GPU-enabled notebook to look into this
Install .NET SDK
!wget https://packages.microsoft.com/config/ubuntu/18.04/packages-microsoft-prod.deb -O packages-microsoft-prod.deb && sudo dpkg -i packages-microsoft-prod.deb && sudo apt-get update && sudo apt-get install -y apt-transport-https && sudo apt-get update && sudo apt-get install -y dotnet-sdk-5.0
!dotnet --version
Get package
!echo "printfn \"phase0\"" > foo.fsx
!echo "#r \"nuget: DiffSharp-cpu, 1.0.0-preview-681551353\";;" >> foo.fsx
!cat foo.fsx
!dotnet fsi foo.fsx
Investigate dependencies
!ls /root/.nuget/packages/libtorch-cpu/1.8.0.7/runtimes/linux-x64/native
!ls /root/.nuget/packages/torchsharp/0.91.52458/runtimes/linux-x64/native/
!echo LD_LIBRARY_PATH=$LD_LIBRARY_PATH
!ldd /root/.nuget/packages/torchsharp/0.91.52458/runtimes/linux-x64/native/libLibTorchSharp.so
This reveals:
/root/.nuget/packages/torchsharp/0.91.52458/runtimes/linux-x64/native/libLibTorchSharp.so: /lib/x86_64-linux-gnu/libpthread.so.0: version GLIBC_2.30 not found (required by /root/.nuget/packages/torchsharp/0.91.52458/runtimes/linux-x64/native/libLibTorchSharp.so)
This is an Ubuntu 18.04 problem - need to investigate where this dependency is coming from
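One way to narrow down where such a dependency comes from is to list the GLIBC symbol versions a library references and compare the highest one with what the system provides. The following is a minimal sketch, assuming GNU binutils (`objdump`) and GNU coreutils (`sort -V`) are installed; the helper name `max_glibc_version` is hypothetical and not part of any tool mentioned in this thread.

```shell
#!/bin/sh
# Hypothetical helper: read `objdump -T` output on stdin and print the
# highest GLIBC_x.y symbol version that appears in it.
max_glibc_version() {
    grep -o 'GLIBC_[0-9][0-9.]*' |   # keep only numeric GLIBC version tags
        sed 's/^GLIBC_//' |          # strip the prefix, leaving e.g. 2.30
        sort -u -V |                 # version-aware sort (GNU extension)
        tail -n 1                    # the last entry is the highest version
}

# Usage, with the library path from the ldd call above:
#   lib=/root/.nuget/packages/torchsharp/0.91.52458/runtimes/linux-x64/native/libLibTorchSharp.so
#   objdump -T "$lib" | max_glibc_version   # highest GLIBC version the library references
#   ldd --version | head -n 1               # glibc version the system provides
```

If the first number is higher than the second, the library was built against a newer glibc than the machine has, which matches the GLIBC_2.30 error shown above.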
Try explicit NativeLibrary.Load:
!echo "printfn \"phase1\"" > foo.fsx
!echo "open System.Runtime.InteropServices" >> foo.fsx
!echo "NativeLibrary.Load(\"/root/.nuget/packages/libtorch-cpu/1.8.0.7/runtimes/linux-x64/native/libtorch.so\") |> printfn \"%A\";;" >> foo.fsx
!echo "NativeLibrary.Load(\"/root/.nuget/packages/torchsharp/0.91.52458/runtimes/linux-x64/native/libLibTorchSharp.so\") |> printfn \"%A\";;" >> foo.fsx
!echo "printfn \"phase2\"" >> foo.fsx
!echo "DiffSharp.dsharp.devices(backend=DiffSharp.Backend.Torch) |> printfn \"%A\"" >> foo.fsx
!cat foo.fsx
!dotnet fsi foo.fsx
Hack to update to GLIBC_2.30 on the Colab machine
!echo "deb http://ftp.us.debian.org/debian testing main contrib non-free" >> /etc/apt/sources.list
!apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 04EE7237B7D453EC
!apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 648ACFD622F3D138
!apt-get update
!apt-get install build-essential -y
This problem is now fixed. I've left a notebook about the investigation in the DiffSharp repo: https://github.com/DiffSharp/DiffSharp/blob/dev/notebooks/debug/NativeCudaLoadLinux.ipynb
There are problems using LibTorch packages from .NET Interactive and F# Interactive on Linux because the native libraries are not unified into a single directory.
As a workaround you can avoid using the packages and load directly:
The problem has also been reported previously on Windows but is currently believed to be fixed. If not, likewise use
Examples:
or
The problem is that .NET Interactive and F# Interactive load DLLs directly from package directories, instead of from a collected application directory. For managed DLLs this works OK, but native DLLs do not load transitive dependencies unless load paths are set up.
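On Linux, one general way to "set up load paths" is to put every package's native directory on `LD_LIBRARY_PATH` before the process starts. The sketch below only illustrates that mechanism; it is not confirmed in this thread as a fix for the issue, and the helper name `native_lib_dirs` is hypothetical. It assumes the `runtimes/linux-x64/native` package layout seen in the listings above.

```shell
#!/bin/sh
# Hypothetical helper: print a colon-separated list of all
# runtimes/linux-x64/native directories under a package root that
# contain at least one shared library.
native_lib_dirs() {
    find "$1" -type f -path '*/runtimes/linux-x64/native/*.so*' \
        -exec dirname {} \; |   # directory of each shared library found
        sort -u |               # one entry per directory
        paste -s -d: -          # join the entries with ':'
}

# Usage: the dynamic loader reads LD_LIBRARY_PATH at process start, so it
# has to be set on the command that launches F# Interactive:
#   LD_LIBRARY_PATH="$(native_lib_dirs /root/.nuget/packages)" dotnet fsi foo.fsx
```

Scanning the whole package cache picks up every version of every package, so in practice the root would likely need to be narrowed to the specific package versions in use.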
This is a general issue with the package load process used by .NET/F# Interactive, see https://github.com/dotnet/fsharp/issues/10136. We may be able to work around the issue here, though it is challenging, for two reasons:
There are multiple different runtime native DLLs that work with the same managed DLL (basically CPU and GPU), and the end application selects one.
The collected native DLLs are too large to fit in one nuget package: they are about 1.5GB for GPU, for example. So they must be delivered in multiple packages, because in practice nuget.org, Azure CI and other services limit nuget package size to around 200MB.
Together these mean that the native DLLs end up scattered in different package directories.
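To illustrate what "unified into a single directory" could look like on Linux, here is a rough sketch that symlinks the shared libraries from several package directories into one place. The helper name `collect_native_libs` is hypothetical, and the thread does not establish whether this is enough for the GPU case; copying the files instead of linking them may be needed if the loader resolves dependencies relative to the link targets.

```shell
#!/bin/sh
# Hypothetical helper: symlink every shared library found directly inside
# the given source directories into one destination directory.
# Pass absolute source paths so the links resolve from anywhere.
collect_native_libs() {
    dest=$1
    shift
    mkdir -p "$dest"
    for dir in "$@"; do
        for lib in "$dir"/*.so*; do
            [ -e "$lib" ] || continue   # glob matched nothing in this directory
            ln -sf "$lib" "$dest/"      # link keeps the library's file name
        done
    done
}

# Usage, with the package directories from the listings above:
#   collect_native_libs /tmp/native \
#       /root/.nuget/packages/libtorch-cpu/1.8.0.7/runtimes/linux-x64/native \
#       /root/.nuget/packages/torchsharp/0.91.52458/runtimes/linux-x64/native
```

Symlinks avoid duplicating binaries that are already very large, which matters given the sizes mentioned above.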
A workaround used in DiffSharp for the CPU case is to force the load of libtorch.so (Linux) or torch_cpu.dll (Windows) before any other loads are requested. However, this workaround doesn't work for the GPU case. Although awkward, it is probably worth developing a similar workaround for the GPU case and adding both to the platform initialization logic of TorchSharp. In practice, the especially large size of the corresponding native binaries makes this library a particular challenge.
There is also the general issue of long download times on first use, which is potentially very significant for notebook startup times on a container (though likely ok if running in a data centre).