dotnet / TorchSharp

A .NET library that provides access to the library that powers PyTorch.
MIT License

Trouble loading CUDA support under dotnet-interactive (C#) #1146

Open tombatron opened 1 year ago

tombatron commented 1 year ago

Hi there!

This may be related to #345, so please bear with me.

I'm trying to use TorchSharp with dotnet-interactive in a Jupyter notebook, and I'm encountering the following behavior:

[screenshot not preserved]

I'm running my setup through Docker, so I wondered whether the problem was there. To check, I made a quick console application to test "connectivity" with my GPU.

[screenshot not preserved]

I'm kind of struggling to get my arms around the issue. What are some next steps I could take?

Cheers!

NiklasGustafsson commented 1 year ago

I've tried to reproduce this problem with WSL, but I'm running into a very different problem, one that doesn't even get as far as calling is_available().

NiklasGustafsson commented 1 year ago

It's worth trying -- and this is a total shot in the dark -- to delete everything *torch* under ~/.nuget/packages/ and then try again. I wonder if there's some sort of package confusion going on when running with .NET Interactive.
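
A minimal sketch of that cleanup, for anyone who wants to script it. The function name is hypothetical, and the cache path is a parameter so it can be tried on a scratch directory before pointing it at the real cache:

```shell
# Remove every top-level package folder whose name contains "torch" from a
# NuGet package cache. Defaults to ~/.nuget/packages, the path named above.
purge_torch_packages() {
    local packages_dir="${1:-$HOME/.nuget/packages}"
    if [ ! -d "$packages_dir" ]; then
        echo "no such directory: $packages_dir" >&2
        return 1
    fi
    # -mindepth/-maxdepth 1 limits the match to the package folders themselves;
    # -iname makes the *torch* match case-insensitive.
    find "$packages_dir" -mindepth 1 -maxdepth 1 -type d -iname '*torch*' \
        -exec rm -rf {} +
}
```

Calling purge_torch_packages with no argument targets ~/.nuget/packages. The removed packages should be downloaded again on the next restore.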

tombatron commented 1 year ago

Yeah that didn't seem to have any impact. :\

Here is a directory listing of my .nuget directory on the Jupyter server:

drwxr-sr-x 3 jovyan users 4096 Nov 14 15:07 google.protobuf
drwxr-sr-x 3 jovyan users 4096 Nov 14 15:07 ilgpu
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64-part1
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64-part2-fragment1
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64-part2-primary
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64-part3-fragment1
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64-part3-fragment2
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64-part3-fragment3
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64-part3-primary
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64-part4-fragment1
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64-part4-primary
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64-part5-fragment1
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64-part5-primary
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64-part6
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64-part7
drwxr-sr-x 3 jovyan users 4096 Nov 14 15:07 sharpziplib
drwxr-sr-x 3 jovyan users 4096 Nov 14 15:07 skiasharp
drwxr-sr-x 3 jovyan users 4096 Nov 14 15:07 skiasharp.nativeassets.macos
drwxr-sr-x 3 jovyan users 4096 Nov 14 15:07 skiasharp.nativeassets.win32
drwxr-sr-x 3 jovyan users 4096 Nov 14 15:07 system.memory
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 torchsharp
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 torchsharp-cuda-linux

Here is the error message:

System.TypeInitializationException: The type initializer for 'TorchSharp.torch' threw an exception.
 ---> System.NotSupportedException: The libtorch-cpu-linux-x64 package version 2.1.0.1 is not restored on this system. If using F# Interactive or .NET Interactive you may need to add a reference to this package, e.g. 
    #r "nuget: libtorch-cpu-linux-x64, 2.1.0.1". Trace from LoadNativeBackend:

TorchSharp: LoadNativeBackend: Initialising native backend, useCudaBackend = False

Step 1 - First try regular load of native libtorch binaries.

    Trying to load native component torch_cpu relative to /home/jovyan/.nuget/packages/torchsharp/0.101.2/lib/net6.0/TorchSharp.dll
    Failed to load native component torch_cpu relative to /home/jovyan/.nuget/packages/torchsharp/0.101.2/lib/net6.0/TorchSharp.dll
    Trying to load native component LibTorchSharp relative to /home/jovyan/.nuget/packages/torchsharp/0.101.2/lib/net6.0/TorchSharp.dll
    Failed to load native component LibTorchSharp relative to /home/jovyan/.nuget/packages/torchsharp/0.101.2/lib/net6.0/TorchSharp.dll
    Result from regular native load of LibTorchSharp is False

Step 3 - Alternative load from consolidated directory of native binaries from nuget packages

    torchsharpLoc = /home/jovyan/.nuget/packages/torchsharp/0.101.2/lib/net6.0
    packagesDir = /home/jovyan/.nuget/packages
    torchsharpHome = /home/jovyan/.nuget/packages/torchsharp/0.101.2
    Trying dynamic load for .NET/F# Interactive by consolidating native libtorch-cpu-linux-x64-* binaries to /home/jovyan/.nuget/packages/torchsharp/0.101.2/lib/net6.0/cpu...
    Consolidating native binaries, packagesDir=/home/jovyan/.nuget/packages, packagePattern=libtorch-cpu-linux-x64, packageVersion=2.1.0.1 to target=/home/jovyan/.nuget/packages/torchsharp/0.101.2/lib/net6.0/cpu...

   at TorchSharp.torch.LoadNativeBackend(Boolean useCudaBackend, StringBuilder& trace)
   at TorchSharp.torch.InitializeDeviceType(DeviceType deviceType)
   at TorchSharp.torch.InitializeDevice(Device device)
   at TorchSharp.torch..cctor()
   --- End of inner exception stack trace ---
   at TorchSharp.torch.TryInitializeDeviceType(DeviceType deviceType)
   at TorchSharp.torch.cuda.is_available()
   at Submission#5.<<Initialize>>d__0.MoveNext()
--- End of stack trace from previous location ---
   at Microsoft.CodeAnalysis.Scripting.ScriptExecutionState.RunSubmissionsAsync[TResult](ImmutableArray`1 precedingExecutors, Func`2 currentExecutor, StrongBox`1 exceptionHolderOpt, Func`2 catchExceptionOpt, CancellationToken cancellationToken)
   at TorchSharp.torch.TryInitializeDeviceType(DeviceType deviceType)
   at TorchSharp.torch.cuda.is_available()
   at Submission#5.<<Initialize>>d__0.MoveNext()
--- End of stack trace from previous location ---
   at Microsoft.CodeAnalysis.Scripting.ScriptExecutionState.RunSubmissionsAsync[TResult](ImmutableArray`1 precedingExecutors, Func`2 currentExecutor, StrongBox`1 exceptionHolderOpt, Func`2 catchExceptionOpt, CancellationToken cancellationToken)
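
The trace reports useCudaBackend = False and a search for libtorch-cpu-linux-x64 version 2.1.0.1, while the directory listing above shows only libtorch-cuda-12.1-* folders. A small sketch for checking whether a given package and version are present in the cache; the function name is hypothetical, and it assumes the usual cache layout of <cache>/<lowercase id>/<version>/:

```shell
# Succeed if the given package id and version have been restored into the
# NuGet package cache, fail otherwise.
package_restored() {
    local packages_dir="$1" package_id="$2" version="$3"
    local folder
    # Package folders in the cache use the lowercased package id.
    folder="$(printf '%s' "$package_id" | tr '[:upper:]' '[:lower:]')"
    [ -d "$packages_dir/$folder/$version" ]
}
```

For example, package_restored "$HOME/.nuget/packages" libtorch-cpu-linux-x64 2.1.0.1 returns non-zero when the package named in the exception has not been restored.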

tombatron commented 11 months ago

A co-worker of mine (@wss-rbrennan) may have shed some light on this issue:

"The problem has more to do with NuGet itself. TorchSharp used a clever way of putting together the libtorch-cuda-12.1-linux-x64 package because NuGet has a max package size of 250 MB. The workaround combines multiple packages at build time in a project, so your project works, but interactive doesn't build the same way, so the reference fails."

Not sure if this is a problem per se, or just something to account for when using TorchSharp from within interactive mode or whatever?

NiklasGustafsson commented 11 months ago

Thank you for the follow-up, and that's sort of what I was seeing, too. But... it used to work!

The stitching together only happens the first time, i.e. when a build finds that the stitched package is not available in the NuGet cache locally.
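
To make the idea of stitching concrete, here is a generic sketch of splitting a large file into size-limited fragments and concatenating them back. It is an illustration of the general technique only, not TorchSharp's actual build logic, which the thread does not show; the function names and file prefixes are made up:

```shell
# Split a large file into fragments of at most fragment_bytes each.
# split names the pieces <prefix>aa, <prefix>ab, ... in sortable order.
split_into_fragments() {
    local source_file="$1" fragment_bytes="$2" prefix="$3"
    split -b "$fragment_bytes" "$source_file" "$prefix"
}

# Concatenate the fragments back into a single file.
# The glob expands in sorted order, which matches the order split wrote them.
# The output file must not itself match the prefix.
stitch_fragments() {
    local prefix="$1" output_file="$2"
    cat "$prefix"* > "$output_file"
}
```

Comparing the stitched file with the original (for example with cmp) confirms that nothing was lost. A check of that kind, for whether the stitched result already exists, is presumably what lets the step be skipped on later builds.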

tombatron commented 11 months ago

You think there is some sort of snippet that could be run to ensure proper stitching?

NiklasGustafsson commented 11 months ago

And it works on Windows, which has the same package stitching problem.

NiklasGustafsson commented 11 months ago

> You think there is some sort of snippet that could be run to ensure proper stitching?

All I can think of is a dotnet build, but I think you already did that and it worked, so the stitching should already have been done.

NiklasGustafsson commented 11 months ago

Or, maybe... clear the ~/.nuget/packages cache, as well as anything under ~/.packagemanagement/nuget. Then, build your console program again, then try the .ipynb file again. Another shot in the dark...
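
A sketch of that reset, assuming a hypothetical helper that empties each cache directory it is given while leaving the directory itself in place:

```shell
# Empty each cache directory passed as an argument, keeping the directory.
clear_cache_dirs() {
    local dir
    for dir in "$@"; do
        [ -d "$dir" ] || continue   # skip caches that do not exist
        # Remove all entries, including dotfiles, directly under the cache.
        find "$dir" -mindepth 1 -maxdepth 1 -exec rm -rf {} +
    done
}
```

The sequence would then be clear_cache_dirs ~/.nuget/packages ~/.packagemanagement/nuget, followed by dotnet build in the console project, followed by rerunning the notebook.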

NiklasGustafsson commented 11 months ago

Okay, so after a bunch of finagling, I finally got to where you are: no blow-up when loading the backend, but is_available() returns false. It works fine when I run one of the TorchExamples on CUDA, or on Windows, whether interactively or as a console app.