dotnet / TorchSharp

A .NET library that provides access to the library that powers PyTorch.
MIT License

Native runtime in `libtorch-cuda` nuget package does not yet work #143

Closed gbaydin closed 4 years ago

gbaydin commented 4 years ago

Hi, I can successfully run TorchSharp 0.3.52216 with the libtorch-cpu 1.5.0 native runtime package. However, when I use libtorch-cuda-10.2-linux-x64 1.5.0, the code silently crashes when I attempt to run any TorchSharp method.

For example, in F#:

printfn "Start" 
printfn "%A" (Torch.IsCudaAvailable())
let t = FloatTensor.RandomN([|10L|], device="cpu")
printfn "%A" t

works and prints

Start
false
[10], device = cpu

when used with libtorch-cpu 1.5.0. But it silently fails after the first line when used with libtorch-cuda-10.2-linux-x64 1.5.0 with the following output:

Start

Initially, I was testing whether I could create CUDA tensors with TorchSharp, and I encountered this problem. Then I noticed that even a call to Torch.IsCudaAvailable() or the creation of a CPU tensor fails with this runtime. In normal usage we would expect to be able to create both CUDA and CPU tensors with libtorch-cuda-10.2-linux-x64 1.5.0, and the nuget package does seem to include libtorch_cpu.so.

Note that some months ago I could successfully create CUDA tensors with TorchSharp, relying on a manually installed libtorch in my system and setting LD_LIBRARY_PATH to point to the folder holding libtorch.so and the other libtorch library files. Note that this doesn't work now with the latest setup.

I'm running these on Ubuntu 20.04 with .NET Core SDK 3.1.201.

dsyme commented 4 years ago

Note that some months ago I could successfully create CUDA tensors with TorchSharp, relying on a manually installed libtorch in my system and setting LD_LIBRARY_PATH to point to the folder holding libtorch.so and the other libtorch library files. Note that this doesn't work now with the latest setup.

This approach should work if you simply don't reference libtorch-cuda-10.2-linux-x64 1.5.0 - just reference TorchSharp and get the native binaries via LD_LIBRARY_PATH. It would be great if we could confirm this.

gbaydin commented 4 years ago

Ok, I've given it a second look and it does actually run as expected with an external installation of libtorch (without referencing the libtorch-cuda-10.2-linux-x64 nuget package). The problem was that my PyTorch installation was version 1.4.0. When I upgraded to 1.5.0, TorchSharp worked as expected.

The setup working for me is below. This is an F# console program referencing only TorchSharp package version 0.3.52216 and no other packages.

open TorchSharp.Tensor

[<EntryPoint>]
let main argv =
    let a = TorchSharp.Torch.IsCudaAvailable()
    printfn "%A" a
    let t = FloatTensor.RandomN([|10L|], device="cpu")
    printfn "%A" t
    let t2 = FloatTensor.RandomN([|10L|], device="cuda:0")
    printfn "%A" t2
    0 // return an integer exit code

Outputs the following

true
[10], device = cpu
[10], device = cuda

I have LD_LIBRARY_PATH set to include /home/gunes/anaconda3/lib/python3.7/site-packages/torch/lib/ where I have the libtorch files that came with a standard PyTorch 1.5.0 installation through the normal pip install process.

-rwxrwxr-x  1 gunes gunes     225008 Jun  1 21:19 libc10_cuda.so*
-rwxrwxr-x  1 gunes gunes     472728 Jun  1 21:19 libc10.so*
-rwxrwxr-x  1 gunes gunes    1884384 Jun  1 21:19 libcaffe2_detectron_ops_gpu.so*
-rwxrwxr-x  1 gunes gunes      75768 Jun  1 21:19 libcaffe2_module_test_dynamic.so*
-rwxrwxr-x  1 gunes gunes      22016 Jun  1 21:19 libcaffe2_nvrtc.so*
-rwxrwxr-x  1 gunes gunes     118640 Jun  1 21:19 libcaffe2_observers.so*
-rwxrwxr-x  1 gunes gunes     523816 Jun  1 21:19 libcudart-80664282.so.10.2*
-rwxrwxr-x  1 gunes gunes     168720 Jun  1 21:19 libgomp-7c85b1e2.so.1*
-rwxrwxr-x  1 gunes gunes   22045456 Jun  1 21:19 libnvrtc-08c4863f.so.10.2*
-rwxrwxr-x  1 gunes gunes    4862944 Jun  1 21:19 libnvrtc-builtins.so*
-rwxrwxr-x  1 gunes gunes      43520 Jun  1 21:19 libnvToolsExt-3965bdd0.so.1*
-rwxrwxr-x  1 gunes gunes      41592 Jun  1 21:19 libshm.so*
-rwxrwxr-x  1 gunes gunes  267175432 Jun  1 21:19 libtorch_cpu.so*
-rwxrwxr-x  1 gunes gunes 1056836368 Jun  1 21:19 libtorch_cuda.so*
-rwxrwxr-x  1 gunes gunes      16760 Jun  1 21:19 libtorch_global_deps.so*
-rwxrwxr-x  1 gunes gunes   16535688 Jun  1 21:19 libtorch_python.so*
-rwxrwxr-x  1 gunes gunes     116240 Jun  1 21:19 libtorch.so*

I'm sharing the full file list in case it helps with debugging the problem. On this Ubuntu 20.04 system I have CUDA version 10.2.

dsyme commented 4 years ago

That's great. I'm checking with #144 that the CUDA binaries we download pass the TorchSharp tests; I think they will. That would mean the problem is somewhere in the packaging or in how the binaries are placed in the application.

If possible, could you try making an application again that references TorchSharp 0.3.52216 and libtorch-cuda-10.2-linux-x64 1.5.0, then clean and build, and then:

  1. list the contents of the application's native libraries after building, e.g. ConsoleApp7\ConsoleApp7\bin\Debug\netcoreapp3.1\runtimes\linux-x64\native - it should look similar to the list above

  2. do a file comparison between the files in that directory and the files you've got above (they should be identical, with libLibTorchSharp.so added) - see the sketch after this list

  3. the executable bit may not be set, but I understand that doesn't matter. Try setting it with chmod +x ... then re-running the tests

  4. perhaps try moving the *.so files to the root ConsoleApp7\ConsoleApp7\bin\Debug\netcoreapp3.1
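
For step 2, something like this F# sketch could do the name-and-size comparison (just a sketch; the two directory paths are placeholders to substitute with your own):

open System.IO

let dirA = "/home/gunes/anaconda3/lib/python3.7/site-packages/torch/lib"   // pip-installed libtorch
let dirB = "bin/Debug/netcoreapp3.1/runtimes/linux-x64/native"             // app output

// Map of file name -> size in bytes for every *.so* in a directory
let sizes dir =
    Directory.GetFiles(dir, "*.so*")
    |> Seq.map (fun f -> Path.GetFileName f, FileInfo(f).Length)
    |> Map.ofSeq

let a, b = sizes dirA, sizes dirB
for KeyValue (name, sizeA) in a do
    match Map.tryFind name b with
    | Some sizeB when sizeB = sizeA -> printfn "OK        %s (%d bytes)" name sizeA
    | Some sizeB -> printfn "MISMATCH  %s: %d vs %d bytes" name sizeA sizeB
    | None -> printfn "MISSING   %s (not in %s)" name dirB
for KeyValue (name, _) in b do
    if not (Map.containsKey name a) then printfn "EXTRA     %s (only in %s)" name dirB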

If you like I could log on to your machine for a while and poke around.

gbaydin commented 4 years ago

Hi, file list and size comparison below. On the left side are the libtorch files that worked before (the ones listed in my previous message). On the right side are the files under bin/Debug/netcoreapp3.1/runtimes/linux-x64/native resulting from referencing TorchSharp 0.3.52216 and libtorch-cuda-10.2-linux-x64 1.5.0

[Screenshot: Screenshot_2020-06-02_01-33-07 - side-by-side file size comparison]

This is the list for bin/Debug/netcoreapp3.1/runtimes/linux-x64/native

-rwxrw-r-- 1 gunes gunes    225008 May 24 16:18 libc10_cuda.so*
-rwxrw-r-- 1 gunes gunes     35088 May 24 16:18 libc10d_cuda_test.so*
-rwxrw-r-- 1 gunes gunes    472728 May 24 16:18 libc10.so*
-rwxrw-r-- 1 gunes gunes   1884384 May 24 16:18 libcaffe2_detectron_ops_gpu.so*
-rwxrw-r-- 1 gunes gunes     75768 May 24 16:18 libcaffe2_module_test_dynamic.so*
-rwxrw-r-- 1 gunes gunes     22016 May 24 16:18 libcaffe2_nvrtc.so*
-rwxrw-r-- 1 gunes gunes    118640 May 24 16:18 libcaffe2_observers.so*
-rwxrw-r-- 1 gunes gunes    523816 May 24 16:18 libcudart-80664282.so.10.2*
-rwxrw-r-- 1 gunes gunes    346296 May 24 16:18 libfbjni.so*
-rwxrw-r-- 1 gunes gunes    168720 May 24 16:18 libgomp-7c85b1e2.so.1*
-rwxrw-r-- 1 gunes gunes   1416368 May 24 16:19 libLibTorchSharp.so*
-rwxrw-r-- 1 gunes gunes  22045456 May 24 16:18 libnvrtc-08c4863f.so.10.2*
-rwxrw-r-- 1 gunes gunes   4862944 May 24 16:18 libnvrtc-builtins.so*
-rwxrw-r-- 1 gunes gunes     43520 May 24 16:18 libnvToolsExt-3965bdd0.so.1*
-rwxrw-r-- 1 gunes gunes    312352 May 24 16:18 libpytorch_jni.so*
-rwxrw-r-- 1 gunes gunes     41592 May 24 16:18 libshm.so*
-rwxrw-r-- 1 gunes gunes 267175432 May 24 16:18 libtorch_cpu.so*
-rw------- 1 gunes gunes 900000000 May 28 02:00 libtorch_cuda.so
-rwxrw-r-- 1 gunes gunes     16760 May 24 16:18 libtorch_global_deps.so*
-rwxrw-r-- 1 gunes gunes  16535688 May 24 16:18 libtorch_python.so*
-rwxrw-r-- 1 gunes gunes    116240 May 24 16:18 libtorch.so*

In the nuget version there are some extra files (libc10d_cuda_test.so, libfbjni.so, libpytorch_jni.so), and the important-looking giant file libtorch_cuda.so seems somehow "truncated". Perhaps this has something to do with the package parts and fragments I see when I click on "dependencies" here: https://www.nuget.org/packages/libtorch-cuda-10.2-linux-x64/

libtorch-cuda-10.2-linux-x64-part1 (>= 1.5.0)
libtorch-cuda-10.2-linux-x64-part2-fragment1 (>= 1.5.0)
libtorch-cuda-10.2-linux-x64-part2-fragment2 (>= 1.5.0)
libtorch-cuda-10.2-linux-x64-part2-fragment3 (>= 1.5.0)
libtorch-cuda-10.2-linux-x64-part2-primary (>= 1.5.0)
gbaydin commented 4 years ago

The following were probably not very likely to work before fixing the truncated libtorch_cuda.so file, but I tried them anyway:

Setting the file permissions to be the same (-rwxrwxr-x) as the files that worked before didn't work. Copying all the files under bin/Debug/netcoreapp3.1/runtimes/linux-x64/native to bin/Debug/netcoreapp3.1 didn't work. "Didn't work" means the program silently fails in the way described in the first message of this issue.

gbaydin commented 4 years ago

One last thing I tried was to copy the "good" (working) libtorch_cuda.so file (with size 1056836368) to bin/Debug/netcoreapp3.1/runtimes/linux-x64/native and replace the broken libtorch_cuda.so file (with size 900000000).

This does not work if I run the console app with dotnet run (which before execution replaces the libtorch_cuda.so file again with the broken version from the nuget package).

But it does run successfully if I just run the console app executable in the folder bin/Debug/netcoreapp3.1 that was built previously.

dsyme commented 4 years ago

Thanks, yes, this has isolated the problem, I can see what the fix is.

dsyme commented 4 years ago

I can't yet see what went wrong here, though the final size of this binary is definitely wrong, and indicates that one of the packages was missing:

-rw------- 1 gunes gunes 900000000 May 28 02:00 libtorch_cuda.so

It seems the problem must have been in the delivery of packages - perhaps one failed to download but the build continued.

I'll add some checking of hash sums, etc.
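
For example, something along these lines (just a sketch - the expected hash below is a placeholder, not the real hash of the 1.5.0 binary):

open System.IO
open System.Security.Cryptography

// SHA-256 of a file as a lowercase hex string
let sha256OfFile (path: string) =
    use sha = SHA256.Create()
    use stream = File.OpenRead path
    sha.ComputeHash stream |> Array.map (sprintf "%02x") |> String.concat ""

let expectedHash = "0000000000000000000000000000000000000000000000000000000000000000"   // placeholder
let actual = sha256OfFile "runtimes/linux-x64/native/libtorch_cuda.so"
if actual <> expectedHash then
    failwithf "libtorch_cuda.so hash mismatch: expected %s, got %s" expectedHash actual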

When they are delivered to my machine they result in a binary of the correct size (it's not identical to the pip-installed libtorch_cuda.so you listed above, which has a slightly different size).

The reconstituted file on my machine:

$ ls -Flas /c/Users/dsyme/source/repos/ConsoleApp7/ConsoleApp7/bin/Debug/netcoreapp3.1/runtimes/linux-x64/native

1032064 -rw-r--r-- 1 dsyme 1049089 1056832272 Jun  1 19:44 libtorch_cuda.so

The original file I downloaded:

$ ls -Flas /c/GitHub/dsyme/libtorch-cuda-10.2/libtorch-shared-with-deps-1.5.0/libtorch/lib/libtorch_cuda.so

1032064 -rw-r--r-- 1 dsyme 1049089 1056832272 Apr 21 01:32 /c/GitHub/dsyme/libtorch-cuda-10.2/libtorch-shared-with-deps-1.5.0/libtorch/lib/libtorch_cuda.so
pkese commented 4 years ago

I did a git clean -fdx and dotnet build on DiffSharp repo and I'm getting correct size libtorch_cuda.so: 1056832272 (on Linux)

I don't quite know how to test it though. Running tests/Test does not use the GPU and, besides, it hits an out-of-memory error after a few batches.

When I force dsharp.config(backend=Backend.Torch, device=Device.GPU) I get an exception saying CUDA non available in the current machine.

dsyme commented 4 years ago

I did a git clean -fdx and dotnet build on DiffSharp repo and I'm getting correct size libtorch_cuda.so: 1056832272 (on Linux)

Thanks for trying!

The Test.fsproj in dev is not quite right, it currently has this:

<PackageReference Include="libtorch-cpu" Version="$(LibTorchVersion)" />
<PackageReference Include="libtorch-cuda-10.2-linux-x64" Version="$(LibTorchVersion)" />

However only one of these two should be used. We will have to add some kind of protection against referencing both.

CUDA non available in the current machine

I'm presuming this is because libtorch-cpu took precedence. If you have a moment to try removing that and then checking what happens, that would be great.

pkese commented 4 years ago

So I've commented out the 'libtorch-cpu' and it gets a little bit further:

Downloading "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz" to "/home/peterk/work/tmp/DiffSharp/tests/Test/data/mnist/train-images-idx3-ubyte.gz"
Downloading "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz" to "/home/peterk/work/tmp/DiffSharp/tests/Test/data/mnist/train-labels-idx1-ubyte.gz"
Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
   at DiffSharp.Backends.Torch.TorchRawTensor.ToRawData[[System.Single, System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]]()
   at DiffSharp.Backends.Torch.TorchRawTensor.ToRawData()
   at DiffSharp.Backends.Torch.TorchRawTensor.System-Runtime-Serialization-ISerializable-GetObjectData(System.Runtime.Serialization.SerializationInfo, System.Runtime.Serialization.StreamingContext)
   at System.Runtime.Serialization.Formatters.Binary.WriteObjectInfo.InitSerialize(System.Object, System.Runtime.Serialization.ISurrogateSelector, System.Runtime.Serialization.StreamingContext, System.Runtime.Serialization.Formatters.Binary.SerObjectInfoInit, System.Runtime.Serialization.IFormatterConverter, System.Runtime.Serialization.Formatters.Binary.ObjectWriter, System.Runtime.Serialization.SerializationBinder)
   at System.Runtime.Serialization.Formatters.Binary.ObjectWriter.Write(System.Runtime.Serialization.Formatters.Binary.WriteObjectInfo, System.Runtime.Serialization.Formatters.Binary.NameInfo, System.Runtime.Serialization.Formatters.Binary.NameInfo)
   at System.Runtime.Serialization.Formatters.Binary.ObjectWriter.Serialize(System.Object, System.Runtime.Serialization.Formatters.Binary.BinaryFormatterWriter, Boolean)
   at System.Runtime.Serialization.Formatters.Binary.BinaryFormatter.Serialize(System.IO.Stream, System.Object, Boolean)
   at System.Runtime.Serialization.Formatters.Binary.BinaryFormatter.Serialize(System.IO.Stream, System.Object)
   at DiffSharp.Util.saveBinary[[System.__Canon, System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]](System.__Canon, System.String)
   at DiffSharp.Tensor.save(System.String)
   at DiffSharp.DiffSharp.save(DiffSharp.Tensor, System.String)
   at DiffSharp.Data.MNIST..ctor(System.String, Microsoft.FSharp.Core.FSharpOption`1<System.Collections.Generic.IEnumerable`1<System.String>>, Microsoft.FSharp.Core.FSharpOption`1<Boolean>, Microsoft.FSharp.Core.FSharpOption`1<Microsoft.FSharp.Core.FSharpFunc`2<DiffSharp.Tensor,DiffSharp.Tensor>>, Microsoft.FSharp.Core.FSharpOption`1<Microsoft.FSharp.Core.FSharpFunc`2<DiffSharp.Tensor,DiffSharp.Tensor>>)
   at Program.main(System.String[])

Looks like a MNIST loader issue with Torch.

dsyme commented 4 years ago

OK thanks, yes that's getting further. Saving GPU tensors is evidently busted (or perhaps that's by design and it's just not giving a good error message).

Could you send a PR to do the following?

  1. Adjust TorchRawTensor ToRawData to give a good error message for GPU tensors
  2. Adjust tensor.save to always move the tensor to CPU first (double-check that's what PyTorch does) - see the sketch below

If that doesn't unblock things, then try removing the dsharp.save calls in Data.fs (I think they're just there to reduce the cost of making/reloading the data)
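
For item 2, a rough standalone sketch of the intent (the toCpu parameter stands in for whatever move-to-CPU operation the tensor exposes; BinaryFormatter matches the saveBinary path in the stack trace above):

open System.IO
open System.Runtime.Serialization.Formatters.Binary

// Always serialise a CPU copy so saving also works for GPU-resident tensors.
let saveViaCpu (toCpu: 't -> 't) (path: string) (tensor: 't) =
    use stream = File.Create path
    BinaryFormatter().Serialize(stream, box (toCpu tensor))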

dsyme commented 4 years ago

I'm adjusting the TorchSharp packages so that

  1. SHAs get checked after package download
  2. Referencing both libtorch-cpu and libtorch-cuda-* gives this:
Error       Two TorchSharp runtime packages have been referenced (both libtorch-cpu and libtorch-cuda)  ConsoleApp6 C:\Users\dsyme\.nuget\packages\libtorch-cpu\1.5.3\buildTransitive\netstandard2.0\libtorch-cpu.targets   6   
pkese commented 4 years ago

I'll look into the ToRawData thing.

In the meantime, I've tried replacing libtorch-cpu with libtorch-cuda-10.2-linux-x64 in the normal DiffSharp.Tests project and added dsharp.config(backend=Backend.Torch, device=Device.GPU) to the test fixture, and I'm getting:

The active test run was aborted. Reason: Test host process crashed : terminate called after throwing an instance of 'c10::Error'
  what():  Expected one of cpu, cuda, mkldnn, opengl, opencl, ideep, hip, msnpu device type at start of device string: gpu (parse_type at /pytorch/c10/core/Device.cpp:37)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7fe315030536 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libc10.so)
frame #1: <unknown function> + 0x1a060 (0x7fe31501d060 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libc10.so)
frame #2: c10::Device::Device(std::string const&) + 0x1e4 (0x7fe31501d4c4 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libc10.so)
frame #3: <unknown function> + 0xd00f3 (0x7fe3155410f3 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #4: THSTensor_ones + 0x91 (0x7fe315502e41 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #5: [0x7fe319bf120c]

Could it be that the 'cuda' string was expected rather than 'gpu'?

BTW, I'm rather new to both DiffSharp and Torch (this is my first test), so take my reports with a grain of salt. I normally use TensorFlow for machine learning.

dsyme commented 4 years ago

Yup, well, you're coming into quite a raw branch :-) We're in the middle of getting this to boot up :)

Change this:

    | Device.GPU -> "gpu"

to

    | Device.GPU -> "cuda"

thanks

pkese commented 4 years ago

So I did that "gpu" -> "cuda" change and the error is indeed different:

  X TestCurl [17ms]
  Error Message:
   System.Runtime.InteropServices.ExternalException : Expected object of device type cuda but got device type cpu for argument #3 'index' in call to _th_index_select (checked_dense_tensor_unwrap at /pytorch/aten/src/ATen/Utils.h:72)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7fb9548fc536 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libc10.so)
frame #1: <unknown function> + 0x1013b1b (0x7fb8f1deeb1b in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cuda.so)
frame #2: <unknown function> + 0x10493d7 (0x7fb8f1e243d7 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cuda.so)
frame #3: <unknown function> + 0xf96a7b (0x7fb8f1d71a7b in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cuda.so)
frame #4: <unknown function> + 0x10c5c23 (0x7fb92e329c23 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #5: <unknown function> + 0x2b4b952 (0x7fb92fdaf952 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #6: <unknown function> + 0x10c5c23 (0x7fb92e329c23 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #7: at::Tensor c10::KernelFunction::callUnboxed<at::Tensor, at::Tensor const&, long, at::Tensor const&>(c10::OperatorHandle const&, at::Tensor const&, long, at::Tensor const&) const + 0x14d (0x7fb954dfe2ad in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #8: at::Tensor c10::Dispatcher::callUnboxed<at::Tensor, at::Tensor const&, long, at::Tensor const&>(c10::OperatorHandle const&, at::Tensor const&, long, at::Tensor const&) const + 0xf6 (0x7fb954dfe136 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #9: THSTensor_index_select + 0x87 (0x7fb954dd7a67 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #10: [0x7fb9a152a509]

  Stack Trace:
     at TorchSharp.Torch.CheckForErrors()
   at TorchSharp.Tensor.TorchTensor.IndexSelect(Int64 dimension, TorchTensor index)
   at DiffSharp.Backends.Torch.TorchRawTensor.GetSlice(Int32[,] fullBounds) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Backends.Torch/Torch.RawTensor.fs:line 71
   at DiffSharp.Tensor.GetSlice(Int32[,] bounds) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 375
   at DiffSharp.Tensor.GetSlice(Int32[,] bounds) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 377
   at DiffSharp.Tensor.get_Item(Int32[] index) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 384
   at Tests.TestDiffSharp.fvect3vect3(Tensor x) in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/TestDiffSharp.fs:line 44
   at <StartupCode$DiffSharp-Tests>.$TestDiffSharp.TestCurl@519.Invoke(Tensor x) in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/TestDiffSharp.fs:line 519
   at DiffSharp.DiffSharp.evalReverseDiff(FSharpFunc`2 f, Tensor x) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/DiffSharp.fs:line 250
   at DiffSharp.DiffSharp.fjacobian(FSharpFunc`2 f, Tensor x) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/DiffSharp.fs:line 298
   at DiffSharp.DiffSharp.fcurl(FSharpFunc`2 f, Tensor x) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/DiffSharp.fs:line 341
   at Tests.TestDiffSharp.TestCurl() in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/TestDiffSharp.fs:line 519
dsyme commented 4 years ago

Cool that's getting further. How many tests passed? (if any)

(nb. without IndexSelect working on GPU tensors I wouldn't expect many)

BTW good to see the reasonably useful error stacks coming out, that's also encouraging

pkese commented 4 years ago

Well, I only configured the GPU in the TestDiffSharp.fs file, which contains 26 tests.

What I'm getting is 12 failed and 83 passing tests.

I'll try other test files as well.

dsyme commented 4 years ago

Cool thanks. 83 is pretty good for a first run. TestTensor.fs will likely contain some failures.

dsyme commented 4 years ago

Feel free to send the full lists of passing/failing tests, thanks

dsyme commented 4 years ago

Ah the problem is here in IndexSelect in the Torch backend:

            let idxs = LongTensor.Arange(int64 start, int64 stop, 1L)

This is creating a CPU tensor when it should be creating one with the same characteristics as the input.

I'll prep a fix, paste it here and start a PR

pkese commented 4 years ago

I've added [<SetUp>] to all tests to configure the GPU, and the failing tests are:

TestDerivativeGather  
TestCurl  
TestCurlDivergence  
TestDivergence  
TestGrad  
TestGradhessian  
TestGradhessianv  
TestGradv  
TestHessian  
TestHessianv  
TestJacobian  

They all report the same exception (quoted above).

dsyme commented 4 years ago

Cool the fix should just be this:

            let idxs = LongTensor.Arange(int64 start, int64 stop, 1L, device=toTorchDevice t.Device)
pkese commented 4 years ago

Applied the change and now I'm getting

The active test run was aborted. Reason: Test host process crashed : Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
   at TorchSharp.Tensor.TorchTensor.DataItem[[System.Int32, System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]]()
   at DiffSharp.Backends.Torch.TorchRawTensor.GetItem(Int32[])
   at DiffSharp.Backends.Torch.TorchRawTensor.ToValuesTyped[[System.Int32, System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[System.Int32, System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]](Microsoft.FSharp.Core.FSharpFunc`2<Int32,Int32>)
   at DiffSharp.Backends.Torch.TorchRawTensor.ToValues()
   at DiffSharp.Backends.RawTensor.ToScalar()
   at DiffSharp.Tensor.toScalar()
   at <StartupCode$DiffSharp-Core>.$Tensor+push@1760.Invoke(Microsoft.FSharp.Collections.FSharpList`1<System.Tuple`2<DiffSharp.Tensor,DiffSharp.Tensor>>)
   at DiffSharp.Tensor.reversePush(DiffSharp.Tensor)
   at DiffSharp.Tensor.reverse(Microsoft.FSharp.Core.FSharpOption`1<DiffSharp.Tensor>, Microsoft.FSharp.Core.FSharpOption`1<Boolean>)
   at Tests.TestDerivatives.TestDerivativeGather()
   at System.RuntimeMethodHandle.InvokeMethod(System.Object, System.Object[], System.Signature, Boolean, Boolean)
   at System.Reflection.RuntimeMethodInfo.Invoke(System.Object, System.Reflection.BindingFlags, System.Reflection.Binder, System.Object[], System.Globalization.CultureInfo)
dsyme commented 4 years ago

OK, yes, this is DataItem on a GPU tensor again.

Collected fixes are here: https://github.com/DiffSharp/DiffSharp/pull/119 - thanks. I think it should include fixes for all of the above.

dsyme commented 4 years ago

@pkese I have merged those fixes to dev if you want to pull and give it another crack.

I'll also add a bug to TorchSharp about TorchSharp.Tensor.TorchTensor.DataItem giving a hard crash when used on a GPU tensor - it should at least give a decent exception.

pkese commented 4 years ago

After applying DiffSharp#119, many more tests pass:

Total tests: 266
     Passed: 241
     Failed: 25

There are two common error types:

  X TestGrad [102ms]
  Error Message:
     Expected: <Tensor [-149.000000, 50.000000]>
  But was:  <Tensor [0.000000, 0.000000]>

  Stack Trace:
     at Tests.TestDiffSharp.TestGrad() in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/TestDiffSharp.fs:line 192

  X TestGradhessian [104ms]
  Error Message:
     Expected: <Tensor [[1702.000000, -600.000000],
 [-600.000000, 200.000000]]>
  But was:  <Tensor [[0.000000, 0.000000],
 [0.000000, 0.000000]]>

  Stack Trace:
     at Tests.TestDiffSharp.TestGradhessian() in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/TestDiffSharp.fs:line 435

  X TestGradhessianv [34ms]
  Error Message:
     Expected: <Tensor [2051.000000, -700.000000]>
  But was:  <Tensor [0.000000, 0.000000]>

  Stack Trace:
     at Tests.TestDiffSharp.TestGradhessianv() in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/TestDiffSharp.fs:line 408

  X TestHessian [97ms]
  Error Message:
     Expected: <Tensor [[1702.000000, -600.000000],
 [-600.000000, 200.000000]]>
  But was:  <Tensor [[0.000000, 0.000000],
 [0.000000, 0.000000]]>

  Stack Trace:
     at Tests.TestDiffSharp.TestHessian() in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/TestDiffSharp.fs:line 492

  X TestHessianv [56ms]
  Error Message:
     Expected: <Tensor [2051.000000, -700.000000]>
  But was:  <Tensor [0.000000, 0.000000]>

  Stack Trace:
     at Tests.TestDiffSharp.TestHessianv() in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/TestDiffSharp.fs:line 466

  X TestJacobian [100ms]
  Error Message:
     Expected: <Tensor [[1.000000, 4.000000, 2.000000],
 [4.000000, 0.000000, 0.000000]]>
  But was:  <Tensor [[0.000000, 0.000000, 0.000000],
 [0.000000, 0.000000, 0.000000]]>

  Stack Trace:
     at Tests.TestDiffSharp.TestJacobian() in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/TestDiffSharp.fs:line 334

  X TestJacobianTv [22ms]
  Error Message:
     Expected: <Tensor [-124.375000, -136.875000, -51.875000]>
  But was:  <Tensor [0.000000, 0.000000, 0.000000]>

  Stack Trace:
     at Tests.TestDiffSharp.TestJacobianTv() in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/TestDiffSharp.fs:line 290

  X TestLaplacian [54ms]
  Error Message:
     Expected: <Tensor 1902.000000>
  But was:  <Tensor 0.000000>

  Stack Trace:
     at Tests.TestDiffSharp.TestLaplacian() in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/TestDiffSharp.fs:line 512

and

  X TestOne [7ms]
  Error Message:
   System.Runtime.InteropServices.ExternalException : Expected object of device type cuda but got device type cpu for argument #2 'other' in call to _th_equal (checked_dense_tensor_unwrap at /pytorch/aten/src/ATen/Utils.h:72)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7fbb55af5536 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libc10.so)
frame #1: <unknown function> + 0x1013b1b (0x7fbaa5deeb1b in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cuda.so)
frame #2: <unknown function> + 0x1044cd2 (0x7fbaa5e1fcd2 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cuda.so)
frame #3: <unknown function> + 0xf79e80 (0x7fbaa5d54e80 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cuda.so)
frame #4: <unknown function> + 0x2b2b34c (0x7fbae3d8f34c in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #5: bool c10::Dispatcher::callUnboxedWithDispatchKey<bool, at::Tensor const&, at::Tensor const&>(c10::OperatorHandle const&, c10::DispatchKey, at::Tensor const&, at::Tensor const&) const + 0x181 (0x7fbb55ffd291 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #6: THSTensor_equal + 0x4c (0x7fbb55fdc87c in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #7: [0x7fbb5b08e843]

  Stack Trace:
     at TorchSharp.Torch.CheckForErrors()
   at TorchSharp.Tensor.TorchTensor.Equal(TorchTensor target)
   at DiffSharp.Backends.Torch.TorchRawTensor.Equals(RawTensor t2) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Backends.Torch/Torch.RawTensor.fs:line 311
   at DiffSharp.Tensor.Equals(Object other) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 220
   at NUnit.Framework.Constraints.NUnitEqualityComparer.AreEqual(Object x, Object y, Tolerance& tolerance, Boolean topLevelComparison)
   at NUnit.Framework.Constraints.EqualConstraint.ApplyTo[TActual](TActual actual)
   at NUnit.Framework.Assert.That[TActual](TActual actual, IResolveConstraint expression, String message, Object[] args)
   at Tests.TestDiffSharp.TestOne() in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/TestDiffSharp.fs:line 77

The last one appears in TestOne, TestZero, TestModelClone, TestModelLinear, TestModelParametersDiff, TestModelSaveLoad, TestModelSaveLoadParameters and some Optimizer tests.

pkese commented 4 years ago

...I'm not sure if some of these errors are appearing because I forced
dsharp.config(backend=Backend.Torch, device=Device.GPU)
in all test Setups.

If I remove that then all tests pass, but apparently GPU is not being used.

pkese commented 4 years ago

Even the tests/Test now starts.

After a while it reports OOM:

net params: 1199882
Torch
Duration   |Iters| Ep|  Minib| Loss
0.00:00:03 |   1 | 1 | 1/937 | 2.316844e+000 🡾 New min
Unhandled exception. System.Runtime.InteropServices.ExternalException (0x80004005): CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 5.93 GiB total capacity; 4.86 GiB already allocated; 192.00 KiB free; 5.04 GiB reserved in total by PyTorch) (malloc at /pytorch/c10/cuda/CUDACachingAllocator.cpp:289)

but at least it is consistent with the CPU version which also runs out of memory.

pkese commented 4 years ago

It appears that there are several cases that are missing proper device conversions. Above it was _th_equal, but there's also

_th_mm:

  X TestModelParametersDiff [7ms]
  Error Message:
   System.Runtime.InteropServices.ExternalException : Expected object of device type cuda but got device type cpu for argument #1 'self' in call to _th_mm (checked_dense_tensor_unwrap at /pytorch/aten/src/ATen/Utils.h:72)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7fbb55af5536 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libc10.so)
frame #1: <unknown function> + 0x1013b1b (0x7fbaa5deeb1b in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cuda.so)
frame #2: <unknown function> + 0x10539b9 (0x7fbaa5e2e9b9 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cuda.so)
frame #3: <unknown function> + 0xf76dc8 (0x7fbaa5d51dc8 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cuda.so)
frame #4: <unknown function> + 0x10c3ec0 (0x7fbae2327ec0 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #5: <unknown function> + 0x2c9b6fe (0x7fbae3eff6fe in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #6: <unknown function> + 0x10c3ec0 (0x7fbae2327ec0 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #7: at::Tensor c10::Dispatcher::callUnboxedWithDispatchKey<at::Tensor, at::Tensor const&, at::Tensor const&>(c10::OperatorHandle const&, c10::DispatchKey, at::Tensor const&, at::Tensor const&) const + 0x17c (0x7fbb55f74a6c in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #8: at::Tensor::mm(at::Tensor const&) const + 0xa2 (0x7fbb55fefed2 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #9: THSTensor_mm + 0x5d (0x7fbb55fe0b7d in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #10: [0x7fbb5b813261]

  Stack Trace:
     at TorchSharp.Torch.CheckForErrors()
   at TorchSharp.Tensor.TorchTensor.Mm(TorchTensor target)
   at DiffSharp.Backends.Torch.TorchRawTensor.MatMulT2T2(RawTensor t2) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Backends.Torch/Torch.RawTensor.fs:line 483
   at <StartupCode$DiffSharp-Core>.$Tensor.fRaw@757-17.Invoke(Tuple`2 tupledArg) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 757
   at DiffSharp.Tensor.matmul(Tensor b) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 765
   at <StartupCode$DiffSharp-Core>.$Tensor.push@1760.Invoke(FSharpList`1 ts) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 1811
   at DiffSharp.Tensor.reversePush(Tensor value) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 1918
   at DiffSharp.Tensor.reverse(FSharpOption`1 value, FSharpOption`1 zeroDerivatives) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 1639
   at DiffSharp.DiffSharp.reverse(Tensor value, Tensor tensor) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/DiffSharp.fs:line 246
   at <StartupCode$DiffSharp-Core>.$DiffSharp.r@251.Invoke(Tensor v) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/DiffSharp.fs:line 251
   at DiffSharp.DiffSharp.fgrad(FSharpFunc`2 f, Tensor x) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/DiffSharp.fs:line 311
   at DiffSharp.DiffSharp.grad(FSharpFunc`2 f, Tensor x) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/DiffSharp.fs:line 312
   at Tests.TestModel.TestModelParametersDiff() in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/TestModel.fs:line 148

and multiple cases where TorchSharp.Tensor.TorchTensor.Mul fails:

  X TestModelLinear [6ms]
  Error Message:
   System.Runtime.InteropServices.ExternalException : expected device cpu but got device cuda:0 (compute_types at /pytorch/aten/src/ATen/native/TensorIterator.cpp:246)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7fbb55af5536 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libc10.so)
frame #1: at::TensorIterator::compute_types() + 0x17d4 (0x7fbae209fc74 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #2: at::TensorIterator::build() + 0x44 (0x7fbae20a1b64 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #3: at::TensorIterator::binary_op(at::Tensor&, at::Tensor const&, at::Tensor const&, bool) + 0x146 (0x7fbae20a2216 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #4: at::native::mul(at::Tensor const&, at::Tensor const&) + 0x3a (0x7fbae1dc1eba in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #5: <unknown function> + 0xf76ef8 (0x7fbaa5d51ef8 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cuda.so)
frame #6: <unknown function> + 0x10c3ec0 (0x7fbae2327ec0 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #7: <unknown function> + 0x2d2e779 (0x7fbae3f92779 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #8: <unknown function> + 0x10c3ec0 (0x7fbae2327ec0 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #9: at::Tensor c10::Dispatcher::callUnboxedWithDispatchKey<at::Tensor, at::Tensor const&, at::Tensor const&>(c10::OperatorHandle const&, c10::DispatchKey, at::Tensor const&, at::Tensor const&) const + 0x17c (0x7fbb55f74a6c in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #10: at::Tensor::mul(at::Tensor const&) const + 0xa2 (0x7fbb55fefff2 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #11: THSTensor_mul + 0x5d (0x7fbb55fe0dbd in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #12: [0x7fbb5b7cc071]

  Stack Trace:
     at TorchSharp.Torch.CheckForErrors()
   at TorchSharp.Tensor.TorchTensor.Mul(TorchTensor target)
   at DiffSharp.Backends.Torch.TorchRawTensor.MulTT(RawTensor t2) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Backends.Torch/Torch.RawTensor.fs:line 425
   at <StartupCode$DiffSharp-Core>.$Tensor.fRaw@615-8.Invoke(Tuple`2 tupledArg) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 615
   at DiffSharp.Tensor.op_Multiply(Tensor a, Tensor b) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 623
   at <StartupCode$DiffSharp-Core>.$Tensor.push@1760.Invoke(FSharpList`1 ts) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 1788
   at DiffSharp.Tensor.reversePush(Tensor value) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 1918
   at DiffSharp.Tensor.reverse(FSharpOption`1 value, FSharpOption`1 zeroDerivatives) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 1639
   at DiffSharp.DiffSharp.reverse(Tensor value, Tensor tensor) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/DiffSharp.fs:line 246
   at <StartupCode$DiffSharp-Core>.$DiffSharp.r@251.Invoke(Tensor v) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/DiffSharp.fs:line 251
   at DiffSharp.DiffSharp.fgrad(FSharpFunc`2 f, Tensor x) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/DiffSharp.fs:line 311
   at DiffSharp.DiffSharp.grad(FSharpFunc`2 f, Tensor x) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/DiffSharp.fs:line 312
   at Tests.TestModel.TestModelLinear() in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/TestModel.fs:line 258
dsyme commented 4 years ago

Cool thank you!

dsyme commented 4 years ago

I think https://github.com/DiffSharp/DiffSharp/pull/120 should address half of those. Possibly more.

It's hard to tell why the grad/Jacobian tests etc. are failing by just returning zeros - they exercise quite a lot of functionality. It could plausibly be the same root cause, though I expect one or two more glitches.

...I'm not sure if some of these errors are appearing because I forced dsharp.config(backend=Backend.Torch, device=Device.GPU) in all test Setups.

It is quite a stress test! I was planning on gingerly turning things on test by test but this is much more effective at flushing out bugs :-)

dsyme commented 4 years ago

I merged https://github.com/DiffSharp/DiffSharp/pull/120 if you want to try it again, thanks

pkese commented 4 years ago

Yay, you're now at

Total tests: 266
     Passed: 263
     Failed: 3

All that remains are 3 identical cases of:

  X TestModelClone [32ms]
  Error Message:
   System.Runtime.InteropServices.ExternalException : Expected object of device type cuda but got device type cpu for argument #2 'other' in call to _th_equal (checked_dense_tensor_unwrap at /pytorch/aten/src/ATen/Utils.h:72)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7f4ec46c5536 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libc10.so)
frame #1: <unknown function> + 0x1013b1b (0x7f4e15deeb1b in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cuda.so)
frame #2: <unknown function> + 0x1044cd2 (0x7f4e15e1fcd2 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cuda.so)
frame #3: <unknown function> + 0xf79e80 (0x7f4e15d54e80 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cuda.so)
frame #4: <unknown function> + 0x2b2b34c (0x7f4e53d8f34c in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #5: bool c10::Dispatcher::callUnboxedWithDispatchKey<bool, at::Tensor const&, at::Tensor const&>(c10::OperatorHandle const&, c10::DispatchKey, at::Tensor const&, at::Tensor const&) const + 0x181 (0x7f4ec4bcd291 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #6: THSTensor_equal + 0x4c (0x7f4ec4bac87c in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #7: [0x7f4ec9c8e843]

  Stack Trace:
     at TorchSharp.Torch.CheckForErrors()
   at TorchSharp.Tensor.TorchTensor.Equal(TorchTensor target)
   at DiffSharp.Backends.Torch.TorchRawTensor.Equals(RawTensor t2) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Backends.Torch/Torch.RawTensor.fs:line 311
   at DiffSharp.Tensor.Equals(Object other) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 220
   at NUnit.Framework.Constraints.NUnitEqualityComparer.AreEqual(Object x, Object y, Tolerance& tolerance, Boolean topLevelComparison)
   at NUnit.Framework.Constraints.EqualConstraint.ApplyTo[TActual](TActual actual)
   at NUnit.Framework.Assert.That[TActual](TActual actual, IResolveConstraint expression, String message, Object[] args)
   at Tests.TestModel.TestModelClone() in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/TestModel.fs:line 241
pkese commented 4 years ago

Interesting... After replacing TorchRawTensor.Equals with

    override t.Equals(t2:RawTensor) : bool = 
        if dtype = t2.Dtype then
            let r1 = (shape = t2.Shape)
            if not r1 then false else
            let tt2 = t2.MoveTo(device).TorchTensor
            let r2 = t.MoveTo(device).TorchTensor.Equal(tt2)
            r2
        else 
            opNotSupported2 "Equals" dtype t2.Dtype

(I've moved both sides to the device)

...I'm now getting 3 different test failures:

  X TestModelClone [31ms]
  Error Message:
   System.Runtime.InteropServices.ExternalException : Expected object of device type cuda but got device type cpu for argument #2 'mat2' in call to _th_mm (checked_dense_tensor_unwrap at /pytorch/aten/src/ATen/Utils.h:72)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7f5e8a344536 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libc10.so)
frame #1: <unknown function> + 0x1013b1b (0x7f5de1deeb1b in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cuda.so)
frame #2: <unknown function> + 0x10539df (0x7f5de1e2e9df in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cuda.so)
frame #3: <unknown function> + 0xf76dc8 (0x7f5de1d51dc8 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cuda.so)
frame #4: <unknown function> + 0x10c3ec0 (0x7f5e1e327ec0 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #5: <unknown function> + 0x2c9b6fe (0x7f5e1feff6fe in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #6: <unknown function> + 0x10c3ec0 (0x7f5e1e327ec0 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #7: at::Tensor c10::Dispatcher::callUnboxedWithDispatchKey<at::Tensor, at::Tensor const&, at::Tensor const&>(c10::OperatorHandle const&, c10::DispatchKey, at::Tensor const&, at::Tensor const&) const + 0x17c (0x7f5e8a7c3a6c in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #8: at::Tensor::mm(at::Tensor const&) const + 0xa2 (0x7f5e8a83eed2 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #9: THSTensor_mm + 0x5d (0x7f5e8a82fb7d in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #10: [0x7f5e8ec4b287]

  Stack Trace:
     at TorchSharp.Torch.CheckForErrors()
   at TorchSharp.Tensor.TorchTensor.Mm(TorchTensor target)
   at DiffSharp.Backends.Torch.TorchRawTensor.MatMulT2T2(RawTensor t2) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Backends.Torch/Torch.RawTensor.fs:line 483
   at <StartupCode$DiffSharp-Core>.$Tensor.fRaw@757-17.Invoke(Tuple`2 tupledArg) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 757
   at DiffSharp.Tensor.matmul(Tensor b) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 765
   at DiffSharp.DiffSharp.matmul(Tensor a, Tensor b) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/DiffSharp.fs:line 97
   at DiffSharp.Model.Linear.forward(Tensor value) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Model.fs:line 145
   at Tests.ModelStyle1a.forward(Tensor x) in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/TestModel.fs:line 15
   at DiffSharp.Model.Model.op_MinusMinusGreater(Tensor t, Model m) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Model.fs:line 115
   at Tests.TestModel.TestModelClone() in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/TestModel.fs:line 243

All are in TorchSharp.Tensor.TorchTensor.Mm

pkese commented 4 years ago

... and then after changing

override t1.MatMulT2T2(t2) = 
        match dtype with 
        | Dtype.Bool -> opNotSupported2 "MatMulT2T2" t1.Dtype t2.Dtype
        | _ ->  
        Shape.checkCanMatmul t1.Shape t2.Shape
        let tt' = Utils.torchMoveTo tt device
        let result = tt'.Mm(t2.MoveTo(device).TorchTensor)
        t1.MakeLike(result, [| t1.Shape.[0]; t2.Shape.[1] |])

it's

  X TestModelClone [32ms]
  Error Message:
   System.Runtime.InteropServices.ExternalException : expected device cuda:0 but got device cpu (compute_types at /pytorch/aten/src/ATen/native/TensorIterator.cpp:246)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7f81ed94a536 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libc10.so)
frame #1: at::TensorIterator::compute_types() + 0x17d4 (0x7f817a09fc74 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #2: at::TensorIterator::build() + 0x44 (0x7f817a0a1b64 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #3: at::TensorIterator::binary_op(at::Tensor&, at::Tensor const&, at::Tensor const&, bool) + 0x146 (0x7f817a0a2216 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #4: at::native::add(at::Tensor const&, at::Tensor const&, c10::Scalar) + 0x45 (0x7f8179dc10a5 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #5: <unknown function> + 0xf74c65 (0x7f813dd4fc65 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cuda.so)
frame #6: <unknown function> + 0x10c599b (0x7f817a32999b in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #7: <unknown function> + 0x2c0c428 (0x7f817be70428 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #8: <unknown function> + 0x10c599b (0x7f817a32999b in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #9: at::Tensor c10::KernelFunction::callUnboxed<at::Tensor, at::Tensor const&, at::Tensor const&, c10::Scalar>(c10::OperatorHandle const&, at::Tensor const&, at::Tensor const&, c10::Scalar) const + 0x134 (0x7f81eddc8b24 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #10: at::Tensor c10::Dispatcher::callUnboxed<at::Tensor, at::Tensor const&, at::Tensor const&, c10::Scalar>(c10::OperatorHandle const&, at::Tensor const&, at::Tensor const&, c10::Scalar) const + 0x12c (0x7f81eddc89bc in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #11: THSTensor_add + 0xae (0x7f81ede2708e in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #12: [0x7f81f332381a]

  Stack Trace:
     at TorchSharp.Torch.CheckForErrors()
   at TorchSharp.Tensor.TorchTensor.Add(TorchTensor target, Scalar alpha)
   at TorchSharp.Tensor.TorchTensor.Add(TorchTensor target)
   at DiffSharp.Backends.Torch.TorchRawTensor.AddT2T1(RawTensor t2) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Backends.Torch/Torch.RawTensor.fs:line 389
   at <StartupCode$DiffSharp-Core>.$Tensor.fRaw@529-3.Invoke(Tuple`2 tupledArg) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 529
   at DiffSharp.Tensor.op_Addition(Tensor a, Tensor b) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 537
   at DiffSharp.Model.Linear.forward(Tensor value) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Model.fs:line 146
   at Tests.ModelStyle1a.forward(Tensor x) in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/TestModel.fs:line 15
   at DiffSharp.Model.Model.op_MinusMinusGreater(Tensor t, Model m) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Model.fs:line 115
   at Tests.TestModel.TestModelClone() in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/TestModel.fs:line 243

...turtles all the way down.

I'm wondering if there's a more generic approach, rather than converting tensors to the correct device before each invocation.
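
Something along these lines, just as a sketch of the idea (the deviceOf and moveTo parameters are placeholders for whatever the backend exposes, e.g. the MoveTo calls above):

// Centralise the "move the second operand to the first operand's device" step in one
// combinator and route every binary op through it, instead of patching each op separately.
let binaryOnSameDevice (deviceOf: 't -> string) (moveTo: string -> 't -> 't)
                       (op: 't -> 't -> 'r) (a: 't) (b: 't) : 'r =
    let b' = if deviceOf b = deviceOf a then b else moveTo (deviceOf a) b
    op a b'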

dsyme commented 4 years ago

Yay, you're now at Total tests: 266 Passed: 263 Failed: 3

That's great!

So which tests fail? I only see TestModelClone above

pkese commented 4 years ago

TestModelClone
TestModelSaveLoad
TestModelSaveLoadParameters

...all with the same error in TorchSharp.Tensor.TorchTensor.Add...

...or TorchSharp.Tensor.TorchTensor.Mm or TorchSharp.Tensor.TorchTensor.Equal

pkese commented 4 years ago

One way to specify whether to use the GPU is to set the CUDA_VISIBLE_DEVICES environment variable. Apparently PyTorch (including the copy packaged with TorchSharp) respects this variable.

So if I set CUDA_VISIBLE_DEVICES=-1 (-1 is a way to disable CUDA), then DiffSharp says Unhandled exception. System.InvalidOperationException: CUDA non available in the current machine. Otherwise, when CUDA_VISIBLE_DEVICES=0 or omitted, it takes the default GPU and works as expected.

We could detect when CUDA_VISIBLE_DEVICES is set to -1 and default to the CPU backend even if libtorch-cuda-10.2-linux-x64 is installed.

So the possible logic would be:
1) libtorch-cuda installed
   1.1) CUDA_VISIBLE_DEVICES >= 0 -> use GPU
   1.2) CUDA_VISIBLE_DEVICES < 0 -> use CPU
2) libtorch-cuda not installed -> use CPU
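
A minimal F# sketch of that logic (not DiffSharp/TorchSharp code - it assumes TorchSharp.Torch.IsCudaAvailable() as the capability check and treats any negative CUDA_VISIBLE_DEVICES value as "CUDA disabled"):

open System

let defaultDevice () =
    let cudaDisabled =
        match Environment.GetEnvironmentVariable "CUDA_VISIBLE_DEVICES" with
        | null | "" -> false
        | v ->
            match Int32.TryParse v with
            | true, n -> n < 0   // e.g. CUDA_VISIBLE_DEVICES=-1 disables CUDA
            | _ -> false
    if not cudaDisabled && TorchSharp.Torch.IsCudaAvailable() then "cuda:0" else "cpu"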

This could be a way to automate tests (e.g. run them twice, once with and once without CUDA).

dsyme commented 4 years ago

TestModelClone

I see the problem. The save/load implied by the clone has, I think, moved everything to be CPU tensors. I believe https://github.com/DiffSharp/DiffSharp/pull/121 fixes it

One way for specifying whether to use GPU or not is to set CUDA_VISIBLE_DEVICES environment variable....

For defaults we should follow the PyTorch behaviour, yes.

Just to check: in PyTorch, when libtorch_cuda is present and CUDA_VISIBLE_DEVICES >= 0, does torch.tensor([0,1,2,3]) create a CPU or a GPU tensor by default (with no explicit configuration)?

For in-repo testing I guess we should have Combos respect at least CUDA_VISIBLE_DEVICES. It's a little hard to have DiffSharp.Tests reference both libtorch-cpu and libtorch-cuda unfortunately; the test project really needs to reference one or the other. I guess we could have two test .fsproj files - one referencing libtorch-cpu and the other libtorch-cuda - and then detect IsCudaAvailable() in Combos.
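
As a sketch, the detection part could be something like this in the CUDA-referencing test project (assuming the Device type used with dsharp.config above is in scope):

open DiffSharp   // assumption: Device comes from here, as in the test files above

// Pick the device for Combos from runtime capability rather than hard-coding it.
let comboDevice =
    if TorchSharp.Torch.IsCudaAvailable() then Device.GPU else Device.CPU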

For CI testing it will depend on us getting GPU machines in CI devops.

dsyme commented 4 years ago

Let's continue this in the DiffSharp repo https://github.com/DiffSharp/DiffSharp/issues/122

dsyme commented 4 years ago

Closing as we've established the TorchSharp packages are working