Closed: gbaydin closed this issue 4 years ago.
Note that some months ago I could successfully create CUDA tensors with TorchSharp, relying on a manually installed libtorch in my system and setting LD_LIBRARY_PATH to point to the folder holding libtorch.so and the other libtorch library files. Note that this doesn't work now with the latest setup.
This approach should work if you simply don't reference libtorch-cuda-10.2-linux-x64 1.5.0 - just reference TorchSharp and get the native binaries via LD_LIBRARY_PATH. It would be great if you could confirm this.
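For anyone following along, the LD_LIBRARY_PATH approach described here can be sketched in shell like this (the libtorch path is illustrative only - substitute your own PyTorch/libtorch lib directory):

```shell
# Sketch of the external-libtorch approach (no libtorch-cuda nuget reference).
# The path below is an example, not a required location.
TORCH_LIB="$HOME/anaconda3/lib/python3.7/site-packages/torch/lib"
export LD_LIBRARY_PATH="$TORCH_LIB${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
# dotnet run   # then launch the app; the runtime resolves libtorch.so from TORCH_LIB
```

The `${VAR:+...}` expansion just avoids a trailing colon when LD_LIBRARY_PATH was previously unset.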
OK, I've given it a second look and it does actually run as expected with an external installation of libtorch (without referencing the libtorch-cuda-10.2-linux-x64 nuget package). The problem was that my PyTorch installation was version 1.4.0. When I upgraded to 1.5.0, TorchSharp worked as expected.
The setup working for me is below. This is an F# console program referencing only the TorchSharp package, version 0.3.52216, and no other packages.
```fsharp
open TorchSharp.Tensor

[<EntryPoint>]
let main argv =
    let a = TorchSharp.Torch.IsCudaAvailable()
    printfn "%A" a
    let t = FloatTensor.RandomN([|10L|], device="cpu")
    printfn "%A" t
    let t2 = FloatTensor.RandomN([|10L|], device="cuda:0")
    printfn "%A" t2
    0 // return an integer exit code
```
This outputs the following:

```
true
[10], device = cpu
[10], device = cuda
```
I have LD_LIBRARY_PATH set to include /home/gunes/anaconda3/lib/python3.7/site-packages/torch/lib/, where I have the libtorch files that came with a standard PyTorch 1.5.0 installation through the normal pip install process.
```
-rwxrwxr-x 1 gunes gunes 225008 Jun 1 21:19 libc10_cuda.so*
-rwxrwxr-x 1 gunes gunes 472728 Jun 1 21:19 libc10.so*
-rwxrwxr-x 1 gunes gunes 1884384 Jun 1 21:19 libcaffe2_detectron_ops_gpu.so*
-rwxrwxr-x 1 gunes gunes 75768 Jun 1 21:19 libcaffe2_module_test_dynamic.so*
-rwxrwxr-x 1 gunes gunes 22016 Jun 1 21:19 libcaffe2_nvrtc.so*
-rwxrwxr-x 1 gunes gunes 118640 Jun 1 21:19 libcaffe2_observers.so*
-rwxrwxr-x 1 gunes gunes 523816 Jun 1 21:19 libcudart-80664282.so.10.2*
-rwxrwxr-x 1 gunes gunes 168720 Jun 1 21:19 libgomp-7c85b1e2.so.1*
-rwxrwxr-x 1 gunes gunes 22045456 Jun 1 21:19 libnvrtc-08c4863f.so.10.2*
-rwxrwxr-x 1 gunes gunes 4862944 Jun 1 21:19 libnvrtc-builtins.so*
-rwxrwxr-x 1 gunes gunes 43520 Jun 1 21:19 libnvToolsExt-3965bdd0.so.1*
-rwxrwxr-x 1 gunes gunes 41592 Jun 1 21:19 libshm.so*
-rwxrwxr-x 1 gunes gunes 267175432 Jun 1 21:19 libtorch_cpu.so*
-rwxrwxr-x 1 gunes gunes 1056836368 Jun 1 21:19 libtorch_cuda.so*
-rwxrwxr-x 1 gunes gunes 16760 Jun 1 21:19 libtorch_global_deps.so*
-rwxrwxr-x 1 gunes gunes 16535688 Jun 1 21:19 libtorch_python.so*
-rwxrwxr-x 1 gunes gunes 116240 Jun 1 21:19 libtorch.so*
```
I'm sharing the full file list in case it helps with debugging the problem. On this Ubuntu 20.04 system I have CUDA version 10.2.
That's great. I'm checking with #144 that the CUDA binaries we download pass the TorchSharp tests; I think they will. That would mean the problem is somewhere in the packaging or in how the binaries are placed in the application.
If possible, could you try the following:
- make an application again that references TorchSharp 0.3.52216 and libtorch-cuda-10.2-linux-x64 1.5.0, then clean and build
- list the contents of the application's native libraries after building, e.g. ConsoleApp7\ConsoleApp7\bin\Debug\netcoreapp3.1\runtimes\linux-x64\native - it should look similar to the above
- do a file comparison between the files in that directory and the files you've got above (they should be identical, with libLibTorchSharp.so added)
- the executable bit may not be set, but I understand that doesn't matter; try setting it with chmod +x .... then re-running the tests
- perhaps try moving the *.so files to the root ConsoleApp7\ConsoleApp7\bin\Debug\netcoreapp3.1
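The file-comparison step suggested above can be scripted. Here's a sketch that diffs each shared library in a known-good libtorch directory against the nuget-restored native directory, reporting anything that differs or is missing (both example paths in the comment are placeholders):

```shell
# Hypothetical helper for the comparison suggested above: compare every
# *.so file in a "good" directory against a "built" directory byte-for-byte.
compare_native_dirs () {
    good="$1"; built="$2"
    for f in "$good"/*.so*; do
        [ -e "$f" ] || continue            # empty directory: nothing to compare
        name=$(basename "$f")
        if [ -f "$built/$name" ]; then
            cmp -s "$f" "$built/$name" || echo "DIFFERS: $name"
        else
            echo "MISSING: $name"
        fi
    done
}

# Example usage (paths are placeholders for the two directories in this thread):
# compare_native_dirs "$HOME/anaconda3/lib/python3.7/site-packages/torch/lib" \
#                     bin/Debug/netcoreapp3.1/runtimes/linux-x64/native
# chmod +x bin/Debug/netcoreapp3.1/runtimes/linux-x64/native/*.so*   # restore exec bit
```

`cmp -s` is a byte comparison, so it would catch a truncated download even when file names and permissions all look right.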
If you like I could log on to your machine for a while and poke around.
Hi, file list and size comparison below. On the left side are the libtorch files that worked before (the ones listed in my previous message). On the right side are the files under bin/Debug/netcoreapp3.1/runtimes/linux-x64/native resulting from referencing TorchSharp 0.3.52216 and libtorch-cuda-10.2-linux-x64 1.5.0. This is the list for bin/Debug/netcoreapp3.1/runtimes/linux-x64/native:
```
-rwxrw-r-- 1 gunes gunes 225008 May 24 16:18 libc10_cuda.so*
-rwxrw-r-- 1 gunes gunes 35088 May 24 16:18 libc10d_cuda_test.so*
-rwxrw-r-- 1 gunes gunes 472728 May 24 16:18 libc10.so*
-rwxrw-r-- 1 gunes gunes 1884384 May 24 16:18 libcaffe2_detectron_ops_gpu.so*
-rwxrw-r-- 1 gunes gunes 75768 May 24 16:18 libcaffe2_module_test_dynamic.so*
-rwxrw-r-- 1 gunes gunes 22016 May 24 16:18 libcaffe2_nvrtc.so*
-rwxrw-r-- 1 gunes gunes 118640 May 24 16:18 libcaffe2_observers.so*
-rwxrw-r-- 1 gunes gunes 523816 May 24 16:18 libcudart-80664282.so.10.2*
-rwxrw-r-- 1 gunes gunes 346296 May 24 16:18 libfbjni.so*
-rwxrw-r-- 1 gunes gunes 168720 May 24 16:18 libgomp-7c85b1e2.so.1*
-rwxrw-r-- 1 gunes gunes 1416368 May 24 16:19 libLibTorchSharp.so*
-rwxrw-r-- 1 gunes gunes 22045456 May 24 16:18 libnvrtc-08c4863f.so.10.2*
-rwxrw-r-- 1 gunes gunes 4862944 May 24 16:18 libnvrtc-builtins.so*
-rwxrw-r-- 1 gunes gunes 43520 May 24 16:18 libnvToolsExt-3965bdd0.so.1*
-rwxrw-r-- 1 gunes gunes 312352 May 24 16:18 libpytorch_jni.so*
-rwxrw-r-- 1 gunes gunes 41592 May 24 16:18 libshm.so*
-rwxrw-r-- 1 gunes gunes 267175432 May 24 16:18 libtorch_cpu.so*
-rw------- 1 gunes gunes 900000000 May 28 02:00 libtorch_cuda.so
-rwxrw-r-- 1 gunes gunes 16760 May 24 16:18 libtorch_global_deps.so*
-rwxrw-r-- 1 gunes gunes 16535688 May 24 16:18 libtorch_python.so*
-rwxrw-r-- 1 gunes gunes 116240 May 24 16:18 libtorch.so*
```
In the nuget version there are some extra files (libc10d_cuda_test.so, libfbjni.so, libpytorch_jni.so), and I think the important-looking giant file libtorch_cuda.so is somehow "truncated". Perhaps this has something to do with the package parts and fragments I see when I click on "dependencies" here: https://www.nuget.org/packages/libtorch-cuda-10.2-linux-x64/
libtorch-cuda-10.2-linux-x64-part1 (>= 1.5.0)
libtorch-cuda-10.2-linux-x64-part2-fragment1 (>= 1.5.0)
libtorch-cuda-10.2-linux-x64-part2-fragment2 (>= 1.5.0)
libtorch-cuda-10.2-linux-x64-part2-fragment3 (>= 1.5.0)
libtorch-cuda-10.2-linux-x64-part2-primary (>= 1.5.0)
The following attempts are probably unlikely to work before the truncated libtorch_cuda.so file is fixed, but I still tried them. Setting the file permissions to be the same (-rwxrwxr-x) as the files that worked before didn't work. Copying all files under bin/Debug/netcoreapp3.1/runtimes/linux-x64/native to bin/Debug/netcoreapp3.1 didn't work either. "Didn't work" means the program silently fails in the way described in the first message of this issue.
One last thing I tried was to copy the "good" (working) libtorch_cuda.so file (size 1056836368) to bin/Debug/netcoreapp3.1/runtimes/linux-x64/native, replacing the broken libtorch_cuda.so file (size 900000000). This does not work if I run the console app with dotnet run, which, before execution, replaces the libtorch_cuda.so file again with the broken version from the nuget package. But it does run successfully if I just run the previously built console app executable in bin/Debug/netcoreapp3.1.
Thanks, yes, this has isolated the problem, I can see what the fix is.
I can't yet see what went wrong here, though the final size of this binary is definitely wrong, and indicates that one of the packages was missing:
```
-rw------- 1 gunes gunes 900000000 May 28 02:00 libtorch_cuda.so
```
It seems the problem must have been in the delivery of packages - perhaps one failed to download but the build continued.
I'll add some checking for hash sum etc.
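A checksum verification along those lines could be sketched like this (a Python sketch only, not the actual TorchSharp packaging code; the expected digest would come from the package metadata):

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so multi-GB binaries don't need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path, expected_hex):
    """Raise if the file on disk doesn't match its published digest."""
    actual = sha256_of(path)
    if actual != expected_hex:
        raise IOError(
            f"{path}: expected sha256 {expected_hex}, got {actual} "
            f"(truncated or corrupted download?)"
        )
```

A check like this would have caught the 900000000-byte libtorch_cuda.so above at restore time instead of letting the build continue.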
When they are delivered to my machine they result in a binary of the correct size (though not identical in size to yours). The reconstituted file on my machine:

```
$ ls -Flas /c/Users/dsyme/source/repos/ConsoleApp7/ConsoleApp7/bin/Debug/netcoreapp3.1/runtimes/linux-x64/native
1032064 -rw-r--r-- 1 dsyme 1049089 1056832272 Jun 1 19:44 libtorch_cuda.so
```

The original file I downloaded:

```
$ ls -Flas /c/GitHub/dsyme/libtorch-cuda-10.2/libtorch-shared-with-deps-1.5.0/libtorch/lib/libtorch_cuda.so
1032064 -rw-r--r-- 1 dsyme 1049089 1056832272 Apr 21 01:32 /c/GitHub/dsyme/libtorch-cuda-10.2/libtorch-shared-with-deps-1.5.0/libtorch/lib/libtorch_cuda.so
```
I did a git clean -fdx and dotnet build on the DiffSharp repo and I'm getting the correct size libtorch_cuda.so: 1056832272 (on Linux).
I don't quite know how to test it, though. Running tests/Test does not use any GPU and, besides, causes an out-of-memory error after a few batches.
When I force dsharp.config(backend=Backend.Torch, device=Device.GPU) I get an exception saying "CUDA non available in the current machine".
> I did a git clean -fdx and dotnet build on DiffSharp repo and I'm getting correct size libtorch_cuda.so: 1056832272 (on Linux)
Thanks for trying!
The Test.fsproj in dev is not quite right; it currently has this:

```xml
<PackageReference Include="libtorch-cpu" Version="$(LibTorchVersion)" />
<PackageReference Include="libtorch-cuda-10.2-linux-x64" Version="$(LibTorchVersion)" />
```
However only one of these two should be used. We will have to add some kind of protection against referencing both.
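One way such a guard could be expressed is in each runtime package's buildTransitive .targets file, roughly along these lines (a sketch only; the property and target names here are hypothetical, and the real packages may implement the check differently):

```xml
<!-- Hypothetical buildTransitive/netstandard2.0/libtorch-cpu.targets sketch.
     Each runtime package sets a marker property; the build errors out when
     markers from both the CPU and CUDA packages are present. -->
<Project>
  <PropertyGroup>
    <TorchSharpCpuPackageReferenced>true</TorchSharpCpuPackageReferenced>
  </PropertyGroup>
  <Target Name="_CheckSingleTorchSharpRuntimePackage" BeforeTargets="CoreCompile">
    <Error Condition="'$(TorchSharpCpuPackageReferenced)' == 'true' AND '$(TorchSharpCudaPackageReferenced)' == 'true'"
           Text="Two TorchSharp runtime packages have been referenced (both libtorch-cpu and libtorch-cuda)" />
  </Target>
</Project>
```

Shipping the check inside buildTransitive means the error fires even when the packages are pulled in indirectly through another project reference.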
> CUDA non available in the current machine

I'm presuming this is because libtorch-cpu took precedence. If you have a moment to try removing that and then checking what happens, that would be great.
So I've commented out the 'libtorch-cpu' and it gets a little bit further:
Downloading "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz" to "/home/peterk/work/tmp/DiffSharp/tests/Test/data/mnist/train-images-idx3-ubyte.gz"
Downloading "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz" to "/home/peterk/work/tmp/DiffSharp/tests/Test/data/mnist/train-labels-idx1-ubyte.gz"
Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
at DiffSharp.Backends.Torch.TorchRawTensor.ToRawData[[System.Single, System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]]()
at DiffSharp.Backends.Torch.TorchRawTensor.ToRawData()
at DiffSharp.Backends.Torch.TorchRawTensor.System-Runtime-Serialization-ISerializable-GetObjectData(System.Runtime.Serialization.SerializationInfo, System.Runtime.Serialization.StreamingContext)
at System.Runtime.Serialization.Formatters.Binary.WriteObjectInfo.InitSerialize(System.Object, System.Runtime.Serialization.ISurrogateSelector, System.Runtime.Serialization.StreamingContext, System.Runtime.Serialization.Formatters.Binary.SerObjectInfoInit, System.Runtime.Serialization.IFormatterConverter, System.Runtime.Serialization.Formatters.Binary.ObjectWriter, System.Runtime.Serialization.SerializationBinder)
at System.Runtime.Serialization.Formatters.Binary.ObjectWriter.Write(System.Runtime.Serialization.Formatters.Binary.WriteObjectInfo, System.Runtime.Serialization.Formatters.Binary.NameInfo, System.Runtime.Serialization.Formatters.Binary.NameInfo)
at System.Runtime.Serialization.Formatters.Binary.ObjectWriter.Serialize(System.Object, System.Runtime.Serialization.Formatters.Binary.BinaryFormatterWriter, Boolean)
at System.Runtime.Serialization.Formatters.Binary.BinaryFormatter.Serialize(System.IO.Stream, System.Object, Boolean)
at System.Runtime.Serialization.Formatters.Binary.BinaryFormatter.Serialize(System.IO.Stream, System.Object)
at DiffSharp.Util.saveBinary[[System.__Canon, System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]](System.__Canon, System.String)
at DiffSharp.Tensor.save(System.String)
at DiffSharp.DiffSharp.save(DiffSharp.Tensor, System.String)
at DiffSharp.Data.MNIST..ctor(System.String, Microsoft.FSharp.Core.FSharpOption`1<System.Collections.Generic.IEnumerable`1<System.String>>, Microsoft.FSharp.Core.FSharpOption`1<Boolean>, Microsoft.FSharp.Core.FSharpOption`1<Microsoft.FSharp.Core.FSharpFunc`2<DiffSharp.Tensor,DiffSharp.Tensor>>, Microsoft.FSharp.Core.FSharpOption`1<Microsoft.FSharp.Core.FSharpFunc`2<DiffSharp.Tensor,DiffSharp.Tensor>>)
at Program.main(System.String[])
Looks like a MNIST loader issue with Torch.
OK thanks, yes that's getting further. Saving GPU tensors is evidently busted (or perhaps that's by design and it's just not giving a good error message).
Could you send a PR to do the following?
- change TorchRawTensor.ToRawData to give a good error message for GPU tensors
- change tensor.save to always move the tensor to CPU first (double-check that's what PyTorch does)
- if that doesn't unblock things, try removing the dsharp.save calls in Data.fs (I think they're just there to reduce the cost of making/reloading data)
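The move-to-CPU-before-save idea might look roughly like this (an F# sketch only; `toCpu` and `serializeBinary` are hypothetical stand-ins for whatever DiffSharp actually exposes, not real API names):

```fsharp
// Hypothetical sketch: serialize via a host-memory copy so GPU-resident
// tensors don't hit the native ToRawData crash shown in the trace above.
let save (fileName: string) (t: Tensor) =
    // Copy device data to host memory before handing it to the (CPU-only)
    // binary serializer; this should be a no-op for tensors already on the CPU.
    let hostTensor = toCpu t
    serializeBinary fileName hostTensor
```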
I'm adjusting the TorchSharp packages so that referencing both libtorch-cpu and libtorch-cuda-* gives this:

```
Error Two TorchSharp runtime packages have been referenced (both libtorch-cpu and libtorch-cuda) ConsoleApp6 C:\Users\dsyme\.nuget\packages\libtorch-cpu\1.5.3\buildTransitive\netstandard2.0\libtorch-cpu.targets 6
```
I'll look into the ToRawData thing.
In the meanwhile I've tried to replace libtorch-cpu with libtorch-cuda-10.2-linux-x64 in the normal DiffSharp.Tests project and added dsharp.config(backend=Backend.Torch, device=Device.GPU) to the test fixture, and I'm getting:
The active test run was aborted. Reason: Test host process crashed : terminate called after throwing an instance of 'c10::Error'
what(): Expected one of cpu, cuda, mkldnn, opengl, opencl, ideep, hip, msnpu device type at start of device string: gpu (parse_type at /pytorch/c10/core/Device.cpp:37)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7fe315030536 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libc10.so)
frame #1: <unknown function> + 0x1a060 (0x7fe31501d060 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libc10.so)
frame #2: c10::Device::Device(std::string const&) + 0x1e4 (0x7fe31501d4c4 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libc10.so)
frame #3: <unknown function> + 0xd00f3 (0x7fe3155410f3 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #4: THSTensor_ones + 0x91 (0x7fe315502e41 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #5: [0x7fe319bf120c]
Could it be that the 'cuda' string was expected rather than 'gpu'?
BTW, I'm rather new to both DiffSharp as well as Torch (this is my first test) so take my reports with a grain of salt. I'm normally using Tensorflow for machine learning.
Yup, well, you're coming into quite a raw branch :-) We're in the middle of getting this to boot up :)
Change this:

```fsharp
| Device.GPU -> "gpu"
```

to

```fsharp
| Device.GPU -> "cuda"
```

thanks
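For context, libtorch only parses the device-string prefixes listed in the exception above ("cpu", "cuda", "mkldnn", ...), never "gpu", so the mapping ends up roughly like this (a sketch, assuming a Device union with CPU and GPU cases):

```fsharp
// Sketch of the Device -> libtorch device-string mapping after the fix.
let toTorchDeviceString (device: Device) =
    match device with
    | Device.CPU -> "cpu"
    | Device.GPU -> "cuda"   // was "gpu", which c10::Device refuses to parse
```

Appending an index, e.g. "cuda:0", selects a specific GPU, as in the original console example at the top of this thread.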
So I did that "gpu" -> "cuda" change and the error is indeed different:
X TestCurl [17ms]
Error Message:
System.Runtime.InteropServices.ExternalException : Expected object of device type cuda but got device type cpu for argument #3 'index' in call to _th_index_select (checked_dense_tensor_unwrap at /pytorch/aten/src/ATen/Utils.h:72)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7fb9548fc536 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libc10.so)
frame #1: <unknown function> + 0x1013b1b (0x7fb8f1deeb1b in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cuda.so)
frame #2: <unknown function> + 0x10493d7 (0x7fb8f1e243d7 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cuda.so)
frame #3: <unknown function> + 0xf96a7b (0x7fb8f1d71a7b in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cuda.so)
frame #4: <unknown function> + 0x10c5c23 (0x7fb92e329c23 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #5: <unknown function> + 0x2b4b952 (0x7fb92fdaf952 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #6: <unknown function> + 0x10c5c23 (0x7fb92e329c23 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #7: at::Tensor c10::KernelFunction::callUnboxed<at::Tensor, at::Tensor const&, long, at::Tensor const&>(c10::OperatorHandle const&, at::Tensor const&, long, at::Tensor const&) const + 0x14d (0x7fb954dfe2ad in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #8: at::Tensor c10::Dispatcher::callUnboxed<at::Tensor, at::Tensor const&, long, at::Tensor const&>(c10::OperatorHandle const&, at::Tensor const&, long, at::Tensor const&) const + 0xf6 (0x7fb954dfe136 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #9: THSTensor_index_select + 0x87 (0x7fb954dd7a67 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #10: [0x7fb9a152a509]
Stack Trace:
at TorchSharp.Torch.CheckForErrors()
at TorchSharp.Tensor.TorchTensor.IndexSelect(Int64 dimension, TorchTensor index)
at DiffSharp.Backends.Torch.TorchRawTensor.GetSlice(Int32[,] fullBounds) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Backends.Torch/Torch.RawTensor.fs:line 71
at DiffSharp.Tensor.GetSlice(Int32[,] bounds) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 375
at DiffSharp.Tensor.GetSlice(Int32[,] bounds) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 377
at DiffSharp.Tensor.get_Item(Int32[] index) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 384
at Tests.TestDiffSharp.fvect3vect3(Tensor x) in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/TestDiffSharp.fs:line 44
at <StartupCode$DiffSharp-Tests>.$TestDiffSharp.TestCurl@519.Invoke(Tensor x) in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/TestDiffSharp.fs:line 519
at DiffSharp.DiffSharp.evalReverseDiff(FSharpFunc`2 f, Tensor x) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/DiffSharp.fs:line 250
at DiffSharp.DiffSharp.fjacobian(FSharpFunc`2 f, Tensor x) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/DiffSharp.fs:line 298
at DiffSharp.DiffSharp.fcurl(FSharpFunc`2 f, Tensor x) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/DiffSharp.fs:line 341
at Tests.TestDiffSharp.TestCurl() in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/TestDiffSharp.fs:line 519
Cool that's getting further. How many tests passed? (if any)
(nb. without IndexSelect working on GPU tensors I wouldn't expect many)
BTW good to see the reasonably useful error stacks coming out, that's also encouraging
Well, I only configured the GPU in the TestDiffSharp.fs file, which contains 26 tests. What I'm getting is 12 failed and 83 passing tests. I'll try other test files as well.
Cool, thanks. 83 is pretty good for a first run. TestTensor.fs will likely contain some failures. Feel free to send the full lists of passing/failing tests, thanks.
Ah, the problem is here in IndexSelect in the Torch backend:

```fsharp
let idxs = LongTensor.Arange(int64 start, int64 stop, 1L)
```

This is creating a CPU tensor, when it should be creating one with the same characteristics as the input. I'll prep a fix, paste it here and start a PR.
I've added [<SetUp>] to all tests to configure the GPU, and the failing tests are:
TestDerivativeGather
TestCurl
TestCurlDivergence
TestDivergence
TestGrad
TestGradhessian
TestGradhessianv
TestGradv
TestHessian
TestHessianv
TestJacobian
They all report the same exception (quoted above).
Cool, the fix should just be this:

```fsharp
let idxs = LongTensor.Arange(int64 start, int64 stop, 1L, device=toTorchDevice t.Device)
```
I applied the change and now I'm getting:
The active test run was aborted. Reason: Test host process crashed : Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
at TorchSharp.Tensor.TorchTensor.DataItem[[System.Int32, System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]]()
at DiffSharp.Backends.Torch.TorchRawTensor.GetItem(Int32[])
at DiffSharp.Backends.Torch.TorchRawTensor.ToValuesTyped[[System.Int32, System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[System.Int32, System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]](Microsoft.FSharp.Core.FSharpFunc`2<Int32,Int32>)
at DiffSharp.Backends.Torch.TorchRawTensor.ToValues()
at DiffSharp.Backends.RawTensor.ToScalar()
at DiffSharp.Tensor.toScalar()
at <StartupCode$DiffSharp-Core>.$Tensor+push@1760.Invoke(Microsoft.FSharp.Collections.FSharpList`1<System.Tuple`2<DiffSharp.Tensor,DiffSharp.Tensor>>)
at DiffSharp.Tensor.reversePush(DiffSharp.Tensor)
at DiffSharp.Tensor.reverse(Microsoft.FSharp.Core.FSharpOption`1<DiffSharp.Tensor>, Microsoft.FSharp.Core.FSharpOption`1<Boolean>)
at Tests.TestDerivatives.TestDerivativeGather()
at System.RuntimeMethodHandle.InvokeMethod(System.Object, System.Object[], System.Signature, Boolean, Boolean)
at System.Reflection.RuntimeMethodInfo.Invoke(System.Object, System.Reflection.BindingFlags, System.Reflection.Binder, System.Object[], System.Globalization.CultureInfo)
OK, yes, this is DataItem on a GPU tensor again.
Collected fixes are here, thanks: https://github.com/DiffSharp/DiffSharp/pull/119 - I think it should include fixes for all of the above.
@pkese I have merged those fixes to dev if you want to pull and give it another crack.
I'll also add a bug to TorchSharp about TorchSharp.Tensor.TorchTensor.DataItem giving a hard crash when used on a GPU tensor - it should at least give a decent exception.
After applying DiffSharp #119 there are many more tests passing:

```
Total tests: 266
Passed: 241
Failed: 25
```
There are two common error types:
X TestGrad [102ms]
Error Message:
Expected: <Tensor [-149.000000, 50.000000]>
But was: <Tensor [0.000000, 0.000000]>
Stack Trace:
at Tests.TestDiffSharp.TestGrad() in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/TestDiffSharp.fs:line 192
X TestGradhessian [104ms]
Error Message:
Expected: <Tensor [[1702.000000, -600.000000],
[-600.000000, 200.000000]]>
But was: <Tensor [[0.000000, 0.000000],
[0.000000, 0.000000]]>
Stack Trace:
at Tests.TestDiffSharp.TestGradhessian() in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/TestDiffSharp.fs:line 435
X TestGradhessianv [34ms]
Error Message:
Expected: <Tensor [2051.000000, -700.000000]>
But was: <Tensor [0.000000, 0.000000]>
Stack Trace:
at Tests.TestDiffSharp.TestGradhessianv() in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/TestDiffSharp.fs:line 408
X TestHessian [97ms]
Error Message:
Expected: <Tensor [[1702.000000, -600.000000],
[-600.000000, 200.000000]]>
But was: <Tensor [[0.000000, 0.000000],
[0.000000, 0.000000]]>
Stack Trace:
at Tests.TestDiffSharp.TestHessian() in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/TestDiffSharp.fs:line 492
X TestHessianv [56ms]
Error Message:
Expected: <Tensor [2051.000000, -700.000000]>
But was: <Tensor [0.000000, 0.000000]>
Stack Trace:
at Tests.TestDiffSharp.TestHessianv() in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/TestDiffSharp.fs:line 466
X TestJacobian [100ms]
Error Message:
Expected: <Tensor [[1.000000, 4.000000, 2.000000],
[4.000000, 0.000000, 0.000000]]>
But was: <Tensor [[0.000000, 0.000000, 0.000000],
[0.000000, 0.000000, 0.000000]]>
Stack Trace:
at Tests.TestDiffSharp.TestJacobian() in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/TestDiffSharp.fs:line 334
X TestJacobianTv [22ms]
Error Message:
Expected: <Tensor [-124.375000, -136.875000, -51.875000]>
But was: <Tensor [0.000000, 0.000000, 0.000000]>
Stack Trace:
at Tests.TestDiffSharp.TestJacobianTv() in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/TestDiffSharp.fs:line 290
X TestLaplacian [54ms]
Error Message:
Expected: <Tensor 1902.000000>
But was: <Tensor 0.000000>
Stack Trace:
at Tests.TestDiffSharp.TestLaplacian() in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/TestDiffSharp.fs:line 512
and
X TestOne [7ms]
Error Message:
System.Runtime.InteropServices.ExternalException : Expected object of device type cuda but got device type cpu for argument #2 'other' in call to _th_equal (checked_dense_tensor_unwrap at /pytorch/aten/src/ATen/Utils.h:72)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7fbb55af5536 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libc10.so)
frame #1: <unknown function> + 0x1013b1b (0x7fbaa5deeb1b in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cuda.so)
frame #2: <unknown function> + 0x1044cd2 (0x7fbaa5e1fcd2 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cuda.so)
frame #3: <unknown function> + 0xf79e80 (0x7fbaa5d54e80 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cuda.so)
frame #4: <unknown function> + 0x2b2b34c (0x7fbae3d8f34c in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #5: bool c10::Dispatcher::callUnboxedWithDispatchKey<bool, at::Tensor const&, at::Tensor const&>(c10::OperatorHandle const&, c10::DispatchKey, at::Tensor const&, at::Tensor const&) const + 0x181 (0x7fbb55ffd291 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #6: THSTensor_equal + 0x4c (0x7fbb55fdc87c in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #7: [0x7fbb5b08e843]
Stack Trace:
at TorchSharp.Torch.CheckForErrors()
at TorchSharp.Tensor.TorchTensor.Equal(TorchTensor target)
at DiffSharp.Backends.Torch.TorchRawTensor.Equals(RawTensor t2) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Backends.Torch/Torch.RawTensor.fs:line 311
at DiffSharp.Tensor.Equals(Object other) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 220
at NUnit.Framework.Constraints.NUnitEqualityComparer.AreEqual(Object x, Object y, Tolerance& tolerance, Boolean topLevelComparison)
at NUnit.Framework.Constraints.EqualConstraint.ApplyTo[TActual](TActual actual)
at NUnit.Framework.Assert.That[TActual](TActual actual, IResolveConstraint expression, String message, Object[] args)
at Tests.TestDiffSharp.TestOne() in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/TestDiffSharp.fs:line 77
The last error type appears in TestOne, TestZero, TestModelClone, TestModelLinear, TestModelParametersDiff, TestModelSaveLoad, TestModelSaveLoadParameters and some Optimizer tests.
...I'm not sure if some of these errors are appearing because I forced dsharp.config(backend=Backend.Torch, device=Device.GPU) in all test setups. If I remove that then all tests pass, but apparently the GPU is not being used.
Even tests/Test now starts. After a while it reports an OOM:
net params: 1199882
Torch
Duration |Iters| Ep| Minib| Loss
0.00:00:03 | 1 | 1 | 1/937 | 2.316844e+000 🡾 New min
Unhandled exception. System.Runtime.InteropServices.ExternalException (0x80004005): CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 5.93 GiB total capacity; 4.86 GiB already allocated; 192.00 KiB free; 5.04 GiB reserved in total by PyTorch) (malloc at /pytorch/c10/cuda/CUDACachingAllocator.cpp:289)
but at least it is consistent with the CPU version which also runs out of memory.
It appears that there are several cases missing the proper conversions. Above it is _th_equal, but there's also _th_mm:
X TestModelParametersDiff [7ms]
Error Message:
System.Runtime.InteropServices.ExternalException : Expected object of device type cuda but got device type cpu for argument #1 'self' in call to _th_mm (checked_dense_tensor_unwrap at /pytorch/aten/src/ATen/Utils.h:72)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7fbb55af5536 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libc10.so)
frame #1: <unknown function> + 0x1013b1b (0x7fbaa5deeb1b in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cuda.so)
frame #2: <unknown function> + 0x10539b9 (0x7fbaa5e2e9b9 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cuda.so)
frame #3: <unknown function> + 0xf76dc8 (0x7fbaa5d51dc8 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cuda.so)
frame #4: <unknown function> + 0x10c3ec0 (0x7fbae2327ec0 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #5: <unknown function> + 0x2c9b6fe (0x7fbae3eff6fe in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #6: <unknown function> + 0x10c3ec0 (0x7fbae2327ec0 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #7: at::Tensor c10::Dispatcher::callUnboxedWithDispatchKey<at::Tensor, at::Tensor const&, at::Tensor const&>(c10::OperatorHandle const&, c10::DispatchKey, at::Tensor const&, at::Tensor const&) const + 0x17c (0x7fbb55f74a6c in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #8: at::Tensor::mm(at::Tensor const&) const + 0xa2 (0x7fbb55fefed2 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #9: THSTensor_mm + 0x5d (0x7fbb55fe0b7d in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #10: [0x7fbb5b813261]
Stack Trace:
at TorchSharp.Torch.CheckForErrors()
at TorchSharp.Tensor.TorchTensor.Mm(TorchTensor target)
at DiffSharp.Backends.Torch.TorchRawTensor.MatMulT2T2(RawTensor t2) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Backends.Torch/Torch.RawTensor.fs:line 483
at <StartupCode$DiffSharp-Core>.$Tensor.fRaw@757-17.Invoke(Tuple`2 tupledArg) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 757
at DiffSharp.Tensor.matmul(Tensor b) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 765
at <StartupCode$DiffSharp-Core>.$Tensor.push@1760.Invoke(FSharpList`1 ts) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 1811
at DiffSharp.Tensor.reversePush(Tensor value) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 1918
at DiffSharp.Tensor.reverse(FSharpOption`1 value, FSharpOption`1 zeroDerivatives) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 1639
at DiffSharp.DiffSharp.reverse(Tensor value, Tensor tensor) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/DiffSharp.fs:line 246
at <StartupCode$DiffSharp-Core>.$DiffSharp.r@251.Invoke(Tensor v) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/DiffSharp.fs:line 251
at DiffSharp.DiffSharp.fgrad(FSharpFunc`2 f, Tensor x) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/DiffSharp.fs:line 311
at DiffSharp.DiffSharp.grad(FSharpFunc`2 f, Tensor x) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/DiffSharp.fs:line 312
at Tests.TestModel.TestModelParametersDiff() in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/TestModel.fs:line 148
and multiple cases where TorchSharp.Tensor.TorchTensor.Mul fails:
X TestModelLinear [6ms]
Error Message:
System.Runtime.InteropServices.ExternalException : expected device cpu but got device cuda:0 (compute_types at /pytorch/aten/src/ATen/native/TensorIterator.cpp:246)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7fbb55af5536 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libc10.so)
frame #1: at::TensorIterator::compute_types() + 0x17d4 (0x7fbae209fc74 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #2: at::TensorIterator::build() + 0x44 (0x7fbae20a1b64 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #3: at::TensorIterator::binary_op(at::Tensor&, at::Tensor const&, at::Tensor const&, bool) + 0x146 (0x7fbae20a2216 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #4: at::native::mul(at::Tensor const&, at::Tensor const&) + 0x3a (0x7fbae1dc1eba in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #5: <unknown function> + 0xf76ef8 (0x7fbaa5d51ef8 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cuda.so)
frame #6: <unknown function> + 0x10c3ec0 (0x7fbae2327ec0 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #7: <unknown function> + 0x2d2e779 (0x7fbae3f92779 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #8: <unknown function> + 0x10c3ec0 (0x7fbae2327ec0 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #9: at::Tensor c10::Dispatcher::callUnboxedWithDispatchKey<at::Tensor, at::Tensor const&, at::Tensor const&>(c10::OperatorHandle const&, c10::DispatchKey, at::Tensor const&, at::Tensor const&) const + 0x17c (0x7fbb55f74a6c in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #10: at::Tensor::mul(at::Tensor const&) const + 0xa2 (0x7fbb55fefff2 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #11: THSTensor_mul + 0x5d (0x7fbb55fe0dbd in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #12: [0x7fbb5b7cc071]
Stack Trace:
at TorchSharp.Torch.CheckForErrors()
at TorchSharp.Tensor.TorchTensor.Mul(TorchTensor target)
at DiffSharp.Backends.Torch.TorchRawTensor.MulTT(RawTensor t2) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Backends.Torch/Torch.RawTensor.fs:line 425
at <StartupCode$DiffSharp-Core>.$Tensor.fRaw@615-8.Invoke(Tuple`2 tupledArg) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 615
at DiffSharp.Tensor.op_Multiply(Tensor a, Tensor b) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 623
at <StartupCode$DiffSharp-Core>.$Tensor.push@1760.Invoke(FSharpList`1 ts) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 1788
at DiffSharp.Tensor.reversePush(Tensor value) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 1918
at DiffSharp.Tensor.reverse(FSharpOption`1 value, FSharpOption`1 zeroDerivatives) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 1639
at DiffSharp.DiffSharp.reverse(Tensor value, Tensor tensor) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/DiffSharp.fs:line 246
at <StartupCode$DiffSharp-Core>.$DiffSharp.r@251.Invoke(Tensor v) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/DiffSharp.fs:line 251
at DiffSharp.DiffSharp.fgrad(FSharpFunc`2 f, Tensor x) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/DiffSharp.fs:line 311
at DiffSharp.DiffSharp.grad(FSharpFunc`2 f, Tensor x) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/DiffSharp.fs:line 312
at Tests.TestModel.TestModelLinear() in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/TestModel.fs:line 258
Cool, thank you!
I think https://github.com/DiffSharp/DiffSharp/pull/120 should address half of those. Possibly more.
It's hard to tell why the grad/Jacobian tests etc. are failing just from them returning zeros - they exercise quite a lot of functionality. It could plausibly be the same root cause, though I expect one or two more glitches.
...I'm not sure if some of these errors are appearing because I forced
dsharp.config(backend=Backend.Torch, device=Device.GPU)
in all test Setups.
It is quite a stress test! I was planning on gingerly turning things on test by test but this is much more effective at flushing out bugs :-)
I merged https://github.com/DiffSharp/DiffSharp/pull/120 if you want to try it again, thanks
Yay, you're now at
Total tests: 266
Passed: 263
Failed: 3
All that remains are 3 identical cases of
X TestModelClone [32ms]
Error Message:
System.Runtime.InteropServices.ExternalException : Expected object of device type cuda but got device type cpu for argument #2 'other' in call to _th_equal (checked_dense_tensor_unwrap at /pytorch/aten/src/ATen/Utils.h:72)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7f4ec46c5536 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libc10.so)
frame #1: <unknown function> + 0x1013b1b (0x7f4e15deeb1b in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cuda.so)
frame #2: <unknown function> + 0x1044cd2 (0x7f4e15e1fcd2 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cuda.so)
frame #3: <unknown function> + 0xf79e80 (0x7f4e15d54e80 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cuda.so)
frame #4: <unknown function> + 0x2b2b34c (0x7f4e53d8f34c in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #5: bool c10::Dispatcher::callUnboxedWithDispatchKey<bool, at::Tensor const&, at::Tensor const&>(c10::OperatorHandle const&, c10::DispatchKey, at::Tensor const&, at::Tensor const&) const + 0x181 (0x7f4ec4bcd291 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #6: THSTensor_equal + 0x4c (0x7f4ec4bac87c in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #7: [0x7f4ec9c8e843]
Stack Trace:
at TorchSharp.Torch.CheckForErrors()
at TorchSharp.Tensor.TorchTensor.Equal(TorchTensor target)
at DiffSharp.Backends.Torch.TorchRawTensor.Equals(RawTensor t2) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Backends.Torch/Torch.RawTensor.fs:line 311
at DiffSharp.Tensor.Equals(Object other) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 220
at NUnit.Framework.Constraints.NUnitEqualityComparer.AreEqual(Object x, Object y, Tolerance& tolerance, Boolean topLevelComparison)
at NUnit.Framework.Constraints.EqualConstraint.ApplyTo[TActual](TActual actual)
at NUnit.Framework.Assert.That[TActual](TActual actual, IResolveConstraint expression, String message, Object[] args)
at Tests.TestModel.TestModelClone() in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/TestModel.fs:line 241
Interesting... After replacing TorchRawTensor.Equals with
override t.Equals(t2:RawTensor) : bool =
    if dtype = t2.Dtype then
        let r1 = (shape = t2.Shape)
        if not r1 then false else
        let tt2 = t2.MoveTo(device).TorchTensor
        let r2 = t.MoveTo(device).TorchTensor.Equal(tt2)
        r2
    else
        opNotSupported2 "Equals" dtype t2.Dtype
(I've moved both sides to the device)
...I'm now getting different 3 test failures:
X TestModelClone [31ms]
Error Message:
System.Runtime.InteropServices.ExternalException : Expected object of device type cuda but got device type cpu for argument #2 'mat2' in call to _th_mm (checked_dense_tensor_unwrap at /pytorch/aten/src/ATen/Utils.h:72)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7f5e8a344536 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libc10.so)
frame #1: <unknown function> + 0x1013b1b (0x7f5de1deeb1b in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cuda.so)
frame #2: <unknown function> + 0x10539df (0x7f5de1e2e9df in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cuda.so)
frame #3: <unknown function> + 0xf76dc8 (0x7f5de1d51dc8 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cuda.so)
frame #4: <unknown function> + 0x10c3ec0 (0x7f5e1e327ec0 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #5: <unknown function> + 0x2c9b6fe (0x7f5e1feff6fe in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #6: <unknown function> + 0x10c3ec0 (0x7f5e1e327ec0 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #7: at::Tensor c10::Dispatcher::callUnboxedWithDispatchKey<at::Tensor, at::Tensor const&, at::Tensor const&>(c10::OperatorHandle const&, c10::DispatchKey, at::Tensor const&, at::Tensor const&) const + 0x17c (0x7f5e8a7c3a6c in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #8: at::Tensor::mm(at::Tensor const&) const + 0xa2 (0x7f5e8a83eed2 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #9: THSTensor_mm + 0x5d (0x7f5e8a82fb7d in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #10: [0x7f5e8ec4b287]
Stack Trace:
at TorchSharp.Torch.CheckForErrors()
at TorchSharp.Tensor.TorchTensor.Mm(TorchTensor target)
at DiffSharp.Backends.Torch.TorchRawTensor.MatMulT2T2(RawTensor t2) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Backends.Torch/Torch.RawTensor.fs:line 483
at <StartupCode$DiffSharp-Core>.$Tensor.fRaw@757-17.Invoke(Tuple`2 tupledArg) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 757
at DiffSharp.Tensor.matmul(Tensor b) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 765
at DiffSharp.DiffSharp.matmul(Tensor a, Tensor b) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/DiffSharp.fs:line 97
at DiffSharp.Model.Linear.forward(Tensor value) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Model.fs:line 145
at Tests.ModelStyle1a.forward(Tensor x) in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/TestModel.fs:line 15
at DiffSharp.Model.Model.op_MinusMinusGreater(Tensor t, Model m) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Model.fs:line 115
at Tests.TestModel.TestModelClone() in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/TestModel.fs:line 243
All are in TorchSharp.Tensor.TorchTensor.Mm
... and then after changing
override t1.MatMulT2T2(t2) =
    match dtype with
    | Dtype.Bool -> opNotSupported2 "MatMulT2T2" t1.Dtype t2.Dtype
    | _ ->
        Shape.checkCanMatmul t1.Shape t2.Shape
        let tt' = Utils.torchMoveTo tt device
        let result = tt'.Mm(t2.MoveTo(device).TorchTensor)
        t1.MakeLike(result, [| t1.Shape.[0]; t2.Shape.[1] |])
it's now
X TestModelClone [32ms]
Error Message:
System.Runtime.InteropServices.ExternalException : expected device cuda:0 but got device cpu (compute_types at /pytorch/aten/src/ATen/native/TensorIterator.cpp:246)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7f81ed94a536 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libc10.so)
frame #1: at::TensorIterator::compute_types() + 0x17d4 (0x7f817a09fc74 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #2: at::TensorIterator::build() + 0x44 (0x7f817a0a1b64 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #3: at::TensorIterator::binary_op(at::Tensor&, at::Tensor const&, at::Tensor const&, bool) + 0x146 (0x7f817a0a2216 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #4: at::native::add(at::Tensor const&, at::Tensor const&, c10::Scalar) + 0x45 (0x7f8179dc10a5 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #5: <unknown function> + 0xf74c65 (0x7f813dd4fc65 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cuda.so)
frame #6: <unknown function> + 0x10c599b (0x7f817a32999b in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #7: <unknown function> + 0x2c0c428 (0x7f817be70428 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #8: <unknown function> + 0x10c599b (0x7f817a32999b in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libtorch_cpu.so)
frame #9: at::Tensor c10::KernelFunction::callUnboxed<at::Tensor, at::Tensor const&, at::Tensor const&, c10::Scalar>(c10::OperatorHandle const&, at::Tensor const&, at::Tensor const&, c10::Scalar) const + 0x134 (0x7f81eddc8b24 in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #10: at::Tensor c10::Dispatcher::callUnboxed<at::Tensor, at::Tensor const&, at::Tensor const&, c10::Scalar>(c10::OperatorHandle const&, at::Tensor const&, at::Tensor const&, c10::Scalar) const + 0x12c (0x7f81eddc89bc in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #11: THSTensor_add + 0xae (0x7f81ede2708e in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/bin/Debug/netcoreapp3.0/runtimes/linux-x64/native/libLibTorchSharp.so)
frame #12: [0x7f81f332381a]
Stack Trace:
at TorchSharp.Torch.CheckForErrors()
at TorchSharp.Tensor.TorchTensor.Add(TorchTensor target, Scalar alpha)
at TorchSharp.Tensor.TorchTensor.Add(TorchTensor target)
at DiffSharp.Backends.Torch.TorchRawTensor.AddT2T1(RawTensor t2) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Backends.Torch/Torch.RawTensor.fs:line 389
at <StartupCode$DiffSharp-Core>.$Tensor.fRaw@529-3.Invoke(Tuple`2 tupledArg) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 529
at DiffSharp.Tensor.op_Addition(Tensor a, Tensor b) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 537
at DiffSharp.Model.Linear.forward(Tensor value) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Model.fs:line 146
at Tests.ModelStyle1a.forward(Tensor x) in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/TestModel.fs:line 15
at DiffSharp.Model.Model.op_MinusMinusGreater(Tensor t, Model m) in /home/peterk/work/tmp/DiffSharp/src/DiffSharp.Core/Model.fs:line 115
at Tests.TestModel.TestModelClone() in /home/peterk/work/tmp/DiffSharp/tests/DiffSharp.Tests/TestModel.fs:line 243
...turtles all the way down.
I'm wondering if there's a more generic approach, rather than moving tensors to the correct device before each invocation.
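One generic pattern (a minimal sketch only, not DiffSharp's or TorchSharp's actual code - `FakeTensor`, `move_to`, and `binary_op` are all hypothetical names) is to funnel every binary operation through a single helper that coerces both operands onto a common device first, so individual ops like MulTT or MatMulT2T2 don't each need their own move-to-device calls:

```python
from dataclasses import dataclass

@dataclass
class FakeTensor:
    """Stand-in for a backend tensor: just data plus a device tag."""
    data: list
    device: str = "cpu"

    def move_to(self, device: str) -> "FakeTensor":
        # A real backend would copy the storage across devices here.
        return FakeTensor(self.data, device)

def binary_op(op, a: FakeTensor, b: FakeTensor) -> FakeTensor:
    """Run `op` with both operands coerced to `a`'s device."""
    if b.device != a.device:
        b = b.move_to(a.device)
    return FakeTensor(op(a.data, b.data), a.device)

# Mixed-device multiply: the cpu operand is promoted to cuda:0 first.
r = binary_op(lambda x, y: [p * q for p, q in zip(x, y)],
              FakeTensor([1, 2], "cuda:0"), FakeTensor([3, 4], "cpu"))
```

With this shape, each concrete op only supplies the kernel; the device bookkeeping lives in one place.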
Yay, you're now at Total tests: 266 Passed: 263 Failed: 3
That's great!
So which tests fail? I only see TestModelClone
above
TestModelClone
TestModelSaveLoad
TestModelSaveLoadParameters
...all with the same error in TorchSharp.Tensor.TorchTensor.Add...
...or TorchSharp.Tensor.TorchTensor.Mm or TorchSharp.Tensor.TorchTensor.Equal
One way of specifying whether to use the GPU is to set the CUDA_VISIBLE_DEVICES environment variable.
Apparently PyTorch (including the one packaged in TorchSharp) respects this variable.
So if I set CUDA_VISIBLE_DEVICES=-1 (-1 is a way to disable CUDA), then DiffSharp says Unhandled exception. System.InvalidOperationException: CUDA non available in the current machine.
Otherwise, when CUDA_VISIBLE_DEVICES=0 or the variable is omitted, it takes the default GPU and works as expected.
We could detect when CUDA_VISIBLE_DEVICES is set to -1 and default to the CPU even if libtorch-cuda-10.2-linux-x64 is installed.
So the possible logic would be:
1) libtorch-cuda installed
   1.1) CUDA_VISIBLE_DEVICES >= 0 -> Use GPU
   1.2) CUDA_VISIBLE_DEVICES < 0 -> Use CPU
2) libtorch-cuda not installed -> Use CPU
This could be a way to automate tests (e.g. run them twice once with and once without CUDA).
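The numbered logic above can be sketched as a small device-selection function (an illustrative sketch only; `libtorch_cuda_installed` stands in for a hypothetical check that the libtorch-cuda native package is referenced):

```python
import os

def default_device(libtorch_cuda_installed: bool) -> str:
    """Pick a default device from the proposed CUDA_VISIBLE_DEVICES logic."""
    if not libtorch_cuda_installed:
        return "cpu"
    # CUDA_VISIBLE_DEVICES=-1 is the conventional way to disable CUDA;
    # an unset variable leaves the default GPU visible.
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "0")
    try:
        if int(visible.split(",")[0]) < 0:
            return "cpu"
    except ValueError:
        pass  # non-numeric entries (e.g. GPU UUIDs) leave CUDA enabled
    return "cuda:0"
```

For example, `default_device(True)` with `CUDA_VISIBLE_DEVICES=-1` in the environment would select `"cpu"`, matching case 1.2 above.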
TestModelClone
I see the problem. The save/load implied by the clone has, I think, moved everything to be CPU tensors. I believe https://github.com/DiffSharp/DiffSharp/pull/121 fixes it
One way for specifying whether to use GPU or not is to set CUDA_VISIBLE_DEVICES environment variable....
For defaults we should follow the PyTorch behaviour, yes.
Just to check: for PyTorch, when libtorch_cuda is present and CUDA_VISIBLE_DEVICES >= 0, does torch.tensor([0, 1, 2, 3]) create a CPU or GPU tensor by default (with no explicit configuration)?
For in-repo testing I guess we should have Combos respect at least CUDA_VISIBLE_DEVICES. It's a little hard to have DiffSharp.Tests reference both libtorch-cpu and libtorch-cuda; unfortunately, the test project really needs to reference one or the other. I guess we could have two test .fsproj files - one referencing libtorch-cpu and the other libtorch-cuda - and then in Combos detect IsCudaAvailable()
For CI testing it will depend on us getting GPU machines in CI devops.
Let's continue this in the DiffSharp repo https://github.com/DiffSharp/DiffSharp/issues/122
Closing as we've established the TorchSharp packages are working
Hi, I can successfully run
TorchSharp 0.3.52216
with the libtorch-cpu 1.5.0
native runtime package. However, when I use libtorch-cuda-10.2-linux-x64 1.5.0
, the code silently crashes when I attempt to run any TorchSharp method. For example, in F#:
works and prints
when used with
libtorch-cpu 1.5.0
. But it silently fails after the first line when used with libtorch-cuda-10.2-linux-x64 1.5.0
, with the following output: Initially, I was testing whether I can create CUDA tensors with TorchSharp, and I encountered this problem. Then I noticed that even a call to
Torch.IsCudaAvailable()
or the creation of a CPU tensor also fails with this runtime. In normal usage we would expect to be able to create both CUDA and CPU tensors with libtorch-cuda-10.2-linux-x64 1.5.0
, and the nuget package does seem to include libtorch_cpu.so
. Note that some months ago I could successfully create CUDA tensors with TorchSharp, relying on a manually installed libtorch in my system and setting
LD_LIBRARY_PATH
to point to the folder holding libtorch.so
and the other libtorch library files. Note that this doesn't work now with the latest setup. I'm running these on Ubuntu 20.04 with .NET Core SDK 3.1.201.