[Open] K024 opened this issue 11 months ago
Okay, thank you for the issue! I'll be taking the rest of the year off after today, and there's no chance of getting a fix into a release before January. A temporary workaround may be to generate random values on CPU and then move the resulting tensor to the CUDA device.
@K024:
The bug is pretty obvious -- this was a TODO in the C++ code. I hadn't discovered how to create a CUDA generator, but I believe I know how to now.
That said, it's going to be more involved than I had hoped. Here's why:
When the TorchSharp packages are built, LibTorchSharp (the native / .NET interop layer) is included in the TorchSharp package itself, not in the backend packages, so it can use only APIs that are available across all backends. Concretely, the interop layer links only against torch.dll and torch_cpu.dll (and the corresponding .so and .dylib files), which every backend provides. Those libraries offer a certain amount of device generality, but most CUDA-specific APIs are not exposed through them.
So, for example, the general APIs let us test whether CUDA is available and get the default CUDA RNG, but they do not let us create new CUDA generators. There are other CUDA-specific APIs we would like to reach, as well.
In order to address this, LibTorchSharp will have to be built separately for each device type (CPU, CUDA, and in the future AMD) and bundled with the backend packages instead. It is certainly something we can do, but it will take time and effort.
In the meantime, we can have the Generator constructor hook everything up to the default CUDA generator, but that will share state between all such generators. The alternative is what I outlined above: create random tensors on CPU with a custom CPU generator and then move the output to GPU.
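A minimal sketch of that workaround, assuming the `torch.Generator(seed, device)` constructor and the `generator:` parameter on `torch.randn` (both API shapes are assumptions here, not confirmed against this exact release):

```csharp
using System;
using TorchSharp;
using static TorchSharp.torch;

// Workaround sketch (API names assumed): draw the random values with a
// seeded CPU generator, then move the finished tensor to the CUDA device.
var cpuGen = new Generator(seed: 42, device: CPU);
var x = randn(new long[] { 2, 3 }, generator: cpuGen);
var xCuda = x.to(CUDA); // only the data moves; the RNG state stays on CPU
Console.WriteLine(xCuda);
```

The tradeoff is that the sampling itself runs on the CPU, so it will be slower for very large tensors, but each generator keeps its own isolated state.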
Minimal reproduction:
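(A sketch of the kind of call being reported, assuming TorchSharp's `torch.Generator(seed, device)` constructor and the `generator:` parameter on `torch.randn`:)

```csharp
using System;
using TorchSharp;
using static TorchSharp.torch;

// Repro sketch (API names assumed): ask for a generator on the CUDA
// device and sample from it there.
var gen = new Generator(seed: 42, device: CUDA);
var x = randn(new long[] { 2, 3 }, device: CUDA, generator: gen);
Console.WriteLine(x);
```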
Output:
This also won't work:
Output:
TorchSharp: 0.101.4
libtorch (loaded from conda): pytorch 2.1.0 py3.10_cuda12.1_cudnn8.9.2_0
Update:
This issue may be more complicated. The equivalent code works in Python/PyTorch, and the device of the state tensor is exactly `cpu`, with shape `[16]`. A rolling offset is also used in PyTorch.