StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Pygion: import torch gives 'duplicate registration of function' #1501

Open · CharlelieLrt opened 1 year ago

CharlelieLrt commented 1 year ago

I am importing PyTorch with CUDA enabled into a Pygion script hello.py, and this gives me a lot of "duplicate registration of function" warning messages. PyTorch is version 1.12.0, built from source with CUDA 11.6.

hello.py:

import torch
from pygion import task

@task
def main():
    print("Hello")

if __name__ == '__main__':
    main()

And this is run with legion_python -ll:py 1 -ll:gpu 1. This gives a lot of warnings:

[0 - 20002647f8b0]    2.282787 {4}{gpu}: duplicate registration of function _ZN2at6native13col2im_kernelIN3c104HalfEfEEvlPKT_lllllllllllllPS4_
[0 - 20002647f8b0]    2.282884 {4}{gpu}: duplicate registration of function _ZN2at6native13col2im_kernelIffEEvlPKT_lllllllllllllPS2_
[0 - 20002647f8b0]    2.282890 {4}{gpu}: duplicate registration of function _ZN2at6native13col2im_kernelIddEEvlPKT_lllllllllllllPS2_
[0 - 20002647f8b0]    2.343492 {4}{gpu}: duplicate registration of function _ZN14at_cuda_detail3cub11EmptyKernelIvEEvv
[0 - 20002647f8b0]    2.344639 {4}{gpu}: duplicate registration of function _ZN14at_cuda_detail3cub11EmptyKernelIvEEvv
[0 - 20002647f8b0]    2.345941 {4}{gpu}: duplicate registration of function _ZN14at_cuda_detail3cub21DeviceScanByKeyKernelINS0_21DeviceScanByKeyPolicyIN6thrust16reverse_iteratorIPlEES6_NS0_8NullTypeEE9Policy520ES6_S6_S6_NS0_24ReduceByKeyScanTileStateIliLb1EEENS0_8EqualityENS0_3MaxES7_iEEvT0_T1_T2_T3_iT4_T5_T6_T7_
[0 - 20002647f8b0]    2.345952 {4}{gpu}: duplicate registration of function _ZN14at_cuda_detail3cub21DeviceScanByKeyKernelINS0_21DeviceScanByKeyPolicyIPlNS0_21ConstantInputIteratorIllEENS0_8NullTypeEE9Policy520ES3_S5_S3_NS0_24ReduceByKeyScanTileStateIliLb1EEENS0_8EqualityENS0_3SumES6_iEEvT0_T1_T2_T3_iT4_T5_T6_T7_
[0 - 20002647f8b0]    2.345959 {4}{gpu}: duplicate registration of function _ZN14at_cuda_detail3cub20DeviceScanInitKernelINS0_24ReduceByKeyScanTileStateIliLb1EEEEEvT_i
[0 - 20002647f8b0]    2.345965 {4}{gpu}: duplicate registration of function _ZN14at_cuda_detail3cub21DeviceScanByKeyKernelINS0_21DeviceScanByKeyPolicyIN6thrust16reverse_iteratorIPiEES6_NS0_8NullTypeEE9Policy520ES6_S6_S6_NS0_24ReduceByKeyScanTileStateIiiLb1EEENS0_8EqualityENS0_3MaxES7_iEEvT0_T1_T2_T3_iT4_T5_T6_T7_
[0 - 20002647f8b0]    2.345974 {4}{gpu}: duplicate registration of function _ZN14at_cuda_detail3cub21DeviceScanByKeyKernelINS0_21DeviceScanByKeyPolicyIPiNS0_21ConstantInputIteratorIilEENS0_8NullTypeEE9Policy520ES3_S5_S3_NS0_24ReduceByKeyScanTileStateIiiLb1EEENS0_8EqualityENS0_3SumES6_iEEvT0_T1_T2_T3_iT4_T5_T6_T7_
[0 - 20002647f8b0]    2.345981 {4}{gpu}: duplicate registration of function _ZN14at_cuda_detail3cub20DeviceScanInitKernelINS0_24ReduceByKeyScanTileStateIiiLb1EEEEEvT_i
[0 - 20002647f8b0]    2.345987 {4}{gpu}: duplicate registration of function _ZN14at_cuda_detail3cub11EmptyKernelIvEEvv
[0 - 20002647f8b0]    2.393659 {4}{gpu}: duplicate registration of function _ZN2at6native13im2col_kernelIN3c104HalfEEEvlPKT_llllllllllllPS4_
[0 - 20002647f8b0]    2.393689 {4}{gpu}: duplicate registration of function _ZN2at6native13im2col_kernelIfEEvlPKT_llllllllllllPS2_
[0 - 20002647f8b0]    2.393695 {4}{gpu}: duplicate registration of function _ZN2at6native13im2col_kernelIdEEvlPKT_llllllllllllPS2_
[0 - 20002647f8b0]    2.467999 {4}{gpu}: duplicate registration of function _ZN14at_cuda_detail3cub11EmptyKernelIvEEvv
[0 - 20002647f8b0]    2.507004 {4}{gpu}: duplicate registration of function _ZN2at6native13im2col_kernelIN3c108BFloat16EEEvlPKT_llllllllllllPS4_
[0 - 20002647f8b0]    2.507028 {4}{gpu}: duplicate registration of function _ZN2at6native13im2col_kernelIN3c104HalfEEEvlPKT_llllllllllllPS4_
[0 - 20002647f8b0]    2.507034 {4}{gpu}: duplicate registration of function _ZN2at6native13im2col_kernelIfEEvlPKT_llllllllllllPS2_
[0 - 20002647f8b0]    2.507040 {4}{gpu}: duplicate registration of function _ZN2at6native13im2col_kernelIdEEvlPKT_llllllllllllPS2_
[0 - 20002647f8b0]    2.507046 {4}{gpu}: duplicate registration of function _ZN2at6native13col2im_kernelIN3c108BFloat16EfEEvlPKT_lllllllllllllPS4_
[0 - 20002647f8b0]    2.507053 {4}{gpu}: duplicate registration of function _ZN2at6native13col2im_kernelIN3c104HalfEfEEvlPKT_lllllllllllllPS4_
[0 - 20002647f8b0]    2.507082 {4}{gpu}: duplicate registration of function _ZN2at6native13col2im_kernelIffEEvlPKT_lllllllllllllPS2_
[0 - 20002647f8b0]    2.507090 {4}{gpu}: duplicate registration of function _ZN2at6native13col2im_kernelIddEEvlPKT_lllllllllllllPS2_
[0 - 20002647f8b0]    2.508191 {4}{gpu}: duplicate registration of function _ZN2at6native13im2col_kernelIN3c108BFloat16EEEvlPKT_llllllllllllPS4_
[0 - 20002647f8b0]    2.508200 {4}{gpu}: duplicate registration of function _ZN2at6native14vol2col_kernelIN3c108BFloat16EEEviPKT_iiiiiiiiiiiiiiiiiiPS4_
[0 - 20002647f8b0]    2.508208 {4}{gpu}: duplicate registration of function _ZN2at6native13im2col_kernelIN3c104HalfEEEvlPKT_llllllllllllPS4_
[0 - 20002647f8b0]    2.508213 {4}{gpu}: duplicate registration of function _ZN2at6native14vol2col_kernelIN3c104HalfEEEviPKT_iiiiiiiiiiiiiiiiiiPS4_
[0 - 20002647f8b0]    2.508221 {4}{gpu}: duplicate registration of function _ZN2at6native13col2im_kernelIffEEvlPKT_lllllllllllllPS2_
[0 - 20002647f8b0]    2.508243 {4}{gpu}: duplicate registration of function _ZN2at6native13vol2im_kernelIffEEvjPKT_jjjjjjjjjjjjjjjjjjjPS2_
[0 - 20002647f8b0]    2.508264 {4}{gpu}: duplicate registration of function _ZN2at6native13im2col_kernelIfEEvlPKT_llllllllllllPS2_
[0 - 20002647f8b0]    2.508270 {4}{gpu}: duplicate registration of function _ZN2at6native14vol2col_kernelIfEEviPKT_iiiiiiiiiiiiiiiiiiPS2_
[0 - 20002647f8b0]    2.508295 {4}{gpu}: duplicate registration of function _ZN2at6native13col2im_kernelIddEEvlPKT_lllllllllllllPS2_
[0 - 20002647f8b0]    2.508316 {4}{gpu}: duplicate registration of function _ZN2at6native13vol2im_kernelIddEEvjPKT_jjjjjjjjjjjjjjjjjjjPS2_
[0 - 20002647f8b0]    2.508336 {4}{gpu}: duplicate registration of function _ZN2at6native13im2col_kernelIdEEvlPKT_llllllllllllPS2_
[0 - 20002647f8b0]    2.508356 {4}{gpu}: duplicate registration of function _ZN2at6native14vol2col_kernelIdEEviPKT_iiiiiiiiiiiiiiiiiiPS2_
[0 - 20002647f8b0]    2.510665 {4}{gpu}: duplicate registration of function _ZN14at_cuda_detail3cub11EmptyKernelIvEEvv

Obviously the warnings don't occur if I remove the import torch. They do not seem to prevent the script from running, though, so I am wondering if I can simply disregard them or if this is something that should be addressed.

Edit: I realized this issue was already reported a long time ago (#824). @eddy16112 @streichler I know that this issue is old, but were you able to identify the source of the problem at that time?

elliottslaughter commented 1 year ago

Demangled:

Demangled output from previous comment:

```
[0 - 20002647f8b0] 2.282787 {4}{gpu}: duplicate registration of function void at::native::col2im_kernel<c10::Half, float>(long, c10::Half const*, long, long, long, long, long, long, long, long, long, long, long, long, long, c10::Half*)
[0 - 20002647f8b0] 2.282884 {4}{gpu}: duplicate registration of function void at::native::col2im_kernel<float, float>(long, float const*, long, long, long, long, long, long, long, long, long, long, long, long, long, float*)
[0 - 20002647f8b0] 2.282890 {4}{gpu}: duplicate registration of function void at::native::col2im_kernel<double, double>(long, double const*, long, long, long, long, long, long, long, long, long, long, long, long, long, double*)
[0 - 20002647f8b0] 2.343492 {4}{gpu}: duplicate registration of function void at_cuda_detail::cub::EmptyKernel<void>()
[0 - 20002647f8b0] 2.344639 {4}{gpu}: duplicate registration of function void at_cuda_detail::cub::EmptyKernel<void>()
[0 - 20002647f8b0] 2.345941 {4}{gpu}: duplicate registration of function void at_cuda_detail::cub::DeviceScanByKeyKernel<at_cuda_detail::cub::DeviceScanByKeyPolicy<thrust::reverse_iterator<long*>, thrust::reverse_iterator<long*>, at_cuda_detail::cub::NullType>::Policy520, thrust::reverse_iterator<long*>, thrust::reverse_iterator<long*>, thrust::reverse_iterator<long*>, at_cuda_detail::cub::ReduceByKeyScanTileState<long, int, true>, at_cuda_detail::cub::Equality, at_cuda_detail::cub::Max, at_cuda_detail::cub::NullType, int>(thrust::reverse_iterator<long*>, thrust::reverse_iterator<long*>, thrust::reverse_iterator<long*>, at_cuda_detail::cub::ReduceByKeyScanTileState<long, int, true>, int, at_cuda_detail::cub::Equality, at_cuda_detail::cub::Max, at_cuda_detail::cub::NullType, int)
[0 - 20002647f8b0] 2.345952 {4}{gpu}: duplicate registration of function void at_cuda_detail::cub::DeviceScanByKeyKernel<at_cuda_detail::cub::DeviceScanByKeyPolicy<long*, at_cuda_detail::cub::ConstantInputIterator<long, long>, at_cuda_detail::cub::NullType>::Policy520, long*, at_cuda_detail::cub::ConstantInputIterator<long, long>, long*, at_cuda_detail::cub::ReduceByKeyScanTileState<long, int, true>, at_cuda_detail::cub::Equality, at_cuda_detail::cub::Sum, at_cuda_detail::cub::NullType, int>(long*, at_cuda_detail::cub::ConstantInputIterator<long, long>, long*, at_cuda_detail::cub::ReduceByKeyScanTileState<long, int, true>, int, at_cuda_detail::cub::Equality, at_cuda_detail::cub::Sum, at_cuda_detail::cub::NullType, int)
[0 - 20002647f8b0] 2.345959 {4}{gpu}: duplicate registration of function void at_cuda_detail::cub::DeviceScanInitKernel<at_cuda_detail::cub::ReduceByKeyScanTileState<long, int, true> >(at_cuda_detail::cub::ReduceByKeyScanTileState<long, int, true>, int)
[0 - 20002647f8b0] 2.345965 {4}{gpu}: duplicate registration of function void at_cuda_detail::cub::DeviceScanByKeyKernel<at_cuda_detail::cub::DeviceScanByKeyPolicy<thrust::reverse_iterator<int*>, thrust::reverse_iterator<int*>, at_cuda_detail::cub::NullType>::Policy520, thrust::reverse_iterator<int*>, thrust::reverse_iterator<int*>, thrust::reverse_iterator<int*>, at_cuda_detail::cub::ReduceByKeyScanTileState<int, int, true>, at_cuda_detail::cub::Equality, at_cuda_detail::cub::Max, at_cuda_detail::cub::NullType, int>(thrust::reverse_iterator<int*>, thrust::reverse_iterator<int*>, thrust::reverse_iterator<int*>, at_cuda_detail::cub::ReduceByKeyScanTileState<int, int, true>, int, at_cuda_detail::cub::Equality, at_cuda_detail::cub::Max, at_cuda_detail::cub::NullType, int)
[0 - 20002647f8b0] 2.345974 {4}{gpu}: duplicate registration of function void at_cuda_detail::cub::DeviceScanByKeyKernel<at_cuda_detail::cub::DeviceScanByKeyPolicy<int*, at_cuda_detail::cub::ConstantInputIterator<int, long>, at_cuda_detail::cub::NullType>::Policy520, int*, at_cuda_detail::cub::ConstantInputIterator<int, long>, int*, at_cuda_detail::cub::ReduceByKeyScanTileState<int, int, true>, at_cuda_detail::cub::Equality, at_cuda_detail::cub::Sum, at_cuda_detail::cub::NullType, int>(int*, at_cuda_detail::cub::ConstantInputIterator<int, long>, int*, at_cuda_detail::cub::ReduceByKeyScanTileState<int, int, true>, int, at_cuda_detail::cub::Equality, at_cuda_detail::cub::Sum, at_cuda_detail::cub::NullType, int)
[0 - 20002647f8b0] 2.345981 {4}{gpu}: duplicate registration of function void at_cuda_detail::cub::DeviceScanInitKernel<at_cuda_detail::cub::ReduceByKeyScanTileState<int, int, true> >(at_cuda_detail::cub::ReduceByKeyScanTileState<int, int, true>, int)
[0 - 20002647f8b0] 2.345987 {4}{gpu}: duplicate registration of function void at_cuda_detail::cub::EmptyKernel<void>()
[0 - 20002647f8b0] 2.393659 {4}{gpu}: duplicate registration of function void at::native::im2col_kernel<c10::Half>(long, c10::Half const*, long, long, long, long, long, long, long, long, long, long, long, long, c10::Half*)
[0 - 20002647f8b0] 2.393689 {4}{gpu}: duplicate registration of function void at::native::im2col_kernel<float>(long, float const*, long, long, long, long, long, long, long, long, long, long, long, long, float*)
[0 - 20002647f8b0] 2.393695 {4}{gpu}: duplicate registration of function void at::native::im2col_kernel<double>(long, double const*, long, long, long, long, long, long, long, long, long, long, long, long, double*)
[0 - 20002647f8b0] 2.467999 {4}{gpu}: duplicate registration of function void at_cuda_detail::cub::EmptyKernel<void>()
[0 - 20002647f8b0] 2.507004 {4}{gpu}: duplicate registration of function void at::native::im2col_kernel<c10::BFloat16>(long, c10::BFloat16 const*, long, long, long, long, long, long, long, long, long, long, long, long, c10::BFloat16*)
[0 - 20002647f8b0] 2.507028 {4}{gpu}: duplicate registration of function void at::native::im2col_kernel<c10::Half>(long, c10::Half const*, long, long, long, long, long, long, long, long, long, long, long, long, c10::Half*)
[0 - 20002647f8b0] 2.507034 {4}{gpu}: duplicate registration of function void at::native::im2col_kernel<float>(long, float const*, long, long, long, long, long, long, long, long, long, long, long, long, float*)
[0 - 20002647f8b0] 2.507040 {4}{gpu}: duplicate registration of function void at::native::im2col_kernel<double>(long, double const*, long, long, long, long, long, long, long, long, long, long, long, long, double*)
[0 - 20002647f8b0] 2.507046 {4}{gpu}: duplicate registration of function void at::native::col2im_kernel<c10::BFloat16, float>(long, c10::BFloat16 const*, long, long, long, long, long, long, long, long, long, long, long, long, long, c10::BFloat16*)
[0 - 20002647f8b0] 2.507053 {4}{gpu}: duplicate registration of function void at::native::col2im_kernel<c10::Half, float>(long, c10::Half const*, long, long, long, long, long, long, long, long, long, long, long, long, long, c10::Half*)
[0 - 20002647f8b0] 2.507082 {4}{gpu}: duplicate registration of function void at::native::col2im_kernel<float, float>(long, float const*, long, long, long, long, long, long, long, long, long, long, long, long, long, float*)
[0 - 20002647f8b0] 2.507090 {4}{gpu}: duplicate registration of function void at::native::col2im_kernel<double, double>(long, double const*, long, long, long, long, long, long, long, long, long, long, long, long, long, double*)
[0 - 20002647f8b0] 2.508191 {4}{gpu}: duplicate registration of function void at::native::im2col_kernel<c10::BFloat16>(long, c10::BFloat16 const*, long, long, long, long, long, long, long, long, long, long, long, long, c10::BFloat16*)
[0 - 20002647f8b0] 2.508200 {4}{gpu}: duplicate registration of function void at::native::vol2col_kernel<c10::BFloat16>(int, c10::BFloat16 const*, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, c10::BFloat16*)
[0 - 20002647f8b0] 2.508208 {4}{gpu}: duplicate registration of function void at::native::im2col_kernel<c10::Half>(long, c10::Half const*, long, long, long, long, long, long, long, long, long, long, long, long, c10::Half*)
[0 - 20002647f8b0] 2.508213 {4}{gpu}: duplicate registration of function void at::native::vol2col_kernel<c10::Half>(int, c10::Half const*, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, c10::Half*)
[0 - 20002647f8b0] 2.508221 {4}{gpu}: duplicate registration of function void at::native::col2im_kernel<float, float>(long, float const*, long, long, long, long, long, long, long, long, long, long, long, long, long, float*)
[0 - 20002647f8b0] 2.508243 {4}{gpu}: duplicate registration of function void at::native::vol2im_kernel<float, float>(unsigned int, float const*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, float*)
[0 - 20002647f8b0] 2.508264 {4}{gpu}: duplicate registration of function void at::native::im2col_kernel<float>(long, float const*, long, long, long, long, long, long, long, long, long, long, long, long, float*)
[0 - 20002647f8b0] 2.508270 {4}{gpu}: duplicate registration of function void at::native::vol2col_kernel<float>(int, float const*, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, float*)
[0 - 20002647f8b0] 2.508295 {4}{gpu}: duplicate registration of function void at::native::col2im_kernel<double, double>(long, double const*, long, long, long, long, long, long, long, long, long, long, long, long, long, double*)
[0 - 20002647f8b0] 2.508316 {4}{gpu}: duplicate registration of function void at::native::vol2im_kernel<double, double>(unsigned int, double const*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, double*)
[0 - 20002647f8b0] 2.508336 {4}{gpu}: duplicate registration of function void at::native::im2col_kernel<double>(long, double const*, long, long, long, long, long, long, long, long, long, long, long, long, double*)
[0 - 20002647f8b0] 2.508356 {4}{gpu}: duplicate registration of function void at::native::vol2col_kernel<double>(int, double const*, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, double*)
[0 - 20002647f8b0] 2.510665 {4}{gpu}: duplicate registration of function void at_cuda_detail::cub::EmptyKernel<void>()
```
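
(For anyone who wants to reproduce the demangling: piping the log through c++filt works, or you can use a small Itanium-ABI helper along these lines. This is purely illustrative and not part of Legion.)

```
// Illustrative helper (not part of Legion): demangle symbol names from the
// log via the Itanium C++ ABI, equivalent to piping them through c++filt.
#include <cstdio>
#include <cstdlib>
#include <cxxabi.h>

int main(int argc, char **argv)
{
  for (int i = 1; i < argc; i++) {
    int status = 0;
    // __cxa_demangle returns a malloc'd string on success (nullptr otherwise)
    char *demangled = abi::__cxa_demangle(argv[i], nullptr, nullptr, &status);
    std::printf("%s\n", (status == 0 && demangled) ? demangled : argv[i]);
    std::free(demangled);
  }
  return 0;
}
```

For example, passing _ZN14at_cuda_detail3cub11EmptyKernelIvEEvv prints void at_cuda_detail::cub::EmptyKernel<void>().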

I don't know what the at namespace is, but I assume it's part of PyTorch. This seems slightly different from #824, which reported duplicates in Thrust.

I'm not sure where to go for debugging this one; I would appreciate a response from someone who knows the Realm CUDA module.

CharlelieLrt commented 1 year ago

I believe the at namespace is PyTorch's ATen tensor library.

eddy16112 commented 1 year ago

You can try either using the CPU-only PyTorch build, or disabling the CUDA hijack by setting -DLegion_USE_CUDART_HIJACK=OFF

elliottslaughter commented 1 year ago

FYI, we may need to do something to make Regent work with the hijack disabled (assuming this is a combined Regent+Pygion code). I'll need @magnatelee to remind me of the details.

CharlelieLrt commented 1 year ago

@eddy16112 thanks for the suggestions. I have been using CPU-only PyTorch in this context quite a lot and it works just fine... but CPU-only is now insufficient for what I want to do.

@elliottslaughter yes, it is a Regent + Pygion code.

eddy16112 commented 1 year ago

@elliottslaughter we can use these two calls to get a CUDA stream for launching tasks: https://gitlab.com/StanfordLegion/legion/-/blob/master/test/realm/task_stream.cc#L51-52. It should avoid the context sync that is otherwise used when the CUDA hijack is disabled.
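
For illustration, here is a minimal sketch of how a Realm GPU task would use those two calls (loosely adapted from the linked test; it assumes a CUDA-enabled build in which the Realm::Cuda::set_task_ctxsync_required and Realm::Cuda::get_task_cuda_stream helpers are available):

```
// Sketch: a Realm GPU task opts out of the end-of-task context sync and
// launches its work on the stream that Realm manages for the task.
#include <cstdio>
#include <cuda_runtime.h>
#include "realm.h"
#include "realm/cuda/cuda_module.h"

enum { GPU_TASK = Realm::Processor::TASK_ID_FIRST_AVAILABLE };

void gpu_task(const void *args, size_t arglen,
              const void *userdata, size_t userlen, Realm::Processor p)
{
  // Promise Realm that all work goes onto the task stream, so no
  // context-wide synchronization is needed when the task returns.
  Realm::Cuda::set_task_ctxsync_required(false);
  // The CUDA stream Realm created for this task; work launched on it is
  // tracked by Realm without a device-wide sync.
  cudaStream_t stream = Realm::Cuda::get_task_cuda_stream();
  std::printf("task stream = %p\n", static_cast<void *>(stream));
  // my_kernel<<<grid, block, 0, stream>>>(...);
}

int main(int argc, char **argv)
{
  Realm::Runtime rt;
  rt.init(&argc, &argv);
  rt.register_task(GPU_TASK, gpu_task);

  // Run the task on the first GPU processor, then shut down.
  Realm::Processor p =
      Realm::Machine::ProcessorQuery(Realm::Machine::get_machine())
          .only_kind(Realm::Processor::TOC_PROC)
          .first();
  rt.shutdown(p.spawn(GPU_TASK, /*args=*/nullptr, /*arglen=*/0));
  return rt.wait_for_shutdown();
}
```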

CharlelieLrt commented 1 year ago

Just a clarification (in case it matters): I do not execute any PyTorch kernel on GPUs shared with Legion. PyTorch kernels are only executed on dedicated GPUs (via the -ll:skipgpus option)

elliottslaughter commented 1 year ago

You can go ahead and try Regent without the hijack if you want. I don't remember exactly what the issue is, but I am guessing it will break, and for reasons unrelated to GPU usage / -ll:skipgpus. But without trying it, we won't know.

CharlelieLrt commented 1 year ago

OK, I'll try without the hijack. Is it just a runtime flag, or do I need to rebuild Legion with the hijack disabled?

eddy16112 commented 1 year ago

You have to rebuild Legion. If you are using CMake, you can disable the hijack by setting -DLegion_HIJACK_CUDART=OFF

CharlelieLrt commented 1 year ago

I made a slightly different reproducer to reflect the fact that the code giving these warnings is a Regent + Pygion code (and also to test the effect of disabling the CUDA hijack on the Regent part of the code).

main.rg:

import 'regent'

local nb_gpus = 3 
local format = require("std/format")

__demand(__leaf)
extern task ReductionPytorch(input: region(ispace(int1d), double)): double
where
   reads(input)
end
ReductionPytorch:set_task_id(9000)

__demand(__cuda, __leaf)
task ReductionRegent(input: region(ispace(int1d), double),
                     output: region(ispace(int1d), double))
where
   reads(input),
   reads writes(output)
do
   __demand(__openmp)
   for c in input do
      output[output.bounds.lo] += input[c]
   end 
end

__demand(__leaf)
task checkSum(output: region(ispace(int1d), double))
where
   reads(output)
do
   for c in output do
      format.println("Regent sum at {} = {}", c, output[c])
   end 
end

__demand(__inner)
__forbid(__replicable)
task main()

   var points_input_Rg = ispace(int1d, {32 * nb_gpus})
   var points_output_Rg = ispace(int1d, {nb_gpus})
   var tiles_Rg = ispace(int1d, {nb_gpus})
   var input_Rg = region(points_input_Rg, double)
   var output_Rg = region(points_output_Rg, double)
   var tiles_input_Rg = partition(equal, input_Rg, tiles_Rg)
   var tiles_output_Rg = partition(equal, output_Rg, tiles_Rg)

   var points_input_Py = ispace(int1d, {32})
   var input_Py = region(points_input_Py, double)
   var output_Py: double

   __demand(__index_launch)
   for c in tiles_Rg do
      fill((tiles_input_Rg[c]), 1.0)
   end 

   __demand(__index_launch)
   for c in tiles_Rg do
      fill((tiles_output_Rg[c]), 0.0)
   end 

   fill(input_Py, 1.0)

   __demand(__index_launch)
   for c in tiles_Rg do
      ReductionRegent(tiles_input_Rg[c], tiles_output_Rg[c])
   end 

   output_Py = ReductionPytorch(input_Py)

   __demand(__index_launch)
   for c in tiles_Rg  do  
      checkSum(tiles_output_Rg[c])
   end 

end

regentlib.start(main)

python_main.py:

from pygion import task, R, Region, float64
import torch

@task(task_id=9000,
      argument_types=[Region],
      privileges=[R],
      return_type=float64,
      leaf=True,
      calling_convention='regent')
def ReductionPytorch(input_Py):
    device = torch.device("cuda:0")
    data = torch.from_numpy(input_Py.__value)
    data = data.to(device)
    s = torch.sum(data)
    output_Py = s.item()
    print(f"Pytorch sum = {output_Py}", flush=True)
    return output_Py

I ran the reproducer on a single node of a machine with 4 GPUs per node, with: regent.py main.rg -ll:py 1 -ll:pyimport python_main -ll:cpu 1 -ll:gpu 3 -cuda:skipgpus 1

1. With the CUDA hijack enabled: I get all the duplicate registration warnings above, and then the Python part of the code fails with:

[0 - 20004276f8b0]   12.134474 {6}{cudart}: cudaGetDevice() called outside CUDA task

2. With the CUDA hijack disabled: the warnings above do not occur, and the code completes successfully.

Is point 2 above expected? I thought Regent would not work without the CUDA hijack?

elliottslaughter commented 1 year ago

It's been a while since I looked, so I suppose Regent works without the hijack now. Good to know.

I think there is still a potential performance issue around setting the correct stream, so let's keep this open for now. But it should (hopefully) be usable for you.