csarofeen / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
http://pytorch.org

[ScatterOp 4] ScatterOp IndexLowering (different shape case) #2523

Closed ftxj closed 1 year ago

ftxj commented 1 year ago

This PR supports codegen for ScatterOp when the inputTv and indexTv have different sizes.

The index lowering and predicate generation passes behave differently for ScatterOp than for other ops.

For example, other ops generate loop bounds and predicates starting from the extents of the output_tv, but for ScatterOp we need to use the extents of the index_tv. To do this, we modify the compute_at_map so that the concrete_id of each output_tv iteration domain is the corresponding index_tv iteration domain.
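A rough sketch of the intent (the helper name, accessors, and setConcreteId below are hypothetical and for illustration only; they are not the actual nvfuser API):

```cpp
// Hypothetical sketch: make downstream passes (loop-nest and predicate
// generation) resolve each output_tv loop axis to the corresponding
// index_tv axis, so extents and predicates are derived from index_tv.
void mapScatterOutputToIndexDomains(
    TensorView* output_tv,
    TensorView* index_tv,
    ComputeAtMap& ca_map) {
  auto out_ids = output_tv->domain()->domain(); // illustrative accessor
  auto idx_ids = index_tv->domain()->domain();  // illustrative accessor
  for (size_t i = 0; i < out_ids.size(); ++i) {
    // Treat the index_tv IterDomain as the concrete ID of the output axis.
    ca_map.setConcreteId(out_ids[i], idx_ids[i]); // hypothetical method
  }
}
```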

Because IndexCompute is very complicated, the implementation of the above logic is quite a hack, and it's not a good long-term approach. I think we may need a more general method to handle differing shapes across tensor views; I look forward to the reviewers' views on this point.

Another thing is that ScatterOp needs to initialize the output_tv from the input_tv, so we add a method getTensorForFillAnotherTensor to do this.
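For reference, here is a standalone, simplified CUDA sketch (hypothetical, not the nvfuser-generated kernel) of the intended 2D scatter-along-dim-0 semantics, showing both points above: the output is first filled from input_tv, and the scatter iteration/predicate is bounded by the extents of index_tv rather than output_tv:

```cuda
// Step 1: initialize the output as a copy of the input
// (iteration space = output_tv extents).
__global__ void init_output(const float* input, float* output, int n) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < n) {
    output[tid] = input[tid];
  }
}

// Step 2: scatter src into output along dim 0
// (iteration space = index_tv extents).
__global__ void scatter_dim0(
    const long long* index, // [idx_rows, cols]
    const float* src,       // [idx_rows, cols]
    float* output,          // [out_rows, cols]
    int idx_rows,
    int cols) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  // The guard uses the index tensor's extents; using the output's extents
  // here would over-run the smaller index/src buffers.
  if (tid < idx_rows * cols) {
    int i = tid / cols;
    int j = tid % cols;
    long long row = index[i * cols + j]; // target row in the output
    output[row * cols + j] = src[i * cols + j];
  }
}
```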

naoyam commented 1 year ago

Thanks for the PR. The implications of the changes seem rather significant, and they don't align well with the current system. This is not your fault at all; it clearly exposes one of the major limitations of nvfuser. While I'm glad to see you were able to work around the limitation, even for a limited case, I strongly feel we need to build something more fundamental.

Let me try to see if the change could be minimized.

naoyam commented 1 year ago

Actually, while I don't see any test failure, compute-sanitizer shows write errors. Run the test as:

PYTORCH_NVFUSER_DUMP=debug_info PYTORCH_NO_CUDA_MEMORY_CACHING=1 compute-sanitizer ./build/bin/nvfuser_tests --gtest_filter="*FusionScatter2DZerosSelfTvFusion_CUDA" 
PRINTING: __tmp_kernel1.cu
check cg_outputs ========= Invalid __global__ write of size 4 bytes
=========     at 0x290 in /raid/nmaruyama/debug1/__tmp_kernel1.cu:9192:CudaCodeGen::kernel1(CudaCodeGen::Tensor<float, (int)2>, CudaCodeGen::Tensor<long long, (int)2>, CudaCodeGen::Tensor<long long, (int)2>, CudaCodeGen::Tensor<float, (
int)2>, CudaCodeGen::Tensor<float, (int)2>)
=========     by thread (96,0,0) in block (0,0,0)
=========     Address 0x7fdbda600f80 is out of bounds
=========     and is 337 bytes after the nearest allocation at 0x7fdbda600e00 of size 48 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x304e32]
=========                in /usr/lib/x86_64-linux-gnu/libcuda.so.1

Note the error happens with no modification on A100. It seems there's something wrong, probably predicates?
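To make the suspicion concrete, here is a minimal illustration (hypothetical and simplified; not the kernel generated by this PR) of how a predicate derived from the wrong tensor's extent could cause exactly this kind of out-of-bounds store:

```cuda
// idx_n = number of elements of index_tv/src; for a scatter, the number of
// elements of output_tv can be larger than idx_n.
__global__ void scatter_write(const long long* index, const float* src,
                              float* output, int idx_n, int cols) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  // Buggy variant: guarding with the output extent (out_rows * cols) lets
  // threads past idx_n read past the end of index/src and store out of bounds.
  // Correct guard: bound the write by the index tensor's extent.
  if (tid < idx_n) {
    int j = tid % cols;
    output[index[tid] * cols + j] = src[tid];
  }
}
```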

ftxj commented 1 year ago

> Actually, while I don't see any test failure, compute-sanitizer shows write errors. [...] Note the error happens with no modification on A100. It seems there's something wrong, probably predicates?

I will try to fix it.

ftxj commented 1 year ago

> Actually, while I don't see any test failure, compute-sanitizer shows write errors. [...] Note the error happens with no modification on A100. It seems there's something wrong, probably predicates?

After a long time, I have fixed this bug.

naoyam commented 1 year ago

Since we haven't started reviewing this PR fully yet, can you move this to the new repository, github.com/NVIDIA/Fuser?

naoyam commented 1 year ago

Closing this one in favor of https://github.com/NVIDIA/Fuser/pull/89