Closed: ftxj closed this 1 year ago
Thanks for the PR. The implications of the changes look rather significant, and they don't seem to align well with the current system. This is not your fault at all, but it clearly exposes one of the major limitations of nvFuser. While I'm glad to see you were able to work around the limitation, even for a limited case, I strongly feel we need to build something more fundamental.
Let me try to see if the change could be minimized.
Actually, while I don't see any test failure, compute-sanitizer shows write errors. Run the test as:
PYTORCH_NVFUSER_DUMP=debug_info PYTORCH_NO_CUDA_MEMORY_CACHING=1 compute-sanitizer ./build/bin/nvfuser_tests --gtest_filter="*FusionScatter2DZerosSelfTvFusion_CUDA"
PRINTING: __tmp_kernel1.cu
check cg_outputs
========= Invalid __global__ write of size 4 bytes
========= at 0x290 in /raid/nmaruyama/debug1/__tmp_kernel1.cu:9192:CudaCodeGen::kernel1(CudaCodeGen::Tensor<float, (int)2>, CudaCodeGen::Tensor<long long, (int)2>, CudaCodeGen::Tensor<long long, (int)2>, CudaCodeGen::Tensor<float, (int)2>, CudaCodeGen::Tensor<float, (int)2>)
========= by thread (96,0,0) in block (0,0,0)
========= Address 0x7fdbda600f80 is out of bounds
========= and is 337 bytes after the nearest allocation at 0x7fdbda600e00 of size 48 bytes
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame: [0x304e32]
========= in /usr/lib/x86_64-linux-gnu/libcuda.so.1
Note that the error happens with no modification, on an A100. It seems there's something wrong, probably with the predicates?
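To make the suspicion concrete, here is a minimal hand-written sketch, not the actual generated kernel1, of how a scatter predicate derived from the wrong extent can produce exactly this kind of invalid global write (all names here are made up for illustration):

```cpp
// Hypothetical sketch only, not __tmp_kernel1.cu. If the generated
// predicate guards the scatter write with the extent of output_tv
// instead of index_tv, threads beyond index_tv's extent still read
// index[i] (garbage) and write out[index[i]] out of bounds.
__global__ void scatter_sketch(float* out, const long long* index,
                               const float* src,
                               int output_extent, int index_extent) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < output_extent) {     // wrong guard; should be i < index_extent
    out[index[i]] = src[i];    // invalid write when i >= index_extent
  }
}
```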
I will try to fix it.
After a long time, I have fixed this bug.
Since we haven't started reviewing this PR fully yet, can you move this to the new repository? github.com/NVIDIA/Fuser
Closing this one in favor of https://github.com/NVIDIA/Fuser/pull/89
This PR supports codegen for ScatterOp when inputTv and indexTv have different sizes.

The index lowering and predicate generation passes behave differently for ScatterOp than for other ops. For example, other ops generate the loop bounds and predicate starts from the extents of output_tv, but for ScatterOp we need to use the extents of index_tv (see the sketch below). To do this, we modify compute_at_map so that the concrete_id of an output_tv iteration domain is the corresponding index_tv iteration domain.
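A hand-written illustration of the difference, not nvFuser-generated code, with made-up names (output_extent, index_extent, etc.):

```cpp
// A normal pointwise op: the loop bound and the predicate both come
// from the extent of output_tv.
for (int i = 0; i < output_extent; ++i) {
  if (i < output_extent) {
    out[i] = in[i] + 1.0f;
  }
}

// ScatterOp: the iteration space is index_tv's domain, so the loop
// bound and the predicate must come from index_tv, not output_tv.
for (int i = 0; i < index_extent; ++i) {
  if (i < index_extent) {
    out[index[i]] = src[i];
  }
}
```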
Because IndexCompute is very complicated, the implementation of the above logic is quite hacky, and it's not a good way to do this. I think we may need a more general method to deal with different shapes across tensor views; I look forward to the reviewers' views on this point.
Another thing is that ScatterOp needs to initialize output_tv from input_tv, so we add a method getTensorForFillAnotherTensor to do this. A sketch of the intended semantics follows.
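A sketch of why the initialization is needed, assuming torch.scatter-like semantics where the output starts as a copy of the input and only the indexed positions are overwritten; this is not the actual implementation of getTensorForFillAnotherTensor:

```cpp
// Fill output_tv from input_tv first, so that elements not named by
// index_tv keep the input's values rather than staying uninitialized.
for (int i = 0; i < output_extent; ++i) {
  out[i] = in[i];
}
// Then scatter the source values through index_tv.
for (int i = 0; i < index_extent; ++i) {
  out[index[i]] = src[i];
}
```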