When running the repro below, I get a Warp Misaligned Address Exception:
With the following output in cuda-gdb:
I've tested this code out on two different V100s when building with the TCNN_CUDA_ARCHITECTURES=70 flag and run into this issue. Building and testing this code out on a 3090 with compute 86 seems to work fine. Building against both architectures with TCNN_CUDA_ARCHITECTURES=70;86 causes the code to fail with the same error (referencing mma_tensor_op_tile_iterator_sm70.h) on the 3090.
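For anyone trying to reproduce this, here is a small sketch for checking whether a given architecture string includes the GPU being tested. It assumes the semicolon-separated form used above; the helper names `parse_arch_list` and `covers_device` are hypothetical and not part of the library.

```python
from __future__ import annotations


def parse_arch_list(value: str) -> list[int]:
    """Split a semicolon-separated architecture string such as "70;86".

    Assumption: entries are plain integers (compute capability * 10),
    matching the values quoted in this report.
    """
    return [int(part) for part in value.split(";") if part.strip()]


def covers_device(value: str, device_capability: tuple[int, int]) -> bool:
    """Return True if the (major, minor) capability appears in the list."""
    major, minor = device_capability
    return major * 10 + minor in parse_arch_list(value)


# V100 is compute capability (7, 0); RTX 3090 is (8, 6).
print(covers_device("70", (7, 0)))
print(covers_device("70", (8, 6)))
print(covers_device("70;86", (8, 6)))
```

With PyTorch installed, the capability tuple for the active GPU can be obtained from `torch.cuda.get_device_capability()`. This only confirms which architectures were requested at build time; it says nothing about which kernel path is selected at runtime.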