Closed — mmigdal-nv closed this 1 year ago
In the case of matmuls, this happens to fix the cases where `nvfuser_index_t` is stale, as we don't recompile even if we compute the right size in that case.

As I mentioned to @mmigdal-nv, I think the fix of this PR is sufficient. As long as a fusion is executed through `FusionExecutorCache`, we should not see back-and-forth recompilations due to index mode changes. The only request I have for @mmigdal-nv is to add a simple C++ test that verifies this behavior. https://github.com/csarofeen/pytorch/pull/2522#discussion_r1119341798
Fixed issues:

- Recompilation when `KernelArgumentHolder`'s indexing mode changes.
- `kernelName()` is changed so we can use `KernelDb` with the key `kernel_code_`. Currently `KernelDb` ignores the wrapped code (`#define`s, runtime library, ...) and relies only on the kernel. Without changing the kernel name we would be getting back the wrong cubins.

Improvements:
- `KernelArgumentHolder` is no longer updated retroactively.
- The `-1` in `collectIndexMode` is misleading. In the case of a 1D tensor, having a type that can hold the tensor's index is not enough: we need to be able to hold the bound itself (so we can compare the index to the bound without overflow).

Changes:
- `cparams.index_type` is not set to `DataType::Index`, so the kernel can be lowered once and we update/set `nvfuser_index_t` afterwards, as required.