scxiao opened this issue 2 years ago
So we don't need to optimize the index calculations, since the JIT will do that very efficiently due to the lens and strides being constant. There can be a speed bump just from switching to JIT for contiguous; see the timings below for an input with lens {1024, 12, 128, 64} and strides {98304, 64, 8192, 1}:
```
gpu::code_object[code_object=13736,symbol_name=kernel,global=614400,local=1024,]: 2.43447ms
gpu::contiguous: 1.41236ms
```
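To illustrate why the constant lens and strides matter, here is a minimal sketch (not MIGraphX's actual JIT output; `transposed_offset` and the template parameters are hypothetical) of the index arithmetic with the shape baked in as compile-time constants, which lets the compiler strength-reduce the divisions and modulos:

```cpp
// Hypothetical sketch: the lens (L0..L3) and strides (S0..S3) are template
// parameters, so the specialized kernel sees them as constants and the
// divisions/modulos below can be folded into cheap shifts and multiplies.
template <int L0, int L1, int L2, int L3, int S0, int S1, int S2, int S3>
__device__ int transposed_offset(int linear_idx)
{
    // Decompose the contiguous output index into 4-D coordinates...
    int i3 = linear_idx % L3;
    int t  = linear_idx / L3;
    int i2 = t % L2;
    t      = t / L2;
    int i1 = t % L1;
    int i0 = t / L1;
    // ...and map them through the (constant) input strides.
    return i0 * S0 + i1 * S1 + i2 * S2 + i3 * S3;
}

// Instantiated for the shape quoted above:
//   transposed_offset<1024, 12, 128, 64, 98304, 64, 8192, 1>(idx)
```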
We can probably get a speed bump by doing tiling like in the paper, which is similar to the preload we use for broadcast shapes, except we won't be copying the entire tensor to LDS. And we can apply this optimization to all pointwise operators, not just contiguous.
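For reference, here is a minimal sketch of the tiled-transpose idea, written as a standalone CUDA-style kernel for a plain 2-D transpose (`transpose_tiled` and `TILE` are illustrative names, not MIGraphX's API; LDS corresponds to `__shared__` memory here). The point is that staging a tile through LDS makes both the global reads and the global writes coalesced; a real implementation would generalize this to arbitrary permutations and fuse it with other pointwise ops:

```cpp
#define TILE 32

// out is the cols x rows transpose of the rows x cols matrix in.
// Launch with blockDim = (TILE, TILE) and
// gridDim = ((cols + TILE - 1) / TILE, (rows + TILE - 1) / TILE).
__global__ void transpose_tiled(const float* in, float* out, int rows, int cols)
{
    // +1 padding avoids shared-memory (LDS) bank conflicts on the transposed reads.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x; // column in the input
    int y = blockIdx.y * TILE + threadIdx.y; // row in the input
    if(x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x]; // coalesced read

    __syncthreads();

    // Swap the block coordinates so writes to the transposed output are coalesced.
    x = blockIdx.y * TILE + threadIdx.x; // column in the output (= row in the input)
    y = blockIdx.x * TILE + threadIdx.y; // row in the output (= column in the input)
    if(x < rows && y < cols)
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y]; // coalesced write
}
```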
We are optimizing BERT performance now, and one important aspect is the contiguous op. Our current implementation of contiguous for transposed input shapes uses a straightforward approach, but its performance is poor at the moment. In the BERT model, contiguous takes about 12% of the total time, even though it only changes the memory layout of tensors. Looking deeper, contiguous is mainly used to transpose tensors, so we need a faster implementation of contiguous for transpose.
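For context, a rough sketch of that straightforward approach (simplified to 2-D; `contiguous_naive` is hypothetical, not the actual MIGraphX kernel): each thread writes one output element by mapping its linear index through the transposed input strides, so the writes are coalesced but the reads are strided, which is why it is slow:

```cpp
// Naive contiguous/transpose: one thread per output element.
// in holds the data in transposed (column-major) order; out is row-major.
__global__ void contiguous_naive(const float* in, float* out, int rows, int cols)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if(idx >= rows * cols)
        return;
    int r = idx / cols;
    int c = idx % cols;
    // The write to out[idx] is coalesced, but consecutive threads read
    // addresses that are `rows` elements apart, so the reads are not.
    out[idx] = in[c * rows + r];
}
```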
From searching online, one reference we can use is https://github.com/ap-hynninen/cutt, with the corresponding paper: https://arxiv.org/pdf/1705.01598.pdf