ROCm / AMDMIGraphX

AMD's graph optimization engine.
https://rocm.docs.amd.com/projects/AMDMIGraphX/en/latest/
MIT License

fast contiguous for tensor transpose #1158

Open scxiao opened 2 years ago

scxiao commented 2 years ago

We are optimizing BERT performance now, and one important aspect is the contiguous op. Our current implementation of contiguous for transposed input shapes uses a straightforward approach, and its performance is poor. In the BERT model, contiguous accounts for about 12% of the total time, yet all it does is change the memory layout of tensors. Looking deeper, contiguous is mainly used to transpose tensors, so we need a faster implementation of contiguous for the transpose case.

From searching online, one implementation we can refer to is https://github.com/ap-hynninen/cutt, and the corresponding paper is https://arxiv.org/pdf/1705.01598.pdf
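For context, here is a rough HIP sketch (illustrative only, not the actual MIGraphX kernel) of what the straightforward contiguous copy does for a transposed 4-D input: every output element is gathered through the input's permuted strides, and for a general transpose one side of the copy ends up strided and poorly coalesced, which is what makes the naive approach slow.

```cpp
// Rough sketch of a naive contiguous copy for a transposed 4-D tensor
// (illustrative only; not the MIGraphX implementation). The lens describe the
// output shape and the strides describe the permuted input layout.
#include <hip/hip_runtime.h>
#include <cstdint>

__global__ void naive_contiguous(const float* in, float* out,
                                 int d0, int d1, int d2, int d3,
                                 int64_t s0, int64_t s1, int64_t s2, int64_t s3)
{
    int64_t i = blockIdx.x * (int64_t)blockDim.x + threadIdx.x;
    int64_t n = (int64_t)d0 * d1 * d2 * d3;
    if(i >= n)
        return;
    // Decompose the linear output index into 4-D coordinates.
    int64_t i3 = i % d3;
    int64_t i2 = (i / d3) % d2;
    int64_t i1 = (i / ((int64_t)d3 * d2)) % d1;
    int64_t i0 = i / ((int64_t)d3 * d2 * d1);
    // Writes are contiguous, but the reads go through the permuted input
    // strides, so for a general transpose one side of the copy is uncoalesced.
    out[i] = in[i0 * s0 + i1 * s1 + i2 * s2 + i3 * s3];
}
```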

pfultz2 commented 2 years ago

So we don't need to optimize the index calculations, since the JIT will do that very efficiently because the lens and strides are constant. There can be a speedup just from switching contiguous to the JIT; see the timings below for a {1024, 12, 128, 64} lens, {98304, 64, 8192, 1} strides input:

gpu::code_object[code_object=13736,symbol_name=kernel,global=614400,local=1024,]: 2.43447ms
gpu::contiguous: 1.41236ms
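To illustrate the point about the JIT (a hedged sketch, not the actual generated source): once the lens and strides are baked into the kernel as compile-time constants, the compiler strength-reduces all of the division/modulo index math, so there is little left to hand-optimize.

```cpp
// Hedged illustration of JIT specialization: with lens and strides as
// compile-time constants (here via template parameters; the real generated
// code differs), the index arithmetic below is fully strength-reduced by the
// compiler instead of using runtime integer division.
#include <hip/hip_runtime.h>
#include <cstdint>

template <int D0, int D1, int D2, int D3,
          int64_t S0, int64_t S1, int64_t S2, int64_t S3>
__global__ void contiguous_jit(const float* in, float* out)
{
    constexpr int64_t n = (int64_t)D0 * D1 * D2 * D3;
    int64_t i = blockIdx.x * (int64_t)blockDim.x + threadIdx.x;
    if(i >= n)
        return;
    int64_t i3 = i % D3;
    int64_t i2 = (i / D3) % D2;
    int64_t i1 = (i / ((int64_t)D3 * D2)) % D1;
    int64_t i0 = i / ((int64_t)D3 * D2 * D1);
    out[i] = in[i0 * S0 + i1 * S1 + i2 * S2 + i3 * S3];
}

// e.g. for the shape above:
// contiguous_jit<1024, 12, 128, 64, 98304, 64, 8192, 1><<<grid, block>>>(in, out);
```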

We can probably get a speedup by doing tiling as in the paper, which is similar to the preload we use for broadcast shapes, except we won't be copying the entire tensor to LDS; a sketch of the tiling idea follows. And we can apply this optimization to all pointwise operators, not just contiguous.
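For the plain 2-D case the tiling looks roughly like the classic shared-memory transpose below (a sketch only; cuTT and the paper generalize this to arbitrary permutations, and a real kernel would be shape-specialized): a TILE x TILE block is staged through LDS so that both the global loads and the global stores are coalesced.

```cpp
// Sketch of the tiling idea for a 2-D transpose (illustrative names and tile
// size). A tile is staged through LDS so both sides of the copy are coalesced.
#include <hip/hip_runtime.h>
#include <cstdint>

constexpr int TILE = 32;

__global__ void tiled_transpose(const float* in, float* out, int rows, int cols)
{
    // +1 padding avoids LDS bank conflicts on the transposed accesses.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if(x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[(int64_t)y * cols + x]; // coalesced load
    __syncthreads();

    // Swap the block coordinates so the store side is also coalesced.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if(x < rows && y < cols)
        out[(int64_t)y * rows + x] = tile[threadIdx.x][threadIdx.y]; // coalesced store
}

// Launched with dim3 block(TILE, TILE) and
// dim3 grid((cols + TILE - 1) / TILE, (rows + TILE - 1) / TILE).
```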