scxiao opened this issue 2 years ago
So we don't need to optimize the index calculations, since the JIT will do that very efficiently due to the lens and strides being constant. There can be a speed bump just from switching to JIT for contiguous; see the timings below for an input with lens {1024, 12, 128, 64} and strides {98304, 64, 8192, 1}:
```
gpu::code_object[code_object=13736,symbol_name=kernel,global=614400,local=1024,]: 2.43447ms
gpu::contiguous: 1.41236ms
```
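To illustrate why the constant lens and strides matter, here is a minimal sketch (not MIGraphX's actual JIT output; `transposed_offset` and the template parameters are hypothetical) of the index arithmetic with the shape baked in as compile-time constants, which lets the compiler strength-reduce the divisions and modulos:

```cpp
// Hypothetical sketch: the lens (L0..L3) and strides (S0..S3) are template
// parameters, so the specialized kernel sees them as constants and the
// divisions/modulos below can be folded into cheap shifts and multiplies.
template <int L0, int L1, int L2, int L3, int S0, int S1, int S2, int S3>
__device__ int transposed_offset(int linear_idx)
{
    // Decompose the contiguous output index into 4-D coordinates...
    int i3 = linear_idx % L3;
    int t  = linear_idx / L3;
    int i2 = t % L2;
    t      = t / L2;
    int i1 = t % L1;
    int i0 = t / L1;
    // ...and map them through the (constant) input strides.
    return i0 * S0 + i1 * S1 + i2 * S2 + i3 * S3;
}

// Instantiated for the shape quoted above:
//   transposed_offset<1024, 12, 128, 64, 98304, 64, 8192, 1>(idx)
```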
We can probably get a speed bump by doing tiling like in the paper, which is similar to the preload we use for broadcast shapes, except we won't be copying the entire tensor to LDS. And we can apply this optimization to all pointwise operators, not just contiguous.
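For reference, here is a minimal sketch of the tiled-transpose idea, written as a standalone CUDA-style kernel for a plain 2-D transpose (`transpose_tiled` and `TILE` are illustrative names, not MIGraphX's API; LDS corresponds to `__shared__` memory here). The point is that staging a tile through LDS makes both the global reads and the global writes coalesced; a real implementation would generalize this to arbitrary permutations and fuse it with other pointwise ops:

```cpp
#define TILE 32

// out is the cols x rows transpose of the rows x cols matrix in.
// Launch with blockDim = (TILE, TILE) and
// gridDim = ((cols + TILE - 1) / TILE, (rows + TILE - 1) / TILE).
__global__ void transpose_tiled(const float* in, float* out, int rows, int cols)
{
    // +1 padding avoids shared-memory (LDS) bank conflicts on the transposed reads.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x; // column in the input
    int y = blockIdx.y * TILE + threadIdx.y; // row in the input
    if(x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x]; // coalesced read

    __syncthreads();

    // Swap the block coordinates so writes to the transposed output are coalesced.
    x = blockIdx.y * TILE + threadIdx.x; // column in the output (= row in the input)
    y = blockIdx.x * TILE + threadIdx.y; // row in the output (= column in the input)
    if(x < rows && y < cols)
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y]; // coalesced write
}
```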
We are optimizing BERT performance now, and one important aspect is the contiguous op. Our current implementation of contiguous for transposed input shapes uses a straightforward approach, but its performance is poor at the moment. In the BERT model, contiguous takes about 12% of the total time, even though it only changes the memory layout of tensors. Looking deeper, contiguous is mainly used to transpose tensors, so we need a faster implementation of contiguous for transpose.
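For context, a rough sketch of that straightforward approach (simplified to 2-D; `contiguous_naive` is hypothetical, not the actual MIGraphX kernel): each thread writes one output element by mapping its linear index through the transposed input strides, so the writes are coalesced but the reads are strided, which is why it is slow:

```cpp
// Naive contiguous/transpose: one thread per output element.
// in holds the data in transposed (column-major) order; out is row-major.
__global__ void contiguous_naive(const float* in, float* out, int rows, int cols)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if(idx >= rows * cols)
        return;
    int r = idx / cols;
    int c = idx % cols;
    // The write to out[idx] is coalesced, but consecutive threads read
    // addresses that are `rows` elements apart, so the reads are not.
    out[idx] = in[c * rows + r];
}
```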
From searching online, one reference we can use is https://github.com/ap-hynninen/cutt, with the corresponding paper: https://arxiv.org/pdf/1705.01598.pdf