PyTorch implementation of Infini-Transformer from "Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention" (https://arxiv.org/abs/2404.07143)
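For reference, a minimal sketch of the per-segment Infini-attention step described in the paper: an ELU+1 feature map, retrieval from a compressive memory, a linear associative memory update, and a learned gate β that mixes memory retrieval with local dot-product attention. Tensor names and shapes here are illustrative assumptions, not taken from this repo's code (the paper also describes a delta-rule memory update, omitted here for brevity):

```python
import torch
import torch.nn.functional as F

def infini_attention_segment(q, k, v, mem, z, beta):
    """One Infini-attention segment (hypothetical sketch, not this repo's API).

    q, k, v: (batch, heads, seg_len, head_dim)
    mem:     (batch, heads, head_dim, head_dim)  compressive memory
    z:       (batch, heads, head_dim, 1)         normalization term
    beta:    (heads,) learned gate mixing memory vs. local attention
    """
    sigma_q = F.elu(q) + 1.0  # ELU+1 feature map from the paper
    sigma_k = F.elu(k) + 1.0

    # Retrieve long-term context from the memory built over previous segments.
    a_mem = (sigma_q @ mem) / (sigma_q @ z + 1e-6)

    # Standard causal dot-product attention within the current segment.
    a_dot = F.scaled_dot_product_attention(q, k, v, is_causal=True)

    # Linear (associative) memory update with the current segment's k, v.
    mem = mem + sigma_k.transpose(-2, -1) @ v
    z = z + sigma_k.sum(dim=-2, keepdim=True).transpose(-2, -1)

    # Learned gate blends long-term (memory) and local context.
    gate = torch.sigmoid(beta).view(1, -1, 1, 1)
    out = gate * a_mem + (1.0 - gate) * a_dot
    return out, mem, z
```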
Isn't it unnecessary to perform the q, k, v projections inside the loop? The data copying involved makes the whole attention operation slower. Ideally this would all be done with a fused kernel … a future personal project of mine.
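To make the suggestion concrete, here is a hedged sketch of hoisting the projections out of the segment loop: project q, k, v once over the full sequence (one GEMM each), then slice per segment inside the loop. Names like `wq`, `wk`, `wv`, `seg_len`, and `step_fn` are illustrative and do not refer to this repo's actual code; head reshaping and the gating parameter are omitted for brevity:

```python
import torch

def project_then_loop(x, wq, wk, wv, seg_len, step_fn, mem, z):
    """Hoist the q/k/v projections out of the per-segment loop (sketch).

    x:          (batch, seq_len, dim) full input sequence
    wq, wk, wv: nn.Linear projection modules
    step_fn:    per-segment attention step returning (out, mem, z)
    """
    # Single projection over the whole sequence instead of re-projecting
    # (and re-copying) inside every loop iteration.
    q, k, v = wq(x), wk(x), wv(x)

    outputs = []
    for start in range(0, x.size(1), seg_len):
        sl = slice(start, start + seg_len)
        # Basic slicing returns a view, so no extra copy of the projections.
        out, mem, z = step_fn(q[:, sl], k[:, sl], v[:, sl], mem, z)
        outputs.append(out)
    return torch.cat(outputs, dim=1), mem, z
```

The recurrence over the compressive memory still forces a sequential loop, but moving the projections outside it removes redundant matrix multiplies; a fused kernel, as suggested above, would go further by keeping the per-segment state in on-chip memory.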