PyTorch implementation of Infini-Transformer from "Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention" (https://arxiv.org/abs/2404.07143)
The MLPs in both transformer modules currently have ReLU hard-coded as the activation function. It would help to have options for nonlinear activations commonly used in recent LLMs (GeLU, SwiGLU, GeGLU, etc.)
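A minimal sketch of one way this could look, assuming a two-layer MLP like the ones in the transformer modules here; the names `build_mlp`, `GatedActivation`, and the `activation` argument are illustrative, not the repo's actual API. The gated variants (SwiGLU, GeGLU) double the width of the first projection so the gate and value halves each keep `hidden_dim` features.

```python
import torch
import torch.nn as nn


class GatedActivation(nn.Module):
    """Gated activation (SwiGLU / GeGLU style): splits the hidden
    projection in half and gates one half with the other."""

    def __init__(self, activation: nn.Module):
        super().__init__()
        self.activation = activation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, value = x.chunk(2, dim=-1)
        return self.activation(gate) * value


def build_mlp(dim: int, hidden_dim: int, activation: str = "relu") -> nn.Sequential:
    """Two-layer MLP with a configurable nonlinearity (hypothetical helper)."""
    simple = {"relu": nn.ReLU(), "gelu": nn.GELU()}
    gated = {"swiglu": nn.SiLU(), "geglu": nn.GELU()}  # gate nonlinearity per variant

    if activation in simple:
        return nn.Sequential(
            nn.Linear(dim, hidden_dim),
            simple[activation],
            nn.Linear(hidden_dim, dim),
        )
    if activation in gated:
        # Gated MLP: first projection produces 2 * hidden_dim features,
        # which GatedActivation splits into gate and value halves.
        return nn.Sequential(
            nn.Linear(dim, 2 * hidden_dim),
            GatedActivation(gated[activation]),
            nn.Linear(hidden_dim, dim),
        )
    raise ValueError(f"Unknown activation: {activation}")


if __name__ == "__main__":
    mlp = build_mlp(dim=512, hidden_dim=2048, activation="swiglu")
    out = mlp(torch.randn(2, 16, 512))
    print(out.shape)  # torch.Size([2, 16, 512])
```

Passing the activation name (or an `nn.Module`) down from the transformer constructors would keep ReLU as the default while letting users opt into the newer activations without touching the attention code.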