Open LisandraMoura opened 2 months ago
Location in document: undefined
Selected HTML:
Foundation models (FMs), or large models pretrained on massive data then adapted for downstream tasks, have emerged as an effective paradigm in modern machine learning. The backbone of these FMs are often sequence models, operating on arbitrary sequences of inputs from a wide variety of domains such as language, images, speech, audio, time series, and genomics \parencitesutskever2014sequence,dosovitskiy2020image,oord2016wavenet,brown2020language,ismail2019deep,poli2023hyena. While this concept is agnostic to a particular choice of model architecture, modern FMs are predominantly based on a single type of sequence model: the Transformer \parencitevaswani2017attention and its core attention layer \parencitebahdanau2015neural The efficacy of self-attention is attributed to its ability to route information densely within a context window, allowing it to model complex data. However, this property brings fundamental drawbacks: an inability to model anything outside of a finite window, and quadratic scaling with respect to the window length. An enormous body of research has appeared on more efficient variants of attention to overcome these drawbacks \parencitetay2022efficient, but often at the expense of the very properties that makes it effective. As of yet, none of these variants have been shown to be empirically effective at scale across domains.
Recently, structured state space sequence models (SSMs) \parencitegu2021combining,gu2022efficiently have emerged as a promising class of architectures for sequence modeling. These models can be interpreted as a combination of recurrent neural networks (RNNs) and convolutional neural networks (CNNs), with inspiration from classical state space models \parencitekalman1960new. This class of models can be computed very efficiently as either a recurrence or convolution, with linear or near-linear scaling in sequence length. Additionally, they have principled mechanisms for modeling long-range dependencies \parencitegu2020hippo in certain data modalities, and have dominated benchmarks such as the Long Range Arena \parencitetay2021long. Many flavors of SSMs \parencitegu2022efficiently,gupta2022diagonal,gu2022parameterization,li2023makes,ma2023mega,smith2023s5,orvieto2023resurrecting have been successful in domains involving continuous signal data such as audio and vision \parencitegoel2022raw,saon2023diagonal,nguyen2022s4nd. However, they have been less effective at modeling discrete and information-dense data such as text.
Hello @LisandraMoura, thanks for the issue report! We are reviewing your report and will address it as soon as possible.
Description
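This excerpt and others in the article "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" have rendering errors: the \parencite citation commands appear as raw text (for example, \parencitevaswani2017attention) instead of rendered citations.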
(Optional:) Please add any files, screenshots, or other information here.
No response
(Required) What is this issue most closely related to? Select one.
Choose One
Internal issue ID
b720af69-ab6e-467d-b583-79026bae98a5
Paper URL
https://arxiv.org/html/2312.00752v2
Browser
Chrome/128.0.0.0
Device Type
Desktop