Open bonham79 opened 3 months ago
I don't know anything about how these work yet, but they're the only "new architecture" in a long time, so why not. Any reason to think they're more or less applicable to our class of problems?
Their main selling point is memory that scales linearly with sequence length. For our class of problems that's not really a concern, but it would let us further shrink the memory footprint of our architectures, letting us go hog wild with batch sizes and model sizes on lower-end hardware.
Theoretical justification? We've seen LSTMs generally outperform transformers on a lot of our tasks (per Adam's paper, contra Wu). So having an LSTM-like model that competes with transformers lets us dig our heels in further on the power of modeling assumptions.
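To put rough numbers on the scaling argument, here is a back-of-the-envelope sketch. It assumes naive attention that materializes the full score matrix and an SSM that keeps one hidden state per time step for backprop. The function names, head count, and state size are hypothetical, picked only for illustration; nothing here is measured on our models.

```python
def attention_score_floats(seq_len: int, num_heads: int = 1) -> int:
    """Floats in the (seq_len x seq_len) score matrix, per example, per layer."""
    return num_heads * seq_len * seq_len


def ssm_activation_floats(seq_len: int, state_size: int) -> int:
    """Floats needed to store one hidden state per time step, per example."""
    return seq_len * state_size


# Illustrative settings only (assumed, not taken from any real config).
for seq_len in (32, 256, 2048):
    attn = attention_score_floats(seq_len, num_heads=4)
    ssm = ssm_activation_floats(seq_len, state_size=16)
    print(f"len={seq_len:5d}  attention={attn:10d}  ssm={ssm:8d}")
```

At the short sequence lengths we usually deal with, the gap is small, which matches the point that this isn't a pressing concern for us. The difference only becomes dramatic as sequences get long.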
But really my only reason is:
(Lowest of the low priorities)
SSMs have been making the rounds, but people have only cared about them for 'major' tasks (NMT, speech, LLMs). Since they're a kind of recurrent model, much like LSTMs, and we see better performance from that type of model on our type of tasks, it may be fun to implement an SSM decoder and try it out.
Beyond theoretical interest, they're supposed to be more memory efficient than transformers, so we can probably run some wicked batch sizes if they're implemented well.
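For anyone who hasn't looked at these yet, the family resemblance to an RNN is easiest to see in the recurrence itself. Below is a toy sketch in plain Python of a linear state space layer with a diagonal state matrix: `h_t = a * h_{t-1} + b * x_t`, `y_t = c . h_t`. The names `ssm_step` and `ssm_scan` and the parameter values are made up for illustration. Real SSM layers learn and discretize these parameters and use much faster scan or convolution tricks, so treat this as a picture of the idea rather than a proposed implementation.

```python
from typing import List, Sequence


def ssm_step(
    state: Sequence[float], x: float, a: Sequence[float], b: Sequence[float]
) -> List[float]:
    """One recurrent update with a diagonal state matrix: h_t = a * h_{t-1} + b * x_t."""
    return [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, state, b)]


def ssm_scan(
    xs: Sequence[float], a: Sequence[float], b: Sequence[float], c: Sequence[float]
) -> List[float]:
    """Run the recurrence over a scalar input sequence, emitting y_t = c . h_t."""
    state = [0.0] * len(a)  # fixed-size state, independent of sequence length
    ys = []
    for x in xs:
        state = ssm_step(state, x, a, b)
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, state)))
    return ys


# Toy parameters (hypothetical): a single state dimension that decays by half.
outputs = ssm_scan([1.0, 1.0, 1.0], a=[0.5], b=[1.0], c=[1.0])
print(outputs)
```

Like an LSTM decoder, this carries a fixed-size state from step to step, so decoding cost per step doesn't grow with how much has already been generated. Unlike an LSTM, the update is linear in the state, which is what makes the parallel training tricks possible.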