NVlabs / MambaVision

Official PyTorch Implementation of MambaVision: A Hybrid Mamba-Transformer Vision Backbone
https://arxiv.org/abs/2407.08083

Did you guys use S6 layer or the S4 layer? #1

Closed · siddagra closed this 1 month ago

siddagra commented 2 months ago

I just had a tiny question: did you use the S6 layer or the S4 layer? Mamba itself typically uses S6, so I was wondering.

ahatamiz commented 2 months ago

Yes, we used S6, which leverages the power of "selectivity".

It would be interesting to try S4 as well at this point. But our main contribution is showing how to use Mamba for vision efficiently (throughput-wise) while still retaining SOTA performance.
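For anyone landing here, a rough sketch of what "selectivity" means in practice: in S6 the Δ, B and C parameters are computed from the input, whereas in S4 they are fixed. This is purely illustrative PyTorch (a naive sequential scan), not the fused selective-scan kernel used in Mamba/MambaVision:

```python
import torch
import torch.nn as nn


class SelectiveSSMSketch(nn.Module):
    """Toy S6-style (selective) scan: Delta, B and C depend on the input,
    unlike S4 where they are fixed (LTI). Plain recurrence, no fused kernel."""

    def __init__(self, d_model, d_state=16):
        super().__init__()
        self.d_state = d_state
        # A stays input-independent (as in both S4 and S6); stored as log for stability
        self.A_log = nn.Parameter(
            torch.log(torch.arange(1, d_state + 1).float()).repeat(d_model, 1))
        # Input-dependent ("selective") projections -- the key difference from S4
        self.to_delta = nn.Linear(d_model, d_model)
        self.to_B = nn.Linear(d_model, d_state)
        self.to_C = nn.Linear(d_model, d_state)
        self.D = nn.Parameter(torch.ones(d_model))  # skip connection

    def forward(self, x):                            # x: (batch, length, d_model)
        b, l, d = x.shape
        A = -torch.exp(self.A_log)                   # (d, n), negative real for stability
        delta = torch.nn.functional.softplus(self.to_delta(x))  # (b, l, d)
        B, C = self.to_B(x), self.to_C(x)            # each (b, l, n)
        h = x.new_zeros(b, d, self.d_state)          # hidden state
        ys = []
        for t in range(l):                           # sequential scan, illustrative only
            dA = torch.exp(delta[:, t].unsqueeze(-1) * A)           # (b, d, n)
            dB = delta[:, t].unsqueeze(-1) * B[:, t].unsqueeze(1)   # (b, d, n)
            h = dA * h + dB * x[:, t].unsqueeze(-1)
            ys.append((h * C[:, t].unsqueeze(1)).sum(-1) + self.D * x[:, t])
        return torch.stack(ys, dim=1)                # (b, l, d)
```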

siddagra commented 2 months ago

Thanks for the information :)

Is the paper released anywhere? I found several papers with similar names, but they don't seem to be from NVIDIA and don't have exactly the same title.

The reason I asked my original S4/S6 question was this section of the Mamba paper:

No Free Lunch: Continuous-Discrete Spectrum. Structured SSMs were originally defined as discretizations of continuous systems (1), and have had a strong inductive bias toward continuous-time data modalities such as perceptual signals (e.g. audio, video). As discussed in Sections 3.1 and 3.5, the selection mechanism overcomes their weaknesses on discrete modalities such as text and DNA; but this conversely can impede their performance on data that LTI SSMs excel on. Our ablations on audio waveforms examine this tradeoff in more detail.

[Figure: Mamba paper ablation on audio waveform modeling, comparing S4 against selective (S6) variants]

The graph shows that for audio data they actually benefited from removing some of the "selective" mechanisms and using the older S4 SSM layer, since S6, according to them, is tuned for discrete data rather than continuous data. If you have any additional insights on this, I would love to hear them.

They also state:

However, on the right side, we keep the outer layers of the U-Net Mamba-S4 and ablate only the inner layers. The performance differences shrink dramatically; this reinforces the hypothesis that layers closer to the raw audio signal should be LTI, but once they are “tokenized” and compressed by the outer layers, the inner layers no longer need to be LTI.

Hence, I thought it would be a good idea to try something with S4 or S5 instead, since images are also a form of continuous data compared to text (which is what Mamba's S6 is designed for). You may try it out if you like; however, I am unsure how well it would work once you tokenise the image / use patch embedding, since the representations may be more discrete than continuous after that step. I would love to collaborate on a formulation that exploits this property if you are also interested.
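For concreteness, this is the kind of ViT-style patch embedding I mean by "tokenising": a strided convolution that turns the continuous pixel grid into a discrete token sequence. Just a generic example, not your actual stem:

```python
import torch
import torch.nn as nn

# Standard ViT-style patch embedding (illustrative only; MambaVision has its own stem):
# one strided convolution maps the image to a sequence of patch tokens.
patch_embed = nn.Conv2d(in_channels=3, out_channels=768, kernel_size=16, stride=16)

img = torch.randn(1, 3, 224, 224)                      # continuous pixel grid
tokens = patch_embed(img).flatten(2).transpose(1, 2)   # (1, 196, 768) token sequence
print(tokens.shape)
```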

siddagra commented 2 months ago

Found the paper.

ahatamiz commented 2 months ago

Hi @siddagra , thanks for the note. This is indeed an interesting avenue to further investigate.

Images are still tokenized in our case, but using S4 or S5 may actually be beneficial given the continuous nature of the underlying signal.
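As a point of reference, an LTI S4-style layer roughly looks like the sketch below: A, B and C are fixed per channel, so the whole layer collapses into a global convolution whose kernel can be precomputed. This is a diagonal toy version (closer to S4D), for illustration only, not the official S4 implementation:

```python
import torch
import torch.nn as nn


class LTISSMSketch(nn.Module):
    """Toy S4D-style (LTI) layer: A, B, C are fixed parameters, independent of the
    input, so the layer reduces to a depthwise global convolution."""

    def __init__(self, d_model, d_state=16):
        super().__init__()
        self.A_log = nn.Parameter(
            torch.log(torch.arange(1, d_state + 1).float()).repeat(d_model, 1))
        self.B = nn.Parameter(torch.randn(d_model, d_state) / d_state ** 0.5)
        self.C = nn.Parameter(torch.randn(d_model, d_state) / d_state ** 0.5)
        self.log_dt = nn.Parameter(torch.zeros(d_model))   # per-channel step size

    def forward(self, x):                                  # x: (batch, length, d_model)
        b, l, d = x.shape
        dt = torch.exp(self.log_dt).unsqueeze(-1)          # (d, 1)
        A = -torch.exp(self.A_log)                         # (d, n)
        dA, dB = torch.exp(dt * A), dt * self.B            # ZOH-style discretization
        # LTI => the length-l convolution kernel K can be built once, for all inputs
        powers = dA.unsqueeze(-1) ** torch.arange(l, device=x.device)  # (d, n, l)
        K = torch.einsum('dn,dnl->dl', self.C * dB, powers)            # (d, l)
        # causal depthwise convolution of x with K, computed here via FFT
        x_f = torch.fft.rfft(x.transpose(1, 2), n=2 * l)               # (b, d, .)
        K_f = torch.fft.rfft(K, n=2 * l)                               # (d, .)
        y = torch.fft.irfft(x_f * K_f, n=2 * l)[..., :l]               # (b, d, l)
        return y.transpose(1, 2)                                       # (b, l, d)
```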

Our goal in MambaVision was to create the foundation for models that use the Mamba formulation but remain practical as well -- SOTA performance, fast throughput, and no specialized kernels beyond Mamba's own.

Indeed, additional investigation of layer types such as S4 or S5 would further propel this line of work, especially with a design tailored to vision applications.

Kind Regards