NVlabs / MambaVision

Official PyTorch Implementation of MambaVision: A Hybrid Mamba-Transformer Vision Backbone
https://arxiv.org/abs/2407.08083

Questions about Extracting and Fine-Tuning Vectors from MambaVision #15

Closed Adam-Serghini closed 1 month ago

Adam-Serghini commented 1 month ago

Hello,

First, thank you for sharing your work; I find the architecture very interesting and useful.

I have a few questions regarding the extraction and fine-tuning of the feature vectors used for classification:

Extracting Vectors:
I have commented out the line responsible for the classification step to obtain the vectors directly. Is this the correct approach to extract these vectors?

Vector Size: I noticed that the size of the vector is 640. Is it possible to reduce the size of this vector? If so, how would that impact the performance of the model?

Fine-Tuning Without Labels: I would like to fine-tune the model on my images, but I do not have associated class labels. Would it be appropriate to use this model as an encoder and attach a decoder after it for this purpose? If so, could you provide any guidance or best practices for this approach?

Thank you for your assistance and for the excellent work on this project.

ahatamiz commented 1 month ago

Hi @Adam-Serghini , thanks for the question and sorry for this late response.

Regarding feature extraction, the output of this line can be used to directly obtain the average-pooled (flattened) features.

We also recently published models on HuggingFace. You can extract these features simply as follows:

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("nvidia/MambaVision-T-1K", trust_remote_code=True)
inputs = torch.rand(1, 3, 224, 224)  # example input batch (B, C, H, W)
out_avg_pool, features = model(inputs)
out_avg_pool contains the average-pooled features, the same as above. Additionally, features holds the multi-scale feature maps from each stage. A full example is provided here.
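For reference, a quick way to sanity-check the two outputs is to print their shapes. The exact shapes below are assumptions for mamba_vision_T with a 224x224 input, following the dimensions discussed next:

# Continuing from the snippet above
print(out_avg_pool.shape)          # torch.Size([1, 640]) -- flattened, average-pooled vector
for i, f in enumerate(features):
    print(i, f.shape)              # per-stage maps, e.g. [1, 80, 56, 56] ... [1, 640, 7, 7]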

Now regarding the vector size, let's examine it in detail. For this, I am using mamba_vision_T as an example.

Assuming a batch size of 1, the output of the last stage is a feature map of size [1, 640, 7, 7]. After applying nn.AdaptiveAvgPool2d(1) and flattening, it becomes a (latent) vector of size 640. The 640 is determined by the embedding dimension of the model, which in this case is dim=80; the channel count doubles at each subsequent stage, so the last stage ends up with 80 x 2^3 = 640 channels. You can further decrease dim and define a new MambaVision network variant, but that reduces the number of parameters, which can impact performance to some degree.
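As a small self-contained illustration of that pooling step (the tensor below is just a stand-in for the last-stage output, not a real forward pass):

import torch
import torch.nn as nn

# Stand-in for the last-stage feature map of mamba_vision_T at batch size 1
x = torch.rand(1, 640, 7, 7)
vec = torch.flatten(nn.AdaptiveAvgPool2d(1)(x), 1)  # shape [1, 640]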

For fine-tuning without labels, a strong option is the MAE (masked auto-encoding) approach. Given only images, MambaVision (or any backbone, for that matter) is trained to reconstruct masked-out image regions. It's a powerful technique that is worth trying.
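A rough sketch of that idea, using the HuggingFace model as the encoder: the masking here is done in pixel space and the decoder is a toy placeholder, so treat it as a starting point rather than a faithful MAE implementation.

import torch
import torch.nn as nn
from transformers import AutoModel

# Encoder: returns (avg_pooled, multi_scale_features) as shown above
backbone = AutoModel.from_pretrained("nvidia/MambaVision-T-1K", trust_remote_code=True)

# Toy decoder that maps the pooled 640-d vector back to an image (illustrative only)
decoder = nn.Sequential(
    nn.Linear(640, 7 * 7 * 64),
    nn.Unflatten(1, (64, 7, 7)),
    nn.Upsample(scale_factor=32, mode="bilinear"),  # 7x7 -> 224x224
    nn.Conv2d(64, 3, kernel_size=3, padding=1),
)

optimizer = torch.optim.AdamW(list(backbone.parameters()) + list(decoder.parameters()), lr=1e-4)

images = torch.rand(4, 3, 224, 224)  # your unlabeled batch

# Randomly mask ~60% of 16x16 patches in pixel space (simplified masking, not MAE token dropping)
B, C, H, W = images.shape
patch = 16
mask = (torch.rand(B, 1, H // patch, W // patch) < 0.6).float()
mask = mask.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)

pooled, _ = backbone(images * (1 - mask))      # encode the masked images
recon = decoder(pooled)                        # reconstruct the full image

optimizer.zero_grad()
loss = ((recon - images) ** 2 * mask).sum() / (mask.sum() * C).clamp(min=1.0)  # loss on masked pixels only
loss.backward()
optimizer.step()

In practice you would wrap this in a DataLoader loop with real augmentations, and you could also reconstruct from the multi-scale features instead of the pooled vector.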

You can also use other approaches such as SimCLR and its variants, which are based on contrastive learning.
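A similarly rough sketch of the contrastive route (a simplified NT-Xent-style loss; the augmentations, projector sizes, and temperature below are placeholders, not a tuned recipe):

import torch
import torch.nn.functional as F
from transformers import AutoModel

backbone = AutoModel.from_pretrained("nvidia/MambaVision-T-1K", trust_remote_code=True)
projector = torch.nn.Sequential(torch.nn.Linear(640, 256), torch.nn.ReLU(), torch.nn.Linear(256, 128))

def embed(x):
    pooled, _ = backbone(x)                 # [B, 640] average-pooled features
    return F.normalize(projector(pooled), dim=1)

# Two differently augmented views of the same unlabeled batch (random tensors as placeholders)
view1 = torch.rand(8, 3, 224, 224)
view2 = torch.rand(8, 3, 224, 224)
z1, z2 = embed(view1), embed(view2)

# Simplified NT-Xent: each sample's positive is its other view; other samples act as negatives
temperature = 0.1
logits = z1 @ z2.t() / temperature          # [B, B] cosine-similarity matrix
labels = torch.arange(z1.size(0))
loss = F.cross_entropy(logits, labels)
loss.backward()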

I hope the above helps.