Closed CacatuaAlan closed 3 months ago
Actually, the SSM hyperparameters (including d_state, the expand factor, etc.) https://github.com/EasonXiao-888/GrootVL/blob/0857f0a12f6e46873d05e4b9a73d987e97b149c3/GrootV/classification/models/grootv.py#L190 are consistent with most visual Mambas (e.g., the original VMamba and LocalMamba). The 96 you point out is likely d_model (the channel width of the input image feature) of GrootV-Small, following most previous visual backbones (e.g., InternImage, Swin Transformer). Thanks for your question; more information about d_state for vision Mamba can be found at https://github.com/MzeroMiko/VMamba/issues/206
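To make the distinction concrete, here is a minimal sketch (not GrootV's actual code) of how d_model, the expand factor, and d_state size the shapes in a selective-SSM scan. The specific values (d_model=96, expand=2, d_state=1, 196 tokens) are illustrative assumptions; per-channel state size is d_state, independent of the token count.

```python
# Hypothetical shape sketch of a selective-SSM scan; not GrootV's implementation.
import numpy as np

d_model = 96   # channel width of the input image feature (e.g., GrootV-Small stem)
expand = 2     # expansion factor: inner width = expand * d_model
d_state = 1    # per-channel SSM state size (small in many vision Mambas)
seq_len = 196  # number of visual tokens (e.g., a 14x14 patch grid)

d_inner = expand * d_model                      # 192 channels after expansion
x = np.random.randn(seq_len, d_inner)           # expanded token sequence
A = -np.exp(np.random.randn(d_inner, d_state))  # state matrix (negative for stability)
B = np.random.randn(seq_len, d_state)           # input-dependent input projection
C = np.random.randn(seq_len, d_state)           # input-dependent output projection
dt = np.abs(np.random.randn(seq_len, d_inner))  # per-token, per-channel step sizes

# Sequential scan: each channel carries a d_state-dim hidden state across tokens.
h = np.zeros((d_inner, d_state))
ys = []
for t in range(seq_len):
    dA = np.exp(dt[t][:, None] * A)             # discretized state transition
    dB = dt[t][:, None] * B[t][None, :]         # discretized input matrix
    h = dA * h + dB * x[t][:, None]             # recurrent state update
    ys.append(h @ C[t])                         # (d_inner,) output for token t
y = np.stack(ys)
print(y.shape)  # (196, 192): output width is set by d_model * expand, not d_state
```

Note that the hidden state h has shape (d_inner, d_state), so memory capacity per channel scales with d_state while the output width is fixed by d_model and the expand factor.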
Sorry about my mistake, and thank you for your patient response. What do you think about the relationship between d_state and the number of tokens?
Hi! I notice that you set d_state=96, which is different from most vision Mamba models. How do you know whether d_state is large enough?