OpenGVLab / Vision-RWKV

Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures
https://arxiv.org/abs/2403.02308
Apache License 2.0

What parameters are important to tune when using VRWKV block as drop-in replacement for attention? #10

Open yxchng opened 2 months ago

yxchng commented 2 months ago

Currently, my model is performing worse when using the VRWKV block as a drop-in replacement for attention. Do you have any suggestions on which parameters are most important to tune?

duanduanduanyuchen commented 2 months ago

Hi, thanks for your interest in our work! I found it important to set an appropriate drop path rate for VRWKV (usually a little higher than for ViT). Weight initialization also has some influence on the final result.
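
For what it's worth, here is a minimal sketch of that kind of setup, assuming a plain-PyTorch, ViT-style recipe. The depth, the 0.2 peak drop path rate, the linear ramp, and the init values below are illustrative assumptions, not values taken from the paper or this repo:

```python
import torch
import torch.nn as nn


class DropPath(nn.Module):
    """Stochastic depth: randomly drop the whole residual branch per sample."""

    def __init__(self, p: float = 0.0):
        super().__init__()
        self.p = p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.p == 0.0 or not self.training:
            return x
        keep = 1.0 - self.p
        # One Bernoulli mask per sample, broadcast over all remaining dims.
        mask = x.new_empty((x.shape[0],) + (1,) * (x.ndim - 1)).bernoulli_(keep)
        return x * mask / keep


depth = 12        # number of blocks (illustrative)
peak_dpr = 0.2    # "a little higher" than a typical ViT default of ~0.1 (assumption)
# Linear ramp from 0 at the first block to peak_dpr at the last block.
dprs = [peak_dpr * i / max(depth - 1, 1) for i in range(depth)]

# drop_paths[i] would wrap the residual branches of the i-th VRWKV block, e.g.:
#   x = x + drop_paths[i](spatial_mix(norm1(x)))
#   x = x + drop_paths[i](channel_mix(norm2(x)))
drop_paths = nn.ModuleList([DropPath(p) for p in dprs])


def init_weights(m: nn.Module) -> None:
    """ViT-style truncated-normal init for linear layers."""
    if isinstance(m, nn.Linear):
        nn.init.trunc_normal_(m.weight, std=0.02)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

# model.apply(init_weights)  # 'model' is your own module holding the blocks
```

The linear ramp puts the highest drop probability on the deepest blocks, which is the usual stochastic-depth convention in ViT-style training; only the peak value would be nudged upward for VRWKV per the suggestion above.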

yxchng commented 2 months ago

Ok, I will try these suggestions. Another question: have you experimented with Mamba as well? Mamba seems to be a more popular architecture than RWKV, and I wonder whether your experiments show RWKV to be better.

BlinkDL commented 2 months ago

RWKV6 is better than Mamba in all tests :) https://arxiv.org/abs/2404.05892 (example: MQAR results figure)

BlinkDL commented 2 months ago

Currently, my model is performing worse when using the VRWKV block as a drop-in replacement for attention. Do you have any suggestions on which parameters are most important to tune?

try VRWKV6

yxchng commented 2 months ago

I am aware of the performance in NLP, but I am asking more about vision. For example, https://github.com/MzeroMiko/VMamba shows much better performance with Mamba. Hopefully you can share some insights.

an-ys commented 2 months ago

I am aware of the performance in NLP, but I am asking more about vision. For example, https://github.com/MzeroMiko/VMamba shows much better performance with Mamba. Hopefully you can share some insights.

I should probably file a separate issue for this, but I am also interested in how VRWKV compares to Vision Mamba and VMamba, since only ViT was used as the baseline in the paper.

BlinkDL commented 2 months ago

I am aware of the performance in NLP, but I am asking more about vision. For example, https://github.com/MzeroMiko/VMamba shows much better performance with Mamba. Hopefully you can share some insights.

VMamba uses more tricks, so they would get even better performance if they switched to RWKV6 😂

duanduanduanyuchen commented 2 months ago

@yxchng @an-ys VMamba is designed as a hierarchical backbone, so at small model sizes and with not-that-large pre-training datasets it will get much better results than non-hierarchical ones (e.g., ViT vs. PVT or Swin). We usually compare hierarchical and non-hierarchical backbones separately. For transformer-like models, it is believed that this gap closes as the model size and the scale of the pre-training data increase.

We focus on exploring linear-attention models in the vision field, so we designed VRWKV to have performance and scale-up stability comparable to the most frequently used model (ViT). For these reasons, we pay more attention to the stability of training after scaling up (and with larger pre-training datasets), which is particularly important for linear-attention models (including RWKV and Mamba). That is why we proposed large-size models, demonstrating that VRWKV can keep performing better as it scales up, which has not been verified for Mamba-based models so far.