Thank you for your work and the elegant theoretical derivation! I have some questions.
Have you compared the training time with other models, such as Vim, and what is the main reason for the longer time? And what about the ablation experiments on the number of nodes?
Thanks again.
@weilli Thank you for your interest, and I apologize for the delayed response.
In our prior tests on a V100 GPU, our method's inference throughput was 392 img/s, compared with 374 img/s for vanilla VMamba. Building a minimum spanning tree introduces time overhead: initially, we constructed a tree for each block, which reduced throughput to 281 img/s. Notably, allowing blocks within the same stage to share a tree preserves accuracy and restores efficiency to 392 img/s. We will provide a detailed comparison in the camera-ready version.
Given a sequence of length L with an established minimum spanning tree, in the single-vertex setting we treat that vertex as the root of the tree and aggregate features from all other vertices, which runs in O(L). For the all-vertices setting, a naive approach treats each vertex as a root separately, resulting in O(L^2) complexity. In contrast, we use a dynamic programming algorithm: a random vertex is chosen as the root, features are aggregated from the leaves to the root, and the results are then propagated from the root back to the leaves, achieving the same result in O(L). A rough sketch of this two-pass procedure is given below. For the node ablation, please refer to Table 6 in the manuscript.
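To make the two-pass idea concrete, here is a minimal sketch (not the code from our repo): it assumes a hypothetical per-hop decay factor `gamma` as a stand-in for the actual feature-aggregation rule, and a hypothetical adjacency list `adj` for the minimum spanning tree. Every vertex receives the distance-weighted aggregate of all L vertex features in two linear passes instead of re-rooting at each vertex.

```python
def all_vertex_aggregate(features, adj, gamma=0.9, root=0):
    """For every vertex v, compute sum_u gamma**dist(v, u) * features[u]
    over the tree given by adjacency list `adj`, in O(L) via two passes."""
    L = len(features)
    parent = [-1] * L
    visited = [False] * L
    order, stack = [], [root]
    visited[root] = True
    while stack:                                  # iterative DFS: parents before children
        v = stack.pop()
        order.append(v)
        for u in adj[v]:
            if not visited[u]:
                visited[u] = True
                parent[u] = v
                stack.append(u)

    down = list(features)                         # pass 1: aggregate leaves -> root
    for v in reversed(order):                     # children processed before parents
        if parent[v] != -1:
            down[parent[v]] += gamma * down[v]

    total = [0.0] * L                             # pass 2: propagate root -> leaves
    for v in order:                               # parents processed before children
        if parent[v] == -1:
            total[v] = down[v]
        else:
            # contribution from outside v's subtree, routed through its parent
            total[v] = down[v] + gamma * (total[parent[v]] - gamma * down[v])
    return total


# Example on a 4-vertex path 0-1-2-3:
# adj = [[1], [0, 2], [1, 3], [2]]
# all_vertex_aggregate([1.0, 1.0, 1.0, 1.0], adj)
```

The second pass reuses the root's result: for each child, the aggregate from outside its subtree is recovered from the parent's total minus the child's own subtree contribution, which is what removes the O(L^2) re-rooting cost.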
If helpful, feel free to star ⭐️ the repo ❤️❤️❤️.