Thank you for your interest. Since the trained TransNeXt model can use various modes for inference, the FLOPs can significantly vary depending on the choice between position bias interpolation or extrapolation and whether the linear complexity inference mode is enabled. Therefore, the situation is quite complex.
For the TransNeXt model trained on the ADE20K dataset using the UPerNet method at a fixed resolution of 512x512, there is no choice between position bias interpolation and extrapolation. The FLOPs calculated using fvcore at a resolution of 512x512 are as follows:
Model | FLOPs |
---|---|
TransNeXt-Tiny | 245G |
TransNeXt-Small | 273G |
TransNeXt-Base | 319G |
When performing inference at higher resolutions, such as 512x2048, position bias needs to be either interpolated or extrapolated. The FLOPs for the extrapolation scheme are:
Model | FLOPs |
---|---|
TransNeXt-Tiny | 1029G |
TransNeXt-Small | 1187G |
TransNeXt-Base | 1401G |
For the interpolation scheme at the same resolution:
Model | FLOPs |
---|---|
TransNeXt-Tiny | 1037G |
TransNeXt-Small | 1201G |
TransNeXt-Base | 1422G |
Clearly, the interpolation scheme results in more FLOPs due to the extra bilinear interpolation operations on the position bias calculated at 800x800 resolution and interpolated to the actual resolution.
The current implementation of TransNeXt uploaded on GitHub shows that the extrapolation scheme runs much faster than the interpolation scheme, and it is recommended for future use for better model performance and faster model speed.
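The interpolation scheme described above can be sketched as follows. A learned position-bias table (here a random placeholder) sized for the 800x800 reference resolution is bilinearly resized to the grid of the actual inference resolution; the shapes and names are illustrative assumptions, not TransNeXt's actual implementation:

```python
# Hedged sketch of bilinear position-bias interpolation.
import torch
import torch.nn.functional as F

num_heads = 8
ref_h, ref_w = 25, 25   # illustrative bias grid at the 800x800 reference resolution
tgt_h, tgt_w = 16, 64   # illustrative bias grid at a 512x2048 inference resolution

bias_ref = torch.randn(num_heads, ref_h, ref_w)

# F.interpolate expects (N, C, H, W); treat the heads as channels.
bias = F.interpolate(
    bias_ref.unsqueeze(0), size=(tgt_h, tgt_w),
    mode="bilinear", align_corners=False,
).squeeze(0)

print(bias.shape)  # torch.Size([8, 16, 64])
```

These resize operations run at every attention layer, which is where the extra FLOPs of the interpolation scheme come from.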
It is worth noting that the reported FLOPs above are for the normal mode, where the model has quadratic complexity. For larger resolutions such as 512x2048, the FLOPs are significantly higher than those of pure convolution models and Mamba-series models. However, TransNeXt also has a linear-complexity inference mode. To achieve this, we only need to fix the pool size to (16, 16), the pool size used during 512x512 training. This makes the aggregated attention layers in stages 1-3 have linear complexity. The FLOPs (using the position bias extrapolation scheme) are:
Model | FLOPs |
---|---|
TransNeXt-Tiny | 978G |
TransNeXt-Small | 1089G |
TransNeXt-Base | 1268G |
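The quadratic-vs-linear behavior can be sketched with a simple token-count model, a rough proxy for attention cost (query tokens times pooled key/value tokens). The stage-1 feature-map sizes and `sr_ratio` below are illustrative assumptions:

```python
def attn_cost(H, W, sr_ratio, pool_cap=None):
    """Rough proxy for aggregated-attention cost:
    query tokens x pooled key/value tokens. Not an exact FLOP count."""
    n_q = H * W
    h_pool, w_pool = H // sr_ratio, W // sr_ratio
    if pool_cap is not None:  # linear mode: cap the pool size
        h_pool, w_pool = min(h_pool, pool_cap), min(w_pool, pool_cap)
    return n_q * h_pool * w_pool

# Illustrative stage-1 feature maps (stride 4) at 512x512 and 512x2048.
base   = attn_cost(128, 128, sr_ratio=8)               # 512x512, normal
normal = attn_cost(128, 512, sr_ratio=8)               # 512x2048, normal
linear = attn_cost(128, 512, sr_ratio=8, pool_cap=16)  # 512x2048, linear

print(normal // base)  # 16: cost grew quadratically for 4x the pixels
print(linear // base)  # 4: cost grew linearly with the pixel count
```

With the pool capped at 16x16, the key/value side stays constant, so the cost scales only with the number of query tokens.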
I conducted a simple experiment by modifying the pool size calculation to:

```python
H_pool, W_pool = min(H // sr_ratio, 16), min(W // sr_ratio, 16)
```
This slightly sacrifices the performance of MS+ inference. The multi-scale evaluation results are:
Model | MS+ (mIoU) |
---|---|
TransNeXt-Tiny | 51.2 |
TransNeXt-Small | 52.3 |
TransNeXt-Base | 53.4 |
For Mask R-CNN on the COCO dataset, using fvcore for measurements:
For inference with extrapolation at resolution 1280x800:
Model | FLOPs |
---|---|
TransNeXt-Tiny | 349G |
TransNeXt-Small | 501G |
TransNeXt-Base | 709G |
For inference with interpolation at resolution 1280x800:
Model | FLOPs |
---|---|
TransNeXt-Tiny | 356G |
TransNeXt-Small | 516G |
TransNeXt-Base | 728G |
What About Linear Complexity Mode for Mask R-CNN?
Since Mask R-CNN requires dynamic-resolution training, there is currently no scheme analogous to the ADE20K one (single-scale training followed by larger-scale linear inference). How to customize the linear inference strategy of the pretrained model to achieve better results requires further experimentation; using the linear mode during the training phase might be a better approach.
When the detection models were trained in earlier experiments, the code for generating position coordinates was still slow, so the published weights were trained with the interpolation scheme. Performance evaluated with the extrapolation scheme does not show a significant improvement; it is rather close to the interpolation scheme's performance. Theoretically, COCO detection models trained with the extrapolation scheme would outperform the currently published models. If computational resources allow in the future, we may consider updating the model weights with versions trained under the extrapolation scheme.
Overall, TransNeXt's aggregated attention can achieve both a global receptive field and linear complexity in the linear mode, similar to Mamba-type vision models, but their inherent visual priors are not the same. Therefore, if possible, I hope the FLOPs and ADE20K results under the linear inference mode can also be included in the citation, as I believe this comparison would be more interesting and more convincing. Additionally, the extrapolation scheme will be the main strategy going forward, as it provides better performance and significantly faster runtime; future updates may include models trained with this scheme for further improved accuracy. The interpolation scheme remains mainly for legacy-code reasons.
@DaiShiResearch, thank you so much for your detailed and helpful response. I will correct the typo of TransNeXt-Base in Table 3.
Hi @DaiShiResearch , could I confirm some details?
For the results in Table 12, are they interpolation or extrapolation?
For results in Table 14, which results are extrapolation, the former or the latter number in "+MS" column?
Thank you so much!
For the results in Table 12, they are trained and inferred under the interpolation scheme. Due to the slow spatial coordinate generation in the old code, all our detection models were trained under the interpolation scheme. The reported results also use interpolation, which can be seen in the validation section of the released log files.
In the MS+ column of Table 14 for ADE20K, the reported results follow the interpolation/extrapolation order. The extrapolated performance is better, as it theoretically should be.
The single-scale mIoU results in Table 14 are the results of the linear mode. As indicated in the previous response, constraining the pool_size to 16x16 or below has no impact on the 512x512 single-resolution evaluation, since the pool_size at 512x512 is exactly 16x16; the normal mode and the linear mode are therefore equivalent in this case.
A flexible feature of TransNeXt is that after training, you can manually set the pool_size cap of the linear mode for inference to achieve any desired trade-off. For example, setting it to 12x12 or 8x8 for lower-FLOP inference is possible, but using a lower pool_size than during training can degrade performance in single-resolution evaluation.
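This trade-off can be sketched with a token-count proxy at a fixed resolution (a rough proxy for attention cost, not an exact FLOP count; the feature-map size and `sr_ratio` are illustrative assumptions):

```python
def pooled_cost(H, W, sr_ratio, pool_cap):
    """Proxy cost: query tokens x capped pooled key/value tokens."""
    h = min(H // sr_ratio, pool_cap)
    w = min(W // sr_ratio, pool_cap)
    return H * W * h * w

H, W, sr = 128, 128, 8  # illustrative stage-1 feature map at 512x512
for cap in (16, 12, 8):
    rel = pooled_cost(H, W, sr, cap) / pooled_cost(H, W, sr, 16)
    print(f"pool_cap={cap}: {rel:.2f}x the 16x16 cost")
```

Halving the pool area halves the proxy cost, which is why a smaller cap buys lower FLOPs at the price of some accuracy when it drops below the training-time pool size.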
Hi @DaiShiResearch , thank you so much for your prompt and detailed response. The Linear Complexity Mode for UperNet on ADE20K is really promising, and I would like to cite its results in our paper to compare with other models under comparable FLOPs. If convenient, could you please add the results of Linear Complexity Mode for UperNet on ADE20K in your next arXiv version so that researchers can easily find the original results?
Besides, do you plan to attend CVPR in Seattle? I am wondering whether I will have a chance to say hello to you.
Thank you for your recognition. We will consider showcasing the detailed performance of the Linear Complexity Mode on segmentation and detection tasks in future versions of our paper. Due to visa issues, I won't be able to attend CVPR in Seattle this year in person. If possible, please send me your WeChat ID via email to stay in touch.
Hi @DaiShiResearch, Thanks a lot for your awesome TransNeXt, which performs very well on various tasks. Could you kindly offer the FLOPs of detection and segmentation tasks (especially Table 12 and Table 14) so that we can cite them in our paper? Thank you so much for your help!