Thank you for your interest. Since the trained TransNeXt model can use various modes for inference, the FLOPs can significantly vary depending on the choice between position bias interpolation or extrapolation and whether the linear complexity inference mode is enabled. Therefore, the situation is quite complex.
For the TransNeXt model trained on the ADE20K dataset using the UPerNet method at a fixed resolution of 512x512, there is no choice between position bias interpolation and extrapolation. The FLOPs calculated using fvcore at a resolution of 512x512 are as follows:
Model | FLOPs |
---|---|
TransNeXt-Tiny | 245G |
TransNeXt-Small | 273G |
TransNeXt-Base | 319G |
When performing inference at higher resolutions, such as 512x2048, position bias needs to be either interpolated or extrapolated. The FLOPs for the extrapolation scheme are:
Model | FLOPs |
---|---|
TransNeXt-Tiny | 1029G |
TransNeXt-Small | 1187G |
TransNeXt-Base | 1401G |
For the interpolation scheme at the same resolution:
Model | FLOPs |
---|---|
TransNeXt-Tiny | 1037G |
TransNeXt-Small | 1201G |
TransNeXt-Base | 1422G |
Clearly, the interpolation scheme results in more FLOPs due to the extra bilinear interpolation operations on the position bias calculated at 800x800 resolution and interpolated to the actual resolution.
The current implementation of TransNeXt uploaded on GitHub shows that the extrapolation scheme runs much faster than the interpolation scheme, and it is recommended for future use for better model performance and faster model speed.
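The interpolation scheme described above can be sketched as follows. A learned position-bias table (here a random placeholder) sized for the 800x800 reference resolution is bilinearly resized to the grid of the actual inference resolution; the shapes and names are illustrative assumptions, not TransNeXt's actual implementation:

```python
# Hedged sketch of bilinear position-bias interpolation.
import torch
import torch.nn.functional as F

num_heads = 8
ref_h, ref_w = 25, 25   # illustrative bias grid at the 800x800 reference resolution
tgt_h, tgt_w = 16, 64   # illustrative bias grid at a 512x2048 inference resolution

bias_ref = torch.randn(num_heads, ref_h, ref_w)

# F.interpolate expects (N, C, H, W); treat the heads as channels.
bias = F.interpolate(
    bias_ref.unsqueeze(0), size=(tgt_h, tgt_w),
    mode="bilinear", align_corners=False,
).squeeze(0)

print(bias.shape)  # torch.Size([8, 16, 64])
```

These resize operations run at every attention layer, which is where the extra FLOPs of the interpolation scheme come from.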
It is worth noting that the reported FLOPs above are for the normal mode, where the model has quadratic complexity. For larger resolutions such as 512x2048, the FLOPs are significantly higher than those of pure convolution models and Mamba-series models. However, TransNeXt also has a linear-complexity inference mode. To achieve this, we only need to fix the pool size to (16, 16), the pool size used during 512x512 training. This makes the aggregated attention layers in stages 1-3 have linear complexity. The FLOPs (using the position bias extrapolation scheme) are:
Model | FLOPs |
---|---|
TransNeXt-Tiny | 978G |
TransNeXt-Small | 1089G |
TransNeXt-Base | 1268G |
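The quadratic-vs-linear behavior can be sketched with a simple token-count model, a rough proxy for attention cost (query tokens times pooled key/value tokens). The stage-1 feature-map sizes and `sr_ratio` below are illustrative assumptions:

```python
def attn_cost(H, W, sr_ratio, pool_cap=None):
    """Rough proxy for aggregated-attention cost:
    query tokens x pooled key/value tokens. Not an exact FLOP count."""
    n_q = H * W
    h_pool, w_pool = H // sr_ratio, W // sr_ratio
    if pool_cap is not None:  # linear mode: cap the pool size
        h_pool, w_pool = min(h_pool, pool_cap), min(w_pool, pool_cap)
    return n_q * h_pool * w_pool

# Illustrative stage-1 feature maps (stride 4) at 512x512 and 512x2048.
base   = attn_cost(128, 128, sr_ratio=8)               # 512x512, normal
normal = attn_cost(128, 512, sr_ratio=8)               # 512x2048, normal
linear = attn_cost(128, 512, sr_ratio=8, pool_cap=16)  # 512x2048, linear

print(normal // base)  # 16: cost grew quadratically for 4x the pixels
print(linear // base)  # 4: cost grew linearly with the pixel count
```

With the pool capped at 16x16, the key/value side stays constant, so the cost scales only with the number of query tokens.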
I conducted a simple experiment by modifying the pool size calculation to:

```python
H_pool, W_pool = min(H // sr_ratio, 16), min(W // sr_ratio, 16)
```
This slightly sacrifices the performance of MS+ inference. The multi-scale evaluation results are:
Model | MS+ (mIoU) |
---|---|
TransNeXt-Tiny | 51.2 |
TransNeXt-Small | 52.3 |
TransNeXt-Base | 53.4 |
For Mask R-CNN on the COCO dataset, using fvcore for measurements:
For inference with extrapolation at resolution 1280x800:
Model | FLOPs |
---|---|
TransNeXt-Tiny | 349G |
TransNeXt-Small | 501G |
TransNeXt-Base | 709G |
For inference with interpolation at resolution 1280x800:
Model | FLOPs |
---|---|
TransNeXt-Tiny | 356G |
TransNeXt-Small | 516G |
TransNeXt-Base | 728G |
What About Linear Complexity Mode for Mask R-CNN?
Since Mask R-CNN requires dynamic-resolution training, there is currently no scheme analogous to the ADE20K one (single-scale training followed by larger-scale linear inference). How to customize the linear inference strategy of the pretrained model to achieve better results requires further experimentation; using the linear mode during the training phase might be a better approach.
When the detection models were trained in earlier experiments, the code for generating position coordinates was still slow, so the published weights were trained with the interpolation scheme. Performance evaluated with the extrapolation scheme does not show a significant improvement; it is rather close to the interpolation scheme's performance. Theoretically, COCO detection models trained with the extrapolation scheme would outperform the currently published models. If computational resources allow in the future, we may consider updating the model weights with versions trained under the extrapolation scheme.
Overall, TransNeXt's aggregated attention can achieve both a global receptive field and linear complexity in the linear mode, similar to Mamba-type vision models, but their inherent visual priors are not the same. Therefore, if possible, I hope the FLOPs and ADE20K results under the linear inference mode can also be included in the citation, as I believe this comparison would be more interesting and more convincing. Additionally, the extrapolation scheme will be the main strategy going forward, as it provides better performance and significantly faster runtime; future updates may include models trained with this scheme for further improved accuracy. The interpolation scheme remains mainly for legacy-code reasons.
@DaiShiResearch, thank you so much for your detailed and helpful response. I will correct the typo of TransNeXt-Base in Table 3.
Hi @DaiShiResearch , could I confirm some details?
For the results in Table 12, are they interpolation or extrapolation?
For results in Table 14, which results are extrapolation, the former or the latter number in "+MS" column?
Thank you so much!
For the results in Table 12, they are trained and inferred under the interpolation scheme. Due to the slow spatial coordinate generation in the old code, all our detection models were trained under the interpolation scheme. The reported results also use interpolation, which can be seen in the validation section of the released log files.
In the MS+ column of Table 14 for ADE20K, the reported results follow the interpolation/extrapolation order. The extrapolated performance is better, as it theoretically should be.
The single-scale mIoU results in Table 14 are the results of the linear mode. As indicated in the previous response, constraining the pool_size to 16x16 or below has no impact on the 512x512 single-resolution evaluation, since the pool_size at 512x512 is exactly 16x16; the normal mode and the linear mode are therefore equivalent in this case.
A flexible feature of TransNeXt is that after training, you can manually set the pool_size cap of the linear mode for inference to achieve any desired trade-off. For example, setting it to 12x12 or 8x8 for lower-FLOP inference is possible, but using a lower pool_size than during training can degrade performance in single-resolution evaluation.
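This trade-off can be sketched with a token-count proxy at a fixed resolution (a rough proxy for attention cost, not an exact FLOP count; the feature-map size and `sr_ratio` are illustrative assumptions):

```python
def pooled_cost(H, W, sr_ratio, pool_cap):
    """Proxy cost: query tokens x capped pooled key/value tokens."""
    h = min(H // sr_ratio, pool_cap)
    w = min(W // sr_ratio, pool_cap)
    return H * W * h * w

H, W, sr = 128, 128, 8  # illustrative stage-1 feature map at 512x512
for cap in (16, 12, 8):
    rel = pooled_cost(H, W, sr, cap) / pooled_cost(H, W, sr, 16)
    print(f"pool_cap={cap}: {rel:.2f}x the 16x16 cost")
```

Halving the pool area halves the proxy cost, which is why a smaller cap buys lower FLOPs at the price of some accuracy when it drops below the training-time pool size.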
Hi @DaiShiResearch , thank you so much for your prompt and detailed response. The Linear Complexity Mode for UperNet on ADE20K is really promising, and I would like to cite its results in our paper to compare with other models under comparable FLOPs. If convenient, could you please add the results of Linear Complexity Mode for UperNet on ADE20K in your next arXiv version so that researchers can easily find the original results?
Besides, do you plan to attend CVPR in Seattle? I am wondering whether I will have a chance to say hello to you.
Thank you for your recognition. We will consider showcasing the detailed performance of the Linear Complexity Mode on segmentation and detection tasks in future versions of our paper. Due to visa issues, I won't be able to attend CVPR in Seattle this year in person. If possible, please send me your WeChat ID via email to stay in touch.
Hi @DaiShiResearch, Thanks a lot for your awesome TransNeXt, which performs very well on various tasks. Could you kindly offer the FLOPs of detection and segmentation tasks (especially Table 12 and Table 14) so that we can cite them in our paper? Thank you so much for your help!