easton-cau / SOTR

SOTR: Segmenting Objects with Transformers
MIT License

Is there any analysis on why APs so low? #11

Closed lucasjinreal closed 2 years ago

lucasjinreal commented 2 years ago

Noticed that SOTR's APs is low (10.7 vs 20 for Mask R-CNN). Is there any reason for this? Just wondering, because if APs could be boosted, the overall AP could be even higher.

easton-cau commented 2 years ago

There are several potential reasons:

1. The feature maps we feed to the transformer are divided into a uniform N*N grid, and each grid cell predicts only one object instance, which may miss some small objects.
2. The transformer is better at building long-range dependencies and capturing global features, which leads to excellent performance on larger objects, but it neglects small objects and local information to a certain extent.
3. The relatively low-resolution P5 feature maps with positional information are obtained from the transformer module and combined with P2-P4 in the FPN to generate the final masks, making it harder for the model to segment small objects precisely.
4. Finally, APs will be poor without using bbox information.
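To make the first point concrete, here is a minimal, hypothetical sketch (not SOTR's actual code) of why a uniform N*N grid with one instance per cell can drop small objects: two nearby small objects whose centers land in the same cell collide, and only one survives.

```python
# Illustrative sketch (assumed toy setup, not the SOTR implementation):
# assign object centers to an N*N grid where each cell can hold at most
# one instance, so colliding small objects are lost.

def assign_to_grid(centers, image_size, n=5):
    """Map (x, y) object centers to N*N grid cells.

    A later object whose center falls in an already-occupied cell is
    recorded as lost, mimicking one-instance-per-cell prediction.
    """
    cell = image_size / n
    grid = {}   # cell index -> kept object center
    lost = []   # centers that collided with an occupied cell
    for cx, cy in centers:
        key = (int(cx // cell), int(cy // cell))
        if key in grid:
            lost.append((cx, cy))  # collision: this instance is missed
        else:
            grid[key] = (cx, cy)
    return grid, lost

# Two small objects ~10 px apart in a 500*500 image share one 100*100 cell:
kept, lost = assign_to_grid([(120, 130), (128, 136)], image_size=500, n=5)
print(len(kept), len(lost))  # -> 1 1: the second small object is dropped
```

Large objects are far less affected, since their centers rarely share a cell; that asymmetry is one source of the gap between APs and APl.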

We expect that future work will improve this aspect. I believe this answers your question, so I'm closing the issue, but let us know if you have any further questions.