ZhanYang-nwpu / RSVG-pytorch

RSVG: Exploring Data and Model for Visual Grounding on Remote Sensing Data, 2022

Please explain the difference between your work and Pseudo-Q (CVPR'22) #4

Closed. linhuixiao closed this issue 1 year ago.

linhuixiao commented 1 year ago

Your work is very similar to that of Pseudo-Q (https://github.com/LeapLabTHU/Pseudo-Q). What are the differences between your work and Pseudo-Q in the following aspects:

  1. What is the essential difference between the method for constructing pseudo labels of remote sensing data described in your paper and the one in Pseudo-Q?

  2. What is the difference between MLCF (Multi-level Cross-modal Fusion) mentioned in your paper and ML-CMA (Multi-level Cross-modal Attention) in Pseudo-Q?

Thanks

ZhanYang-nwpu commented 1 year ago


  1. Regarding the difference in dataset construction: Pseudo-Q deals with the generation of pseudo labels, whereas RSVG constructs a remote sensing visual grounding dataset, which specifically involves generating textual descriptions of targets in remote sensing images. We design an automatic RS image-query generation method with manual assistance to construct real (not pseudo) labels. Please refer to my paper (https://arxiv.org/abs/2210.12634) for the specific method.

  2. Although ML-CMA and MLCF are both named "multi-level cross-modal", their starting points and algorithms are essentially different.

    • (1) Different inputs. a) The inputs of ML-CMA are multi-level visual features taken directly from the convolutional blocks inside ResNet-50 and textual features extracted directly with BERT. b) In contrast, MLCF extracts multi-scale visual features with a newly constructed visual branch (ResNet-50 plus added layers), and extracts multi-granularity textual features at both the sentence level and the word level with BERT. We denote these multi-scale visual features and multi-granularity textual features together as multi-level multi-modal features (see the first sketch after this list).
    • (2) Different approaches to multi-level cross-modal learning. a) ML-CMA computes self-attention over the image features and the textual features of each level separately, uses this attention to update the image and textual features, and concatenates the updated image and textual features into a fusion feature A; scaled dot-product attention is used to compute the attention for the image features and for the textual features respectively. The fusion features A from the different levels are then concatenated into a final fusion feature B, which is used for prediction. b) MLCF consists of L layers of cross-attention followed by N layers of self-attention. First, the multi-scale visual features and multi-granularity textual features are concatenated into multi-level multi-modal features A, and the visual features to be updated are denoted as B. The MLCF module takes A and B as input and outputs the updated visual features B, which are then concatenated with the word-level textual features and fed into the localization module for prediction. In the first stage of MLCF, A serves as the Query and Key of the cross-attention and B serves as the Value, so the multi-level multi-modal information guides the refinement of the visual features and achieves multi-level cross-modal feature learning. The second stage of MLCF uses self-attention to discover the relations between pixels of the feature map and learn more discriminative visual features (see the second sketch after this list).
    • (3) Different research motivations. a) ML-CMA: Many previous methods use only the final features of the visual and language encoders to acquire cross-modal information. This is suboptimal, since every level of visual feature carries valuable semantic information: low-level features usually encode coarse information such as shape and edge, while high-level features represent finer information such as intrinsic object properties. ML-CMA is therefore designed to thoroughly fuse textual embeddings with multi-level visual features. b) MLCF: Remote sensing images usually exhibit large scale variations and cluttered backgrounds. To deal with scale variation, the MLCF module exploits multi-scale visual features and multi-granularity textual embeddings to learn more discriminative representations; to cope with cluttered backgrounds, it adaptively filters irrelevant noise and enhances salient features, so the model can incorporate more effective multi-level and multi-modal features to boost performance. More specifically: first, unlike natural scene images, RS images are gathered from an overhead view by satellites, which leads to large scale variations and cluttered backgrounds, so models for RS tasks must consider multi-scale inputs; methods designed for natural images do not fully account for multi-scale features and therefore give suboptimal results on RS imagery. In addition, the background of an RS image contains many objects unrelated to the query, whereas natural images generally contain salient objects; without filtering redundant features, previous models struggle to understand RS image-expression pairs. We therefore design a network with multi-scale fusion and adaptive filtering to refine the visual features. Second, previous frameworks that extract visual and textual features in isolation do not conform to human perceptual habits, and such visual features lack the effective information needed for multi-modal reasoning. Motivated by the above, we address how to learn fine-grained, semantically salient image representations from multi-scale visual inputs. Based on the cross-attention mechanism, the MLCF module first uses multi-scale visual features and multi-granularity textual embeddings to guide the refinement of the visual features and achieve multi-level cross-modal feature learning. Considering that objects in an RS image are usually correlated (e.g., stadiums usually co-occur with ground track fields), MLCF then discovers the relations between object regions with a self-attention mechanism. In short, MLCF comprises multi-level cross-modal learning and self-attention learning.
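
To make point (1) concrete, here is a minimal PyTorch sketch of the multi-level multi-modal feature extraction described above. It assumes torchvision's resnet50 and Hugging Face's bert-base-uncased; the extra down-sampling layer, the choice of stages, and the projection dimensions are illustrative assumptions, not the exact code of this repository.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50
from transformers import BertModel


class MultiLevelMultiModalFeatures(nn.Module):
    """Illustrative extraction of multi-scale visual features and
    multi-granularity (word-level + sentence-level) textual features."""

    def __init__(self, hidden_dim=256):
        super().__init__()
        cnn = resnet50(weights=None)
        # keep intermediate stages so we can tap multi-scale feature maps
        self.stem = nn.Sequential(cnn.conv1, cnn.bn1, cnn.relu, cnn.maxpool)
        self.layer1, self.layer2 = cnn.layer1, cnn.layer2
        self.layer3, self.layer4 = cnn.layer3, cnn.layer4
        # extra down-sampling layer appended after ResNet-50 (an assumption,
        # standing in for the "added layers" mentioned above)
        self.extra = nn.Conv2d(2048, 2048, kernel_size=3, stride=2, padding=1)
        # project every scale to a common channel width
        self.proj = nn.ModuleList(
            nn.Conv2d(c, hidden_dim, kernel_size=1) for c in (512, 1024, 2048, 2048))
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.text_proj = nn.Linear(768, hidden_dim)

    def forward(self, image, input_ids, attention_mask):
        x = self.stem(image)
        c2 = self.layer1(x)
        c3 = self.layer2(c2)
        c4 = self.layer3(c3)
        c5 = self.layer4(c4)
        c6 = self.extra(c5)
        # flatten each scale into a token sequence of shape (B, H_i*W_i, hidden_dim)
        visual = [p(f).flatten(2).transpose(1, 2)
                  for p, f in zip(self.proj, (c3, c4, c5, c6))]
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        word_feats = self.text_proj(out.last_hidden_state)          # word-level
        sent_feat = self.text_proj(out.pooler_output).unsqueeze(1)  # sentence-level
        return visual, word_feats, sent_feat
```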
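
And here is a minimal sketch of the two-stage structure described in point (2): L cross-attention layers in which the multi-level multi-modal features A guide the refinement of the visual features B, followed by N self-attention layers. The layer counts are placeholders, and the query is taken from B here for dimensional consistency with the standard attention API, so this is an illustrative approximation rather than the exact MLCF implementation.

```python
import torch
import torch.nn as nn


class MLCFStyleFusion(nn.Module):
    """Two-stage fusion sketch: L cross-attention layers in which the
    concatenated multi-level multi-modal features A refine the visual
    features B, followed by N self-attention layers over the refined B."""

    def __init__(self, d_model=256, nhead=8, num_cross_layers=2, num_self_layers=2):
        super().__init__()
        self.cross_layers = nn.ModuleList(
            nn.MultiheadAttention(d_model, nhead, batch_first=True)
            for _ in range(num_cross_layers))
        self.cross_norms = nn.ModuleList(
            nn.LayerNorm(d_model) for _ in range(num_cross_layers))
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.self_encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_self_layers)

    def forward(self, visual_tokens, word_feats, sent_feat, multiscale_tokens):
        # A: multi-level multi-modal features (multi-scale visual tokens plus
        # word- and sentence-level text); B: the visual tokens to be refined.
        A = torch.cat([multiscale_tokens, word_feats, sent_feat], dim=1)
        B = visual_tokens
        # Stage 1: cross-attention refinement of B guided by A (the query is
        # taken from B in this sketch -- an illustrative choice).
        for attn, norm in zip(self.cross_layers, self.cross_norms):
            refined, _ = attn(query=B, key=A, value=A)
            B = norm(B + refined)
        # Stage 2: self-attention to model relations among visual tokens.
        B = self.self_encoder(B)
        return B  # refined visual features, later concatenated with word_feats
```

In this sketch, the refined B would then be concatenated with the word-level textual features and passed to the localization module, as described above.
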
ZhanYang-nwpu commented 1 year ago


Thanks a lot for your attention and kind reminder.

linhuixiao commented 1 year ago

Although your work focuses on remote sensing data, the method you use is too similar to that of Pseudo-Q, and readers may reasonably think that your ideas are borrowed from it. I suggest that you cite Pseudo-Q in the final version of your paper and discuss the differences; otherwise it may be considered plagiarism. Kind regards.