NVlabs / DiscoBox

The Official PyTorch Implementation of DiscoBox.
https://arxiv.org/abs/2105.06464

Some questions regarding the DiscoBox paper #18

Open tianyufang1958 opened 1 year ago

tianyufang1958 commented 1 year ago

Thanks for the nice work; it runs smoothly with Docker. I have some questions about the paper and hope you can give me some help.

  1. You mention both YOLACT++ and SOLOv2, but it is not clear which one is used in Figure 2. Does this mean DiscoBox can use either of them, as long as the framework has a mask head?
  2. In Section 3.2, f_i and f_k represent the RoI features of pixels i and k. Could you please clarify what RoI means here? Is it the mask area within the bounding box? Also, what are the features: RGB values and spatial information?
  3. For the structured teacher, Tc denotes the cross-image potentials. Does this mean comparing one bounding box against all other bboxes with the same label in other images?
  4. For the self-ensembling loss, you mention that self-consistency between the task and teacher networks is computed, but I am not clear how Lnce works. Does it compare the mask features within bounding boxes across the teacher and task networks?
  5. In the structured teacher, the Gibbs energy is defined with unary potentials, pairwise potentials, and cross-image pairwise potentials, but in the learning section the loss function does not contain them. So my question is: how does learning relate to this and minimise the energy? Apologies, I am not very familiar with standard mean-field inference.
Chrisding commented 1 year ago

Hi @tianyufang1958

  1. Figure 2 is meant to be a higher-level abstraction covering both YOLACT++ and SOLOv2. The two methods are fairly similar at a high level (both adopt a two-branch-style architecture) despite their detailed differences.
  2. RoI means region of interest, i.e. the bounding box region of an object. The features come from the FPN feature map used in DiscoBox; the pixel colors are denoted "I" in the paper instead. Please also refer to the attached image for details, and see the RoI-feature sketch after this list.
  3. Yes, it is with other bboxes of the same label. But because there are too many of them and we do not want to recompute their features, we store RoI features in a constantly updated memory bank and directly retrieve a subset of them from the bank whenever we need to form an intra-class pair (see the memory-bank sketch below).
  4. Lnce basically uses the dense correspondences to obtain positive and negative pairs of features from two bboxes with the same label, pulls together the features of positive pairs, and pushes apart the features of negative pairs (see the contrastive-loss sketch below).
  5. Inference and learning are separate. Inference is only responsible for creating the structured teacher by minimizing the Gibbs energy; minimizing the energy yields the structurally refined masks and the dense correspondences. These are then used in learning, which only covers the teacher-student distillation part (see the mean-field sketch below).

[attached image]
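
To make point 2 concrete, here is a minimal sketch (not the actual DiscoBox code; the feature-map shape, stride, and RoI size are placeholders) of extracting per-location RoI features from an FPN level with torchvision's `roi_align`:

```python
import torch
from torchvision.ops import roi_align

# Placeholder FPN feature map: (batch, channels, H, W), assumed stride 8 w.r.t. the input image.
fpn_feat = torch.randn(1, 256, 100, 152)

# One box in image coordinates: (batch_index, x1, y1, x2, y2).
boxes = torch.tensor([[0, 32.0, 48.0, 256.0, 320.0]])

# Crop and resample the box region into a fixed 28x28 grid of RoI features.
# Each of the 28*28 locations then gives a 256-d feature vector, i.e. an "RoI feature of pixel i".
roi_feats = roi_align(fpn_feat, boxes, output_size=(28, 28),
                      spatial_scale=1.0 / 8, aligned=True)
print(roi_feats.shape)  # torch.Size([1, 256, 28, 28])
```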
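For point 3, a toy per-class memory bank could look like the following; the class name, capacity, and sampling logic are only illustrative, not the actual implementation:

```python
import random
from collections import defaultdict, deque

import torch

class RoIMemoryBank:
    """Toy per-class memory bank of RoI feature tensors (illustrative only)."""

    def __init__(self, max_per_class=100):
        # Oldest entries are dropped automatically, so the bank is constantly updated.
        self.bank = defaultdict(lambda: deque(maxlen=max_per_class))

    def push(self, label, roi_feat):
        # Detach so stored features do not keep old computation graphs alive.
        self.bank[label].append(roi_feat.detach())

    def sample(self, label, k=4):
        # Retrieve a subset of stored RoIs with the same label to form intra-class pairs.
        stored = list(self.bank[label])
        return random.sample(stored, min(k, len(stored)))

# Usage: after computing RoI features for a box labelled "dog", store them and retrieve peers later.
bank = RoIMemoryBank(max_per_class=100)
bank.push("dog", torch.randn(256, 28, 28))
partners = bank.sample("dog", k=2)
```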
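For point 4, a simplified InfoNCE-style loss over correspondence-derived pairs might look like this; the function name, shapes, and temperature are made up for illustration and this is not the exact Lnce in the repo:

```python
import torch
import torch.nn.functional as F

def nce_from_correspondences(feat_a, feat_b, pos_idx, temperature=0.1):
    """
    Toy InfoNCE-style loss (illustrative, not the exact DiscoBox Lnce).
    feat_a:  (N, C) RoI features from the first box
    feat_b:  (M, C) RoI features from the second box (same category)
    pos_idx: (N,) index into feat_b giving the corresponding (positive) feature for each row of feat_a
    """
    a = F.normalize(feat_a, dim=1)
    b = F.normalize(feat_b, dim=1)
    logits = a @ b.t() / temperature  # (N, M) similarities: one positive per row, the rest negatives
    # Cross-entropy against the correspondence index pulls positive pairs together
    # and pushes the remaining (negative) pairs apart.
    return F.cross_entropy(logits, pos_idx)

# Example with 28*28 = 784 RoI locations and 256-d features; the correspondences are placeholders.
loss = nce_from_correspondences(torch.randn(784, 256), torch.randn(784, 256),
                                torch.randint(0, 784, (784,)))
```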
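For point 5, the rough shape of a mean-field refinement step is sketched below. It is heavily simplified: two labels, a single generic pairwise affinity, and no cross-image term or the specific kernels used in the paper:

```python
import torch

def mean_field_refine(unary, affinity, pairwise_weight=1.0, num_iters=5):
    """
    Heavily simplified mean-field update for a two-label (fg/bg) Gibbs energy
    with a unary term plus one pairwise term (illustrative only).
    unary:    (N, 2) per-pixel negative log-probabilities, e.g. from the predicted masks
    affinity: (N, N) pairwise similarity between pixels (e.g. a color/position kernel)
    Returns refined per-pixel probabilities of shape (N, 2).
    """
    q = torch.softmax(-unary, dim=1)                  # initialize with unary-only beliefs
    for _ in range(num_iters):
        message = affinity @ q                        # aggregate neighbours' current beliefs
        energy = unary - pairwise_weight * message    # energy drops where neighbours agree
        q = torch.softmax(-energy, dim=1)             # renormalize into a distribution
    return q                                          # refined masks used as the teaching target
```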
tianyufang1958 commented 1 year ago

@Chrisding Could you please also let me know what dense correspondence means here?