bo-miao / HTR

Towards Temporally Consistent Referring Video Object Segmentation
MIT License

Results of A2D #3

Closed · Solo4working closed this issue 1 month ago

Solo4working commented 2 months ago

Thank you for such insightful work~

In my opinion, the main contribution of this paper is how the reference mask is used to refine the masks of the remaining frames, while the visual backbone and segmentation parts are similar to ReferFormer. What makes me curious is why it can achieve performance comparable to SgMg on the A2D dataset.

SgMg has a very detailed model design, while HTR only designs the mask-refinement stage. According to the settings of previous work, there should be only one gt_mask per sample on the A2D dataset, which would make HTR's design ineffective on A2D.

In other words, I am very curious why HTR can achieve such excellent results on the A2D dataset. Did you modify the model in some way, or did you change the gt_mask loading design of the A2D dataset?

bo-miao commented 2 months ago


A2D has pixel-level masks for sampled frames.

Solo4working commented 2 months ago

Thanks for your reply.

I know A2D has pixel-level masks for sampled frames, but according to previous settings such as ReferFormer, they often sample only one gt_mask as supervision. The relevant code is here.

I want to know whether you changed the way the A2D data is loaded so that it can be trained like the YouTube dataset, i.e., every annotated frame's gt_mask is used for supervision, without sampling?
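To make the distinction concrete, here is a minimal sketch of the two supervision settings being discussed. This is not the actual ReferFormer/HTR code; the names `frames`, `masks`, and `annotated` are all illustrative.

```python
import random

def sample_single_annotated(frames, masks, annotated):
    """ReferFormer-style A2D setting: supervise one randomly chosen annotated frame."""
    idx = random.choice([i for i, has_gt in enumerate(annotated) if has_gt])
    return frames, {idx: masks[idx]}  # loss is computed on this frame only

def use_all_annotated(frames, masks, annotated):
    """YouTube-VOS-style setting: supervise every frame that has a ground-truth mask."""
    targets = {i: masks[i] for i, has_gt in enumerate(annotated) if has_gt}
    return frames, targets  # loss is computed on all annotated frames
```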

bo-miao commented 2 months ago

I tried both versions (sampling a single annotated frame and sampling multiple annotated frames); both perform temporal propagation and supervision with the annotated masks. Sorry, I do not remember which version was used, as this was done 1.5 years ago.
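For readers following along, a minimal sketch of what "supervision with annotated masks" after propagation could look like, assuming a soft Dice loss and dict-based inputs; this is an illustration, not the repository's implementation.

```python
import torch

def dice_loss(pred_logits, target, eps=1.0):
    """Soft Dice loss between predicted mask logits and a binary target mask."""
    pred = pred_logits.sigmoid().flatten(1)
    target = target.flatten(1)
    inter = (pred * target).sum(-1)
    union = pred.sum(-1) + target.sum(-1)
    return (1.0 - (2.0 * inter + eps) / (union + eps)).mean()

def loss_on_annotated_frames(propagated_logits, gt_masks):
    """Average the mask loss over only the frames that carry annotations.

    propagated_logits: dict frame_idx -> (N, H, W) mask logits after propagation
    gt_masks:          dict frame_idx -> (N, H, W) binary ground-truth masks
    """
    losses = [dice_loss(propagated_logits[i], gt_masks[i]) for i in gt_masks]
    return torch.stack(losses).mean()
```

Under this scheme, the single-frame and multi-frame settings differ only in how many entries `gt_masks` contains; the propagation and loss machinery is identical, which matches the reply above.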