Closed · Solo4working closed 1 month ago
A2D has pixel-level masks for sampled frames.
Thanks for your reply.
I know A2D has pixel-level masks for sampled frames, but under the settings of previous work such as ReferFormer, they typically sample one gt_mask per clip as supervision. The relevant code is here.
I want to know whether you changed the way A2D data is loaded so that it can be trained like the YouTube dataset, i.e., every frame is supervised with a gt_mask rather than sampling a single one?
I tried both versions (sampling a single annotated frame and sampling multiple annotated frames); both perform temporal propagation and are supervised with the annotated masks. Sorry, I do not remember which version was used, as this was done 1.5 years ago.
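For concreteness, the two loading strategies discussed above could be sketched as follows. This is a hypothetical helper, not the actual HTR or ReferFormer code; the function name and frame indices are illustrative only:

```python
import random

def sample_supervision_frames(annotated_idx, mode="multiple", seed=None):
    """Pick which annotated frames of an A2D clip supervise the loss.

    annotated_idx: indices of the frames that carry ground-truth masks
                   (A2D annotates only a few frames per clip).
    mode="single":   ReferFormer-style, sample one annotated frame.
    mode="multiple": supervise every annotated frame, analogous to
                     per-frame supervision on the YouTube dataset.
    """
    rng = random.Random(seed)
    if mode == "single":
        return [rng.choice(annotated_idx)]
    return list(annotated_idx)

# Example: a clip whose annotated frames are 3, 16, and 29.
annotated = [3, 16, 29]
print(sample_supervision_frames(annotated, mode="single", seed=0))  # one frame
print(sample_supervision_frames(annotated, mode="multiple"))        # all three
```

In both modes the predicted masks for the remaining frames would come from temporal propagation, and the loss is computed only on the returned frames.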
Thank you for such insightful work~
In my opinion, the main contribution of this paper is how the reference mask is used to optimize the masks at the remaining timestamps, while the visual backbone and segmentation head are similar to ReferFormer's. What makes me curious is why it achieves performance comparable to SgMg on the A2D dataset.
SgMg has a very detailed model design, whereas HTR only designs the mask refinement. Following the settings of previous work, there should be only one gt_mask per sample on the A2D dataset, which would make the design of HTR ineffective there. In other words, I am very curious why HTR achieves such excellent results on A2D. Did you change the model in some way, or did you change the gt_mask loading design for the A2D dataset?