Closed · Big-Brother-Pikachu closed this issue 2 years ago
Hello, sorry for the delay, as I'm currently on holiday. I'll need to elaborate on this in the paper for sure (for submission to *ACL or ICASSP), but yes, it's different from the standard evaluation protocol, where we're not given multiple masks, although that may be mildly implicit in the per-pixel probability distribution across the labels.
If we want to evaluate IoU on a multi-object, multi-noun image, we could probably take the per-pixel argmax of the attention values across the various nouns. I imagine a similar approach already exists for using CLIP that way, which could serve as inspiration. To take it one step further, we could add a CRF with some segmentation kernels, though that might be entering the realm of strong supervision.
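Roughly, the argmax idea would look something like this (a minimal sketch; the dictionary of per-noun heat maps, the [0, 1] normalization, and the background threshold are my assumptions here, not anything DAAM currently implements):

```python
import numpy as np

def attention_argmax(heat_maps: dict[str, np.ndarray], bg_threshold: float = 0.4) -> np.ndarray:
    """Turn per-noun attention maps into a single per-pixel label map.

    heat_maps: noun -> (H, W) attention scores, each normalized to [0, 1].
    Pixels whose best score falls below `bg_threshold` become background (0);
    class i + 1 corresponds to the i-th noun in sorted order.
    """
    nouns = sorted(heat_maps)
    stack = np.stack([heat_maps[n] for n in nouns])  # (K, H, W)
    labels = stack.argmax(axis=0) + 1                # one noun per pixel, 1..K
    labels[stack.max(axis=0) < bg_threshold] = 0     # weak attention everywhere -> background
    return labels
```

The threshold is the fiddly part: without a background class, a pure argmax would assign every pixel to some noun.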
I also saw some work applying diffusion models to segmentation. I haven't had the chance to go through this paper thoroughly, but it's definitely related.
Hi, thanks for your careful reply. We have done nearly the same thing over the last month: we evaluated the method under the weakly supervised semantic segmentation protocol on PASCAL VOC 2012 (real images), directly comparing the attention values for the different classes with the prompt "a photo of the ...". The results are surprisingly good compared to CAM-based methods such as Cross Language Image Matching for Weakly Supervised Semantic Segmentation (we improve by nearly 8 points of mIoU). However, we have some trouble when refining the initial results, which we think is due to the direct-comparison argmax procedure (the details are messy and hard to explain here). Maybe we can discuss this together if you have time, since I see you plan to work on real images in future work. I hope there will be more interaction if you are interested in our progress!
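For reference, the CRF refinement mentioned above would look roughly like this with pydensecrf (a generic sketch, not our exact procedure; it assumes the attention maps have already been turned into per-class probabilities with a background channel):

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(image: np.ndarray, probs: np.ndarray, n_iters: int = 10) -> np.ndarray:
    """Refine per-class probability maps with a dense CRF.

    image: (H, W, 3) uint8 RGB image.
    probs: (K, H, W) float32 per-class probabilities (background + nouns),
           summing to 1 at every pixel.
    Returns an (H, W) integer label map.
    """
    h, w = image.shape[:2]
    k = probs.shape[0]
    d = dcrf.DenseCRF2D(w, h, k)
    d.setUnaryEnergy(unary_from_softmax(probs))
    # Smoothness kernel (positions only) and appearance kernel (positions + color).
    d.addPairwiseGaussian(sxy=3, compat=3)
    d.addPairwiseBilateral(sxy=60, srgb=10, rgbim=np.ascontiguousarray(image), compat=5)
    q = d.inference(n_iters)
    return np.asarray(q).argmax(axis=0).reshape(h, w)
```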
Happy to talk about your progress!
I have sent you some short slides about our work via email.
Hi, thanks for sharing this wonderful work! I have some questions regarding multi-object images. As I see in the code, https://github.com/castorini/daam/blob/299de093b357adf84898852c9942d8866f46fdd6/daam/run/evaluate.py#L157-L171, you compute a binary mask for each object. Am I correct? If so, I think this is a little different from the standard segmentation evaluation protocol, where we compute a label for each pixel of an image. So if we want to evaluate IoU on a multi-object image, what should DAAM do? What if one pixel activates for several nouns? Can we compare the attention values for different nouns directly? Looking forward to your reply, thanks.
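To be concrete, by the standard protocol I mean the usual VOC-style per-pixel mIoU, roughly like this (a sketch, not DAAM's evaluation code):

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, n_classes: int = 21) -> float:
    """Per-pixel protocol: every pixel gets exactly one label.

    pred, gt: (H, W) integer label maps. PASCAL VOC has 21 classes
    (background + 20 objects) and marks void pixels with 255.
    """
    valid = gt != 255  # ignore void pixels
    hist = np.bincount(n_classes * gt[valid] + pred[valid],
                       minlength=n_classes ** 2).reshape(n_classes, n_classes)
    inter = np.diag(hist).astype(np.float64)
    union = hist.sum(0) + hist.sum(1) - np.diag(hist)
    return float(np.nanmean(inter / np.where(union > 0, union, np.nan)))
```

(For dataset-level mIoU, one would accumulate `hist` over all images before averaging.) This requires a single label per pixel, which is why I'm asking how the per-noun binary masks should be combined.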