Verg-Avesta / CounTR

CounTR: Transformer-based Generalised Visual Counting
https://verg-avesta.github.io/CounTR_Webpage/
MIT License

Zero count with this image #16

Closed jaideep11061982 closed 1 year ago

jaideep11061982 commented 1 year ago

Any reason why the model would return a zero count with this image? I marked some boxes on each row of the shelf and passed them along as exemplars. @Verg-Avesta https://drive.google.com/file/d/1LHE8nzhVNk_e7gteL9SvBxG8TBQYcgt3/view?usp=share_link

Whereas if I pass the same image one shelf at a time, I get counts in line with the model's expected performance.

Verg-Avesta commented 1 year ago

The model is based on self-similarity of the images, so it will be confused if the exemplars are very different.

jaideep11061982 commented 1 year ago

@Verg-Avesta If we fine-tune on such images, will it be able to work? In that case, what should the density map be for such images? Binary only?
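For reference, my understanding of the standard density-map construction for counting (a generic sketch; your pipeline may differ) is that each annotated point is smoothed with a Gaussian, so the map sums to the count rather than being binary:

```python
# Generic sketch of a counting density map built from point annotations
# (not necessarily CounTR's exact pipeline): one Gaussian blob per object
# centre, so density.sum() recovers the object count.
import numpy as np
from scipy.ndimage import gaussian_filter

def make_density_map(points, h, w, sigma=4.0):
    """points: list of (x, y) object centres; returns an (h, w) float map."""
    density = np.zeros((h, w), dtype=np.float32)
    for x, y in points:
        xi, yi = min(int(x), w - 1), min(int(y), h - 1)
        density[yi, xi] = 1.0
    # Gaussian smoothing preserves the total mass,
    # so density.sum() stays approximately len(points).
    return gaussian_filter(density, sigma=sigma)

# Usage: the count is recovered by summing the (predicted) map
# count = make_density_map([(10, 12), (40, 12)], h=64, w=64).sum()
```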

Verg-Avesta commented 1 year ago

No, it won't work. It's against the basic assumption of our model.

jaideep11061982 commented 1 year ago

@Verg-Avesta 1) Could you state that assumption once again? I read the first paragraph of your paper: "In this paper, we consider the problem of generalised visual object counting, with the goal of developing a computational model for counting the number of objects from arbitrary semantic categories, using arbitrary number of “exemplars”, i.e. zero-shot or few-shot counting. To this end, we make the following four contributions: (1) We introduce a novel transformer-based architecture for generalised visual object counting, termed as Counting TRansformer (CounTR), which explicitly captures the similarity between image patches or with given “exemplars” using the attention mechanism;"

I am unable to figure out which assumption the first image violates. At first I thought a data-domain difference might be causing the issue. In one of our earlier discussions you mentioned that the model is robust to the class/shape/color of the objects being counted, so I am just trying to understand.

2) Interestingly, this image, which is cropped from the first shelf of the same image showing the full rack, gives a count of 7. I am just wondering what is working as per the assumption here, if you could help me understand: https://drive.google.com/file/d/1P2I-L9t5cJUi5X04nWGb0OkBWDpbSTI5/view?usp=share_link

I highly appreciate you taking the time to answer all these questions so quickly.

Verg-Avesta commented 1 year ago

The basic assumption is that the images processed should have strong self-similarity. Therefore, we can use the attention mechanism to capture such self-similarity and count the objects. If the exemplars vary too much, the model will be confused about which object to count.
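Conceptually, the mechanism behaves something like the following simplified sketch (not our actual implementation; all names are illustrative): image patch tokens attend to exemplar tokens, so the attention weights directly encode how similar each patch is to the exemplars.

```python
# Simplified sketch of similarity-via-attention (illustrative only, not
# the actual CounTR code): each image patch attends to the exemplar
# tokens, so the attention map is high wherever a patch resembles one.
import torch
import torch.nn.functional as F

def exemplar_attention(patch_tokens, exemplar_tokens):
    """
    patch_tokens:    (N, D) features of N image patches
    exemplar_tokens: (M, D) features of M exemplar crops
    Returns attended features (N, D) and the (N, M) similarity weights.
    """
    d = patch_tokens.shape[-1]
    scores = patch_tokens @ exemplar_tokens.T / d ** 0.5  # (N, M)
    weights = F.softmax(scores, dim=-1)
    return weights @ exemplar_tokens, weights

patches = torch.randn(196, 512)    # e.g. 14x14 grid of ViT patch tokens
exemplars = torch.randn(3, 512)    # 3 exemplar crops
feats, sim = exemplar_attention(patches, exemplars)
# If the M exemplars depict very different objects, the softmax spreads
# its mass over inconsistent "targets", and the downstream density head
# receives a muddled similarity signal.
```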

At that time, I said that the model "is robust to small differences in color, shape, and size", and this robustness comes from the data augmentation we used in fine-tuning. It covers small differences in the objects' appearance, such as color and shape; the model is definitely not robust to a change in the objects' class.
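To illustrate, the augmentation I am referring to is of roughly this kind (a hypothetical torchvision example, not our exact fine-tuning recipe):

```python
# Hypothetical appearance augmentation of the kind that gives robustness
# to small color/shape differences (not our exact fine-tuning recipe).
import torchvision.transforms as T

augment = T.Compose([
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05),
    T.RandomResizedCrop(size=384, scale=(0.8, 1.0)),  # mild scale jitter
    T.GaussianBlur(kernel_size=3),
])
# Note: geometric transforms like the crop must be applied to the density
# map as well, since they change which objects remain in the image.
# Augmentation perturbs appearance *within* a class; it cannot make the
# model treat different classes as the same counting target.
```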

For the image of the first shelf, I think it's just a coincidence. The model's results will be unreliable if the exemplars do not all belong to the same category.