alirezazareian / ovr-cnn

A new framework for open-vocabulary object detection, based on maskrcnn-benchmark
MIT License

Questions about Table 1 result #14

Closed · yechenzhi closed this 2 years ago

yechenzhi commented 2 years ago

❓ Questions and Help

Under your setting, for a photo containing both seen and unseen objects, you only remove the unseen objects' annotations, whereas under the previous zero-shot-detection setting, every image that contains unseen objects is removed entirely. So under your setting the training set contains more images, and your model can see unseen objects, just not their annotations. My question is about the Table 1 results: did you re-implement the ZSD methods you compare against under your setting?
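
For readers unfamiliar with the distinction, here is a minimal sketch (not the repo's actual data pipeline) contrasting the two training-set constructions on COCO-style annotations; the file name and `UNSEEN_IDS` are hypothetical:

```python
import json

UNSEEN_IDS = {5, 17, 23}  # hypothetical unseen category ids

with open("instances_train.json") as f:
    coco = json.load(f)

# Open-vocabulary setting: keep every image, drop only the unseen annotations.
ovr_anns = [a for a in coco["annotations"]
            if a["category_id"] not in UNSEEN_IDS]

# Classic ZSD setting: drop every image that contains any unseen object.
tainted = {a["image_id"] for a in coco["annotations"]
           if a["category_id"] in UNSEEN_IDS}
zsd_images = [im for im in coco["images"] if im["id"] not in tainted]
zsd_anns = [a for a in ovr_anns if a["image_id"] not in tainted]
```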

alirezazareian commented 2 years ago

The problem is that, in order to remove images that contain unseen objects, we would need annotations for the unseen objects! In real-world settings, we don't know what other objects exist in each training image, so we cannot discard the images that contain them. This is not only a more realistic setting, it is actually a more difficult one, because the model may learn to classify all other objects that appear as background, making it harder to generalize to those classes. That is exactly why we had to multiply the loss of the background class by a small weight (alpha) and tune it. We therefore believe our setting is the correct way to conduct zero-shot learning experiments.
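
In case it helps, a minimal sketch of that background down-weighting idea, assuming class 0 is background and `alpha` is the tuned weight (the actual loss in this repo may be structured differently):

```python
import torch
import torch.nn.functional as F

def box_cls_loss(logits, labels, num_classes, alpha=0.2):
    """Classification loss with the background class down-weighted.

    Proposals overlapping unlabeled (possibly unseen) objects are thereby
    pushed less hard toward the background class. `alpha=0.2` is an
    illustrative default, not the paper's tuned value.
    """
    weights = torch.ones(num_classes, device=logits.device)
    weights[0] = alpha  # class 0 assumed to be background
    return F.cross_entropy(logits, labels, weight=weights)
```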

Nevertheless, it is true that some other works remove images containing unseen classes from their training data, and hence that those models are trained on less data, so comparing directly against their numbers is not entirely fair. We tried our best to make fair comparisons at least in our ablations, but it was not feasible to re-implement every baseline under exactly the same settings as ours. That said, the substantial performance improvement of our method is very unlikely to be due to the larger training set alone.