fredzzhang / pvic

[ICCV'23] Official PyTorch implementation for paper "Exploring Predicate Visual Context in Detecting Human-Object Interactions"

Zero-shot inference for unseen verbs #43

Closed hutuo1213 closed 10 months ago

hutuo1213 commented 10 months ago

Hi, thank you for your excellent work. We would like to try zero-shot inference. In the unseen-verb (UV) setting, all labels for unseen actions are removed during training.

Suppose we use 117 verb text embeddings from the CLIP text encoder, with shape [117, 512]. During training, the valid object-action list excludes interactions involving unseen verbs. So, does training use only the seen verb texts (97), or still all 117?

Our understanding is that, since unseen actions (UVs) never appear in training, they behave the same as invalid object-action combinations, so backpropagation does not update them (their gradient is zero). Is this understanding correct?

    def compute_classification_loss(self, logits, prior, labels):
        prior = torch.cat(prior, dim=0).prod(1)
        # Only pairs with a non-zero prior (i.e. valid object-verb combinations)
        # are kept below; in backpropagation the logits of invalid verbs receive
        # a zero gradient, so the corresponding parameters are not updated.
        x, y = torch.nonzero(prior).unbind(1)

        logits = logits[:, x, y]
        prior = prior[x, y]
        labels = labels[None, x, y].repeat(len(logits), 1)
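
To make the reasoning concrete, here is a minimal toy check (the shapes are illustrative, not the repo's actual ones) showing that logits filtered out by the non-zero prior mask indeed receive a zero gradient:

    import torch

    # Toy tensors: logits [layers, pairs, verbs], prior [pairs, verbs].
    logits = torch.randn(2, 3, 4, requires_grad=True)
    prior = torch.rand(3, 4)
    prior[:, -1] = 0          # pretend the last verb is unseen / invalid

    # Keep only pair-verb entries with a non-zero prior, as in the loss above.
    x, y = torch.nonzero(prior).unbind(1)
    loss = logits[:, x, y].sum()
    loss.backward()

    print(logits.grad[:, :, -1].abs().sum())   # tensor(0.) -- no update for that verb
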
fredzzhang commented 10 months ago

Hi @yaoyaosanqi,

The simplest way to achieve what you want is probably to modify the function that computes the prior scores, i.e., the product of object detection scores. Zero out the returned prior scores wherever the verb is supposed to be unseen, and the corresponding action logits will be ignored automatically.
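
As a rough sketch (the helper name, tensor shape, and unseen verb indices below are placeholders, not the exact API in this repo), assuming the per-pair prior has shape [n_pairs, 2, num_verbs]:

    import torch

    UNSEEN_VERB_IDS = [4, 17, 23]   # hypothetical indices of held-out verbs

    def mask_unseen_verbs(prior: torch.Tensor) -> torch.Tensor:
        # prior: [n_pairs, 2, num_verbs]; the loss later takes prod over dim 1.
        prior = prior.clone()
        prior[..., UNSEEN_VERB_IDS] = 0   # zero prior => entry dropped by torch.nonzero
        return prior

With the prior zeroed for those verbs, torch.nonzero(prior) in compute_classification_loss discards the corresponding entries, so their logits contribute nothing to the loss and receive no gradient.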

Fred.

hutuo1213 commented 10 months ago

Thanks!