Closed hutuo1213 closed 10 months ago
Hi @yaoyaosanqi,
The simplest way to achieve what you want is probably to modify the function that computes the prior scores, i.e., the product of the object detection scores. You can zero out the returned prior scores wherever the verb is supposed to be unseen; the corresponding action logits will then be ignored automatically.
Fred.
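The suggestion above could be sketched roughly as follows. This is a minimal NumPy illustration, not the repository's actual code: the function name `masked_prior_scores`, its arguments, the example scores, and the unseen verb indices are all hypothetical, and it assumes the final interaction score is the prior multiplied by the sigmoid of the action logits.

```python
import numpy as np

NUM_VERBS = 117  # verb vocabulary size mentioned in this thread


def masked_prior_scores(subject_scores, object_scores, valid_verbs, unseen_verbs):
    """Hypothetical helper (not the repository's function).

    subject_scores, object_scores: (N,) detection scores for each human-object pair.
    valid_verbs: (N, NUM_VERBS) boolean mask of verbs that are valid for each pair.
    unseen_verbs: verb indices whose prior should be zeroed out.
    """
    # Prior = product of the two detection scores, kept only for valid verbs.
    prior = (subject_scores * object_scores)[:, None] * valid_verbs.astype(np.float64)
    # Zero the prior for the chosen verbs so their logits cannot contribute.
    idx = np.asarray(list(unseen_verbs), dtype=np.intp)
    prior[:, idx] = 0.0
    return prior


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


# Toy example with two human-object pairs; all numbers are made up.
subject_scores = np.array([0.9, 0.8])
object_scores = np.array([0.7, 0.6])
valid_verbs = np.ones((2, NUM_VERBS), dtype=bool)
unseen_verbs = [3, 10, 42]  # assumed indices, for illustration only

prior = masked_prior_scores(subject_scores, object_scores, valid_verbs, unseen_verbs)
logits = np.random.default_rng(0).normal(size=(2, NUM_VERBS))
scores = prior * sigmoid(logits)  # assumed score composition
```

With this composition, every column listed in `unseen_verbs` has a score of exactly zero regardless of its logit, while the remaining verbs keep their usual prior.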
Thanks!
Hi, thank you for your excellent work. We would like to try zero-shot inference. For the unseen-action (UV) setting, all unseen action labels are removed from training.
Suppose we use CLIP text embeddings for the 117 verbs, with shape [117, 512]. During training, the list of valid object-action pairs excludes interactions with unseen actions. So does training use only the 97 seen verb texts, or still all 117?
We assume that, since no unseen actions (UV) appear in training, they are treated the same as invalid object-action combinations, and backpropagation does not update them (the gradient is 0). Is our understanding correct?