ChangyaoTian / VL-LTR

VL-LTR: Learning Class-wise Visual-Linguistic Representation for Long-Tailed Visual Recognition

Mismatch between code and diagram in paper for the fine-tuning phase #9

Open rahulvigneswaran opened 1 year ago

rahulvigneswaran commented 1 year ago

In Fig. 3 (stage 2) of the paper, it looks like the attention weights are computed from the vision and language features (Q is vision, K is language) and then applied to the language features (V). In the code, however, the attention is applied to the visual features. Can you confirm which of the two is correct? @ChangyaoTian
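To make the two readings concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the function names, the toy vectors, the use of scaled dot-product softmax attention, and in particular `visual_v`, a hypothetical set of visual vectors with one row per key. None of it is taken from the VL-LTR code or paper. It only shows that the weights are identical in both readings and that the choice of V decides which modality the output is built from.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Scaled dot-product attention for a single query vector.

    The weights depend only on q and keys; the output is the
    weighted sum of the rows of values (one row per key).
    """
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, out


# Toy data (made up): one visual query and M = 2 language vectors.
visual_q = [1.0, 0.0]
language = [[1.0, 0.0], [0.0, 1.0]]
# Hypothetical visual vectors, one per key, used only for contrast.
visual_v = [[2.0, 2.0], [4.0, 4.0]]

# Reading of Fig. 3: Q = vision, K = language, V = language.
w_a, out_a = attend(visual_q, language, language)

# Reading reported for the code: same weights, applied to visual features.
w_b, out_b = attend(visual_q, language, visual_v)

print("weights:", w_a)
print("V = language ->", out_a)
print("V = visual   ->", out_b)
```

In the first call the output is a convex combination of the language vectors; in the second it is a convex combination of the visual vectors, even though Q and K are unchanged. That is the difference the question is about.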

linzhiqiu commented 1 year ago

Were you able to figure it out? @rahulvigneswaran

rahulvigneswaran commented 1 year ago

Nope.