The paper says:

"The detection head is of a 3-layer FFN for bounding box regression, and a linear projection for bounding box binary classification (i.e., foreground and background)"

Doesn't that mean `class_embed` should only have 2 outputs? (It is later used here.)

EDIT: After further investigation, it seems my confusion comes from this line. Why do we pick the best-scoring bounding boxes based on the first class?
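
To make the question concrete, here is a minimal plain-Python sketch of what "picking the best-scoring boxes based on the first class" amounts to. It is not the repository's code: the function name `topk_by_first_channel`, the example logits, and the use of a per-channel sigmoid are all assumptions made for illustration. The point is that only channel 0 of the classification output influences the ranking, so the remaining channels play no role in proposal selection, however many of them the head has.

```python
import math


def sigmoid(x):
    """Map a raw logit to a score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def topk_by_first_channel(class_logits, k):
    """Return the indices of the k proposals with the highest channel-0 score.

    class_logits holds one list of logits per proposal, e.g. [[c0, c1, ...], ...].
    Only c0 is read; every other channel is ignored for ranking.
    """
    scores = [sigmoid(row[0]) for row in class_logits]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Hypothetical logits for 4 proposals and 3 output channels.
logits = [
    [-2.0, 5.0, 0.1],
    [1.5, -3.0, 0.0],
    [0.3, 9.0, 9.0],
    [3.0, 0.0, -1.0],
]

top2 = topk_by_first_channel(logits, 2)
print(top2)
```

In this sketch, proposal 2 has very large values in channels 1 and 2 but is still not selected, because its channel-0 score is low. If that matches what the line in question does, then channel 0 is effectively acting as the foreground score, and a head with more than 2 outputs would not by itself contradict the binary classification described in the paper.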