Since the multi-scale deformable attention module extracts image features around the reference
point, we design the detection head to predict the bounding box as relative offsets w.r.t. the reference
point to further reduce the optimization difficulty.
Can you please explain, or point me to the code where, the head is designed to predict the bounding box as relative offsets w.r.t. the reference point? If we want to try predicting absolute bounding boxes instead (I understand the relative parameterization was chosen to reduce the optimization difficulty), is that possible without altering the MSDeformableAttention module?
(The passage quoted at the top is from A.3 of the paper.)
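For context, here is my rough understanding of the relative parameterization as a minimal sketch. The variable names and example values below are my own for illustration, not taken from the repo: the head emits raw offsets, the reference point is mapped to logit space with an inverse sigmoid, the offset is added there, and the sum is squashed back into [0, 1].

```python
import math

def inverse_sigmoid(x, eps=1e-5):
    # Map a normalized coordinate in (0, 1) back to logit space,
    # clamping away from 0 and 1 for numerical stability.
    x = min(max(x, eps), 1 - eps)
    return math.log(x / (1 - x))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical values: a normalized reference point (cx, cy) in [0, 1]
# and raw head outputs (delta_cx, delta_cy, logit_w, logit_h).
reference_point = (0.40, 0.60)
head_output = (0.10, -0.20, -1.5, -1.0)

# The box center is predicted as an offset in logit space relative to
# the reference point; width and height come directly from the head.
cx = sigmoid(head_output[0] + inverse_sigmoid(reference_point[0]))
cy = sigmoid(head_output[1] + inverse_sigmoid(reference_point[1]))
w = sigmoid(head_output[2])
h = sigmoid(head_output[3])
```

If I am reading the official implementation correctly, this corresponds to the `bbox_embed` heads plus `inverse_sigmoid` in `models/deformable_detr.py` (the raw head output is added to the inverse-sigmoid of the reference points before the final sigmoid), but please correct me if I have that wrong.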