Open jialeli1 opened 3 years ago
Hi, you can refer to these lines.
Hi, I also have trouble understanding that part. Specifically, when the reference points are computed by a linear projection of the query embedding (two_stage=False), the predicted centers are refined iteratively. However, I don't understand why the new reference points need to be detached; it seems this would leave no gradient to update the embeddings. Am I wrong? I would appreciate any hint.
Did you figure out the reason? I'm also stuck here.
Hi. As mentioned in A.4 of your paper, both the center coordinates (x, y) and the dimensions (w, h) of the box are refined iteratively, and the initial box is set with b_w = 0.1 and b_h = 0.1. https://github.com/fundamentalvision/Deformable-DETR/blob/11169a60c33333af00a4849f1808023eba96a931/models/deformable_detr.py#L172 In the code, under the single-stage setting (two_stage=False), reference.shape[-1] seems to be 2 consistently, so the reference holds only the center coordinates, which suggests that only the center coordinates are refined iteratively. The box dimensions seem to be predicted independently at each decoder level, which is inconsistent with the paper and really puzzles me.
I would be very grateful if you could give me some hints.
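To make the distinction in the question concrete, here is a minimal pure-Python sketch of the two cases being contrasted. It is not the repository's code: `refine_box` and its arguments are hypothetical names, and the logic only assumes what the post describes, namely that the reference is added in logit (inverse-sigmoid) space and that a 2-element reference touches only the center while a 4-element reference touches all four box values.

```python
import math


def inverse_sigmoid(x, eps=1e-5):
    """Map a normalized value in [0, 1] back to logit space, with clamping."""
    x = min(max(x, 0.0), 1.0)
    return math.log(max(x, eps) / max(1.0 - x, eps))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def refine_box(reference, delta):
    """Hypothetical helper illustrating the branch discussed above.

    reference: 2 values (cx, cy) or 4 values (cx, cy, w, h), normalized.
    delta:     4 raw outputs of a box head for the current decoder level.
    """
    if len(reference) not in (2, 4):
        raise ValueError("reference must have 2 or 4 elements")
    logits = list(delta)
    # Only the entries covered by the reference are offset by it.
    for i, ref_value in enumerate(reference):
        logits[i] += inverse_sigmoid(ref_value)
    return [sigmoid(v) for v in logits]


# 2-element reference: w and h depend on the head output alone.
center_only = refine_box([0.3, 0.7], [0.0, 0.0, 0.0, 0.0])

# 4-element reference: w and h start from the previous 0.1 x 0.1 box.
full_box = refine_box([0.3, 0.7, 0.1, 0.1], [0.0, 0.0, 0.0, 0.0])
```

With a zero head output, the 2-element case keeps the center at (0.3, 0.7) but gives w = h = 0.5 (the sigmoid of 0), whereas the 4-element case reproduces the 0.1 by 0.1 box. That is the difference between refining only the center and refining the whole box.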