JGuillaumin opened 4 days ago
Hi,
Thank you for your interest in our work. Regarding your questions:
> What is the difference between `models/` and `impl_a/models/`?
`impl_a` is implementation (a) of mixed supervision, as illustrated in Figure 4 (a) of our paper. The main difference lies in `deformable_detr.py` (L122-L125, L198-L219): for `impl_a` we did not change the architecture of the decoder layers and instead add auxiliary predictors for the one-to-many predictions. More details are available in Section 3.3 of our paper.
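For a rough picture of what "auxiliary predictors without touching the decoder layers" can look like, here is a minimal PyTorch sketch. All names (`DecoderWithAuxHead`, `aux_class_embed`, `hidden_dim`, `num_classes`) are illustrative assumptions, not the repository's actual identifiers:

```python
import torch
import torch.nn as nn

# Hypothetical sketch of implementation (a): keep the decoder layers unchanged and
# attach an extra (auxiliary) predictor for the one-to-many branch.
class DecoderWithAuxHead(nn.Module):
    def __init__(self, hidden_dim=256, num_classes=91):
        super().__init__()
        # primary one-to-one heads (as in Deformable-DETR)
        self.class_embed = nn.Linear(hidden_dim, num_classes)
        self.bbox_embed = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 4),
        )
        # auxiliary heads producing the one-to-many predictions
        self.aux_class_embed = nn.Linear(hidden_dim, num_classes)
        self.aux_bbox_embed = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 4),
        )

    def forward(self, decoder_output):
        # decoder_output: (bs, num_queries, hidden_dim) from the unmodified decoder
        one_to_one = (self.class_embed(decoder_output),
                      self.bbox_embed(decoder_output).sigmoid())
        one_to_many = (self.aux_class_embed(decoder_output),
                       self.aux_bbox_embed(decoder_output).sigmoid())
        return one_to_one, one_to_many


head = DecoderWithAuxHead()
(o2o_logits, o2o_boxes), (o2m_logits, o2m_boxes) = head(torch.randn(2, 300, 256))
print(o2o_logits.shape, o2m_logits.shape)  # torch.Size([2, 300, 91]) twice
```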
> Are the model and the training process compatible with `fp16` precision?
We did not test it under fp16 precision. It depends on whether the MS-Deform operators, which we borrowed directly from Deformable-DETR, support fp16. You can check the original implementation repository; as far as I know, some third-party implementations have added fp16 support for the MS-Deform operators.
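If you want to experiment with it, a minimal mixed-precision training-step sketch with `torch.cuda.amp` could look like the following. The tiny model and data are dummy placeholders, not the detector; whether the custom MS-Deform CUDA operator accepts fp16 inputs still has to be verified against the Deformable-DETR code you use (one common workaround is casting its inputs back to fp32 inside the module's forward):

```python
import torch
import torch.nn as nn

# Hypothetical sketch of a mixed-precision training step with torch.cuda.amp.
device = "cuda"
model = nn.Linear(256, 91).to(device)                 # placeholder for the detector
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

for _ in range(2):                                    # dummy training loop
    x = torch.randn(8, 256, device=device)
    target = torch.randint(0, 91, (8,), device=device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                   # fp16 where safe, fp32 elsewhere
        loss = nn.functional.cross_entropy(model(x), target)
    scaler.scale(loss).backward()                     # scale to avoid fp16 gradient underflow
    scaler.step(optimizer)
    scaler.update()
```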
> In DeformableAttention, do you use reference points or bbox references?
We use the reference points.
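For reference, with 2-d reference points the sampling locations in Deformable-DETR are obtained by adding the predicted offsets, normalized by each level's spatial size. A shape-only sketch (random values, illustrative sizes, names following Deformable-DETR's conventions):

```python
import torch

# Shape-only illustration: sampling locations = per-level reference point
# + predicted offsets normalized by each level's (W, H).
bs, len_q, n_heads, n_levels, n_points = 2, 300, 8, 4, 4

reference_points = torch.rand(bs, len_q, n_levels, 2)                       # (x, y) per level
sampling_offsets = torch.randn(bs, len_q, n_heads, n_levels, n_points, 2)   # predicted from Q
spatial_shapes = torch.tensor([[64, 64], [32, 32], [16, 16], [8, 8]])       # (H, W) per level

offset_normalizer = torch.stack([spatial_shapes[..., 1], spatial_shapes[..., 0]], -1)  # (W, H)
sampling_locations = (
    reference_points[:, :, None, :, None, :]
    + sampling_offsets / offset_normalizer[None, None, None, :, None, :]
)
print(sampling_locations.shape)  # torch.Size([2, 300, 8, 4, 4, 2])
```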
> What is the role of `self.im2col_step = 64` in `MSDeformAttn`?
The `im2col_step` likely relates to memory-efficiency handling in the tensor operations as implemented by Deformable-DETR; I did not investigate it in detail.
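Conceptually (this is an assumption to verify against the Deformable-DETR CUDA sources), `im2col_step` seems to act like a micro-batch size: the custom operator processes the batch in chunks of at most `im2col_step` samples to bound the size of intermediate buffers, roughly like:

```python
import torch

# Conceptual illustration only, not the CUDA kernel itself.
def chunked_op(value, im2col_step=64):
    outputs = []
    for start in range(0, value.shape[0], im2col_step):
        chunk = value[start:start + im2col_step]
        outputs.append(chunk * 2.0)   # placeholder for the real sampling/aggregation
    return torch.cat(outputs, dim=0)

print(chunked_op(torch.randn(8, 100, 256), im2col_step=4).shape)  # torch.Size([8, 100, 256])
```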
Also, the attention weights computed in `MSDeformAttn` differ from the vanilla attention operation in PyTorch: the weights are predicted directly from the query Q (and the sampling locations are derived from the reference points), rather than from a dot product between Q and K, as described in the Deformable-DETR paper; you can check it for more details.
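A small illustrative sketch of that contrast (shapes follow Deformable-DETR's conventions; the layer and variable names here are only for the sketch):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Vanilla attention derives its weights from a Q·K dot product; MSDeformAttn
# predicts them from the query alone with a linear layer and a softmax over
# all (level, point) pairs per head.
d_model, n_heads, n_levels, n_points = 256, 8, 4, 4
bs, len_q = 2, 300

query = torch.randn(bs, len_q, d_model)

# deformable-style weights: a projection of Q only
attention_weights_proj = nn.Linear(d_model, n_heads * n_levels * n_points)
w = attention_weights_proj(query).view(bs, len_q, n_heads, n_levels * n_points)
w = F.softmax(w, dim=-1).view(bs, len_q, n_heads, n_levels, n_points)
print(w.shape)            # torch.Size([2, 300, 8, 4, 4])

# vanilla-style weights, for contrast: softmax over a Q·K similarity
key = torch.randn(bs, 1000, d_model)
vanilla_w = F.softmax(query @ key.transpose(1, 2) / d_model ** 0.5, dim=-1)
print(vanilla_w.shape)    # torch.Size([2, 300, 1000])
```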
I hope this clears things up. Feel free to reach out if you have any other questions.
Hi, first of all, thank you very much for your work. It brings a big improvement to the DETR family, and your paper is really well written and clearly explained. Thank you as well for publishing your code & models; it was very easy to run.
I have a few questions about the implementation:
- What is the difference between `models/` and `impl_a/models/`? (I compared a few files and only identified some typo changes, but I don't want to miss something.)
- Are the model and the training process compatible with `fp16` precision?
- In DeformableAttention, do you use reference points or bbox references? (Is `reference_points` of shape `(bs, len_q, n_levels, 2)` or `(bs, len_q, n_levels, 4)`?)
- What is the role of `self.im2col_step = 64` in `MSDeformAttn`?
- About `class MSDeformAttn(nn.Module)`: from what I understood of the attention mechanism, normally we have, schematically, `attention_weights = softmax(dot_product(proj_q(Q), proj_k(K)))`, but here we have `attention_weights = softmax(proj_q(Q))`, where `proj_q` is `self.attention_weights = nn.Linear(d_model, n_heads * n_levels * n_points)`.