It is interesting that the paper states: "Our observations indicate that substituting the previously used DWConv or Attention with our DCNv4 leads to an increase in inference speed".
Could you provide the implementation details of "substituting the attention with DCNv4"?
We first remove the class token in the ViT and use average pooling over the patch tokens to obtain the final representation for classification, so that we have a regular square 2D feature map on which DCNv4 can operate.
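The class-token removal and average pooling described above can be sketched as follows. This is a minimal illustration with assumed names (`AvgPoolHead` is hypothetical, not the authors' code): the patch tokens are kept as a square 2D map with no class token, and the classification representation is obtained by global average pooling over the spatial dimensions.

```python
import torch
import torch.nn as nn

class AvgPoolHead(nn.Module):
    """Hypothetical ViT classification head without a class token:
    average-pool the 2D patch-token map for the final representation."""

    def __init__(self, embed_dim: int, num_classes: int):
        super().__init__()
        self.norm = nn.LayerNorm(embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) -- regular square 2D feature map, no class token
        x = self.norm(x)
        x = x.mean(dim=(1, 2))  # global average pooling over spatial dims
        return self.fc(x)

head = AvgPoolHead(embed_dim=192, num_classes=1000)
tokens = torch.randn(2, 14, 14, 192)  # e.g. 14x14 patch tokens from a ViT
logits = head(tokens)
print(logits.shape)  # torch.Size([2, 1000])
```

Because the tokens stay in a square 2D layout, a spatial operator such as DCNv4 can then be applied in place of the attention layer inside each block.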