Closed MenghaoGuo closed 2 years ago
Hello, and thank you for your interest. Congratulations on your work and its fascinating ImageNet performance.
We strictly follow the experiments of NAT, ConvNeXt and Swin to provide a comprehensive and clear comparison of these methods, which means we have to leave out comparisons to many other great works, including, but not limited to, PVT and MaxViT, because they have very different architectures and/or experiment settings.
We therefore found it difficult to compare our models to VAN: it seems to use a different hybrid architecture, with depth-wise convolutions built into the MLP block, along with a different configuration, which makes it hard to directly compare VAN variants to those of Swin, ConvNeXt, NAT, and DiNAT.
I would also add that NA and DiNA are direct sliding-window DPSA modules and do not use (dilated) convolutions to extract weights.
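To make that distinction concrete, here is a minimal 1D sketch of dilated sliding-window attention in plain Python. It is not the NATTEN implementation: the function names, the list-based inputs, and the edge handling (shifting the window so it stays in bounds) are assumptions made for illustration. What it shows is that the attention weights come from query-key dot products over each token's neighbors, not from a learned convolution kernel.

```python
import math


def neighbor_indices(i, n, kernel_size=3, dilation=1):
    """Indices token i attends to (hypothetical helper, 1D case)."""
    # Each dilation group needs at least kernel_size tokens.
    assert kernel_size * dilation <= n
    # Tokens sharing i's residue modulo `dilation` form its dilation group.
    group = list(range(i % dilation, n, dilation))
    pos = group.index(i)
    # Assumed edge handling: shift the window so it never leaves the sequence.
    start = min(max(pos - kernel_size // 2, 0), len(group) - kernel_size)
    return group[start:start + kernel_size]


def dilated_neighborhood_attention_1d(q, k, v, kernel_size=3, dilation=1):
    """q, k, v: lists of n tokens, each token a list of floats."""
    n = len(q)
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i in range(n):
        neighbors = neighbor_indices(i, n, kernel_size, dilation)
        # Dot-product logits between query i and its neighbors' keys.
        logits = [scale * sum(a * b for a, b in zip(q[i], k[j]))
                  for j in neighbors]
        # Numerically stable softmax over the neighborhood only.
        top = max(logits)
        weights = [math.exp(x - top) for x in logits]
        total = sum(weights)
        # Weighted sum of the neighbors' values.
        out.append([sum(w / total * v[j][c] for w, j in zip(weights, neighbors))
                    for c in range(len(v[0]))])
    return out
```

With `dilation=1` this reduces to plain sliding-window attention; a larger `dilation` spreads the same number of neighbors over a wider span.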
I hope this answers your question, but please let me know if that is not the case.
Thanks for the detailed reply. I understand the difference between VAN and DiNAT and agree with your viewpoint.
In my opinion, although they have some differences, the core idea is similar: both adopt dilation to enlarge the receptive field.
I think the differences between DiNAT, VAN and MaxViT should be discussed in the related work section.
Indeed, the idea of dilation (and the algorithme à trous) is not new, and it has been explored in many earlier works going back decades. We’ve included MaxViT in the background section, but did not know about VAN at the time. Could you remind us where it’s been published? We would be happy to include more relevant works in the future.
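As a rough illustration of why dilation enlarges the receptive field, a window of size k with dilation d covers a span of d * (k - 1) + 1 positions. The helper names below are hypothetical, and the sizes in the usage note are illustrative rather than taken from any of the papers discussed.

```python
def dilated_span(kernel_size, dilation):
    """Positions spanned by one dilated window (1D)."""
    return dilation * (kernel_size - 1) + 1


def stacked_receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 layers, one dilation per layer."""
    # Each layer widens the field by dilation * (kernel_size - 1).
    return 1 + sum(d * (kernel_size - 1) for d in dilations)
```

For example, `dilated_span(7, 4)` gives 25, versus 7 without dilation, while the number of positions actually attended to stays at 7.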
Closing this due to inactivity. If you still have questions feel free to open it back up.
Dear authors:
Congratulations on your excellent results on DiNAT.
However, I think the idea of this paper is similar to that of VAN (Code).
Both adopt dilation to enlarge the receptive field, giving the network both locality and global context. Both also apply dilation in a visual backbone and achieve strong performance on downstream tasks such as semantic segmentation.
Why not compare with it?
Best, Menghao