[2022 spring] ICLR 2021 ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (20214478) #511

KelaJussi commented 2 years ago

Thank you for the review! It was really in-depth and provided much information. Some things to change:

- Two of the figures in the Results section are unreadable due to their size.
- In the ViT section there is an uncompiled equation: ($=HW/P^2$)
- Some typos I noticed:
  - first sentence: its -> their
  - in Conclusion: inital -> initial

Overall, the topic is very abstract. Personally, being unfamiliar with the topic, I found it hard to understand the goals and methods of the paper.

koo616 commented 2 years ago

Thank you for the well-organized review. You laid out the core of the paper clearly, so I could tell that you understood ViT well when you wrote the article. In particular, the description of the Transformer in the Related Work section was well written, which made the paper easier to understand.

Everything was good. I know that the method section of the paper has no other figures, but it would have been better if the method section had been described in more detail using external images. Also, there seems to be an unused bullet point under Fine-tuning in the Experimental section, so please check it.

It was a really good review. Finally, thank you for delivering a very interesting take-home message.

pantheon5100 commented 2 years ago

Thank you for the well-written review. One thing I want to suggest is adding a description of how this work relates to later vision transformer papers. It may give even more hints to the readers.

Everything in this review is good.
