[2021 fall] ICLR 2020 Pre-training of Generic Visual-Linguistic Representations (VL-BERT) (20213784)

bo-10000 commented 3 years ago

서식 관련

Eng version <-> Kor version으로의 링크가 동작하지 않습니다.
Problem Definition의 마지막 문단에서 오타가 있습니다 (Linguaistic -> Linguistic)
Related works의 표와 Idea의 그림이 줄바꿈이 되어 있지 않고, 크기가 작습니다.
Method - VL-Bert Architecture에서 in-line 수식이 깨져 있습니다.

내용 관련

Visual Question Answering에 대해 조금 더 설명을 추가해 주시면 좋을 것 같습니다. 자세한 설명까지는 아니여도, 한 줄 정도의 problem definition이나 간단한 예시 정도 추가해 주시면 좋을 듯 합니다.
Method에서 MLM 이 무엇의 약자인지 덧붙혀 주면 좋을 것 같습니다.
Visual appearance feature에서, ROI는 어떻게 추출하는 것인지 이해가 잘 가지 않습니다.
Visual geometry embedding에서, 마지막에 2048-d라고 되어 있는데 4-d vector을 어떤 과정을 통해 2048-d로 만드는 것인가요?
Sequence positional embedding은 text element에만 추가된다고 되어 있는데, Figure을 보면 image에도 sequence position embedding 값이 있는 것 같습니다. 이 부분에 대한 추가 설명을 해주시면 좋을 것 같습니다.

tungngreen commented 3 years ago

Summary

The article reviews the paper titled Pre-training of Generic Visual-Linguistic Representations by Su et al. The paper focuses on a pretraining method to align visual representations and linguistic information to serve various downstream tasks.
Detailed Reviews

Strengths:
- In general, the article was very well written. All parts are covered well with enough technical details and sufficient explanations.
- The author of this article (from this point on, simply referred to as the author, and not to be confused with the authors of the original paper) has done a terrific job in the first two sections.
  - The Problem Definition section gently, but efficiently introduces the problem that the paper is trying to solve. Core concepts are mentioned and well-explained.
  - The Related Works section provides enough information related to the proposed method.
  - The idea is well summarized.
- It's very good that the author spends time sufficiently explaining BERT before going into the architecture of VL-BERT.
- The architecture, pretraining and finetuning steps are well explained. So is the rest of the paper.
Reviewer's Suggestions:
- In general, I think the article is sufficient to serve the purpose of reviewing. There are only a few details I hope the author can fix:
  - For Visual Appearance Feature and Visual Geometry Embedding, it would be great if the author can include some visualization for these concepts.
  - The formulas are not rendered by gitbook. I suggest the author use a typeset website to type the formulas and reference the links or include their rendered figures.
  - The explanations for the datasets are a bit short.
  - Some references to topics mentioned in Sections 1 and 2 should be included so the readers can read up on them. It doesn't have to be a lot, just a few papers or review articles that the authors think would help the readers.
  - Some long sentences are somewhat confusing. E.g., "A typical example is a model such as BERT using the "Masked Language Model" method in the Vision area using a pre-trained model on ImageNet data as a backbone, or in the NLP field that has appeared a little more recently." I think the author could divide it into shorter sentences.

juyebshin commented 3 years ago

안녕하세요, 신주엽입니다. 먼저 리뷰 잘 읽었습니다. Vision과 linguistic model을 결합한 task에 대해 전반적으로 이해할 수 있었습니다.

리뷰를 읽으면서 추가하거나 수정하면 좋은 것들 정리해보았습니다.

수식이 잘 나오도록 수정이 필요할 것 같습니다 ($를 $$로 수정)
Method 섹션에서 간략한 예시를 들어가며 설명하면 이해가 조금 더 잘 될 것 같습니다

감사합니다.

awesome-davian / awesome-reviews-kaist

[2021 fall] ICLR 2020 Pre-training of Generic Visual-Linguistic Representations (VL-BERT) (20213784) #124

서식 관련

내용 관련