
[SLIP] SLIP: Self-supervision meets Language-Image Pre-training #19

Open bigshanedogg opened 1 year ago

bigshanedogg commented 1 year ago

Problem statement

  1. Go beyond multi-modal pre-training on the relationship between image-language pairs: improve data efficiency by combining language supervision with image self-supervision.
    • In short, CLIP + SimCLR (image self-supervision); a rough loss sketch follows below.
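
As a rough illustration of this idea, the sketch below adds a SimCLR-style loss over two augmented image views to a CLIP-style image-text contrastive loss. This is a minimal assumption-laden sketch, not the paper's implementation: the function names, the `ssl_scale` weight, and the temperature values are illustrative placeholders.

```python
# Minimal sketch of a SLIP-style objective (illustrative, not the authors' code):
# CLIP image-text contrastive loss + SimCLR loss on two augmented views.
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """InfoNCE over matched image-text pairs within the batch (CLIP-style)."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: image-to-text and text-to-image directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def simclr_loss(z1, z2, temperature=0.1):
    """NT-Xent loss between two augmented views of the same images."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=-1)   # (2N, D)
    sim = z @ z.t() / temperature                          # (2N, 2N)
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))             # exclude self-similarity
    # The positive for sample i is its other augmented view at index (i + n) mod 2n.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

def slip_loss(image_emb, text_emb, view1_emb, view2_emb, ssl_scale=1.0):
    """Total objective: language supervision plus weighted image self-supervision."""
    return clip_loss(image_emb, text_emb) + ssl_scale * simclr_loss(view1_emb, view2_emb)
```

The key design point is that the two terms share the image encoder, so the text branch and the self-supervised branch both shape the same visual representation; the relative weight of the SSL term is treated here as a free hyperparameter.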

Glossary

Baseline

Data details

| name | abbr | type | format | source | size | description | remark | related tasks |
|---|---|---|---|---|---|---|---|---|
| ImageNet-1K | | image | (image, class) | | | 1K classes, no labels, highly curated | | classification, captioning |
| YFCC15M | | image | | | 15M | English titles & descriptions only | | image-text pretraining |
| Conceptual Captions | CC3M | image | (image, caption) | | 3M | | | image-text pretraining |
| Conceptual 12M | CC12M | image | (image, caption) | | 12M | | | image-text pretraining |
| DTD | | image | | | | downstream transferability; little overlap with the semantic distribution of YFCC15M | | classification |
| SST2 | | image | | | | downstream transferability; little overlap with the semantic distribution of YFCC15M | | classification |
| KITTI | | image | | | | downstream transferability; little overlap with the semantic distribution of YFCC15M | | classification |

Approach

Evaluation

Limitations

(Helpfully, the paper includes anticipated questions and answers in a separate section.)

bigshanedogg commented 1 year ago

2112.12750.pdf