% 2023/04/14
@article{oquab2023dinov2,
title={DINOv2: Learning Robust Visual Features without Supervision},
author={Maxime Oquab and Timothée Darcet and Théo Moutakanni and Huy Vo and Marc Szafraniec and Vasil Khalidov and Pierre Fernandez and Daniel Haziza and Francisco Massa and Alaaeldin El-Nouby and Mahmoud Assran and Nicolas Ballas and Wojciech Galuba and Russell Howes and Po-Yao Huang and Shang-Wen Li and Ishan Misra and Michael Rabbat and Vasu Sharma and Gabriel Synnaeve and Hu Xu and Hervé Jegou and Julien Mairal and Patrick Labatut and Armand Joulin and Piotr Bojanowski},
journal={arXiv preprint arXiv:2304.07193},
year={2023}
}
Self-supervised methods can produce foundation models (task-agnostic pretrained representations) in computer vision.
Contributions accelerate and stabilize training at scale (2x faster, 3x less memory):
Create a pipeline to build a curated dataset.
Rebalance concepts and avoid overfitting on a few dominant modes.
Train a ViT (1B params) first, then distill it into smaller models.
Surpasses OpenCLIP (weakly supervised) on most benchmarks.
Performs especially well on dense recognition tasks.
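The distillation step above (large frozen teacher, smaller student trained to match its features) can be sketched as follows. This is a toy numpy sketch, not the paper's code: the "encoders" are random linear maps standing in for the 1B-param teacher ViT and the smaller student, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy "encoders": random linear maps standing in for the
# frozen teacher ViT and the smaller student being distilled.
D_IN, D_EMB = 32, 16
W_teacher = rng.normal(size=(D_IN, D_EMB))   # frozen after pretraining
W_student = rng.normal(size=(D_IN, D_EMB))   # trained to match the teacher

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def distill_step(x, W_student, lr=0.05):
    """One gradient step on a feature-matching (cosine) distillation loss."""
    t = l2_normalize(x @ W_teacher)          # teacher features, no gradient
    s = x @ W_student
    s_n = l2_normalize(s)
    loss = np.mean(np.sum((s_n - t) ** 2, axis=-1))  # = 2 - 2*cos similarity
    # Gradient w.r.t. W_student, backpropagated through the normalization.
    norm = np.linalg.norm(s, axis=-1, keepdims=True)
    g = 2 * (s_n - t) / len(x)
    g = (g - s_n * np.sum(g * s_n, axis=-1, keepdims=True)) / norm
    return W_student - lr * x.T @ g, loss

x = rng.normal(size=(64, D_IN))
losses = []
for _ in range(200):
    W_student, loss = distill_step(x, W_student)
    losses.append(loss)
# The loss shrinks as the student's features align with the teacher's.
```

The real recipe distills a pretrained ViT-g into ViT-L/B/S models with full self-distillation objectives; this sketch only illustrates the teacher-frozen, student-updated structure.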
Results on 8 tasks.
Results on ImageNet-1k.
Results on domain generalization.
The whole data-processing pipeline is distributed over a compute cluster of 20 nodes, each equipped with 8 V100-32GB GPUs, and takes less than two days to produce the LVD-142M dataset.
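The curation idea (deduplicate, then rebalance so a few dominant concepts do not overwhelm the pool) can be sketched with embedding clustering. Everything below is a hypothetical toy, not the LVD-142M pipeline: the embeddings, threshold, and cluster count are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image embeddings": one dominant concept (cluster 0) dwarfs the rest,
# mimicking the skew that the rebalancing step is meant to correct.
centers = rng.normal(size=(4, 8)) * 5
counts = [900, 40, 35, 25]                       # heavily imbalanced pool
emb = np.vstack([c + rng.normal(size=(n, 8))
                 for c, n in zip(centers, counts)])

def deduplicate(x, thresh=1.0):
    """Greedy near-duplicate removal: keep an item only if it is farther
    than `thresh` from every item already kept."""
    kept = []
    for i, v in enumerate(x):
        if all(np.linalg.norm(v - x[j]) > thresh for j in kept):
            kept.append(i)
    return np.array(kept)

def rebalance(x, centers, per_cluster=30):
    """Assign each item to its nearest center, then take at most
    `per_cluster` items from every cluster."""
    assign = np.argmin(
        np.linalg.norm(x[:, None] - centers[None], axis=-1), axis=1)
    picked = []
    for k in range(len(centers)):
        idx = np.flatnonzero(assign == k)
        picked.extend(idx[:per_cluster])
    return np.array(picked)

kept = deduplicate(emb)
curated = rebalance(emb[kept], centers)
# No single concept can now contribute more than `per_cluster` examples,
# so the dominant cluster no longer dominates the curated set.
```

The actual pipeline curates by retrieving images similar to curated seed datasets and deduplicating at web scale; this sketch only shows the cluster-and-cap rebalancing pattern.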
Comparison with previous work:
Intra-image self-supervised training (like MAE) -> the features require supervised finetuning.
Discriminative self-supervised learning (like DeepCluster) -> hard to scale to larger model sizes.
Paper link
Publication date (yyyy/mm/dd)
2023/04/14
Overview
Research Question
A concise statement of the question the research aims to answer.
Elevator Pitch
For [target users] who want to [satisfy a latent need or solve a latent problem], [proposed method] is a [category of proposed method]. It can [what the method enables] and, unlike [the SoTA alternative], it has [the decisive differentiating feature].
TeX