New Study : https://www.notion.so/Reading-Papers-Deep-Learning-504b50ddaed14360b34dfd6d49cb3455
Update 2024.01.09
Paper Review
- This is personal study; I am working hard on it, but these are not perfect reviews.
- Even after a review is done, it keeps being updated whenever I have new questions, thoughts, corrections, or good resources.
- link_review entries link to good reviews written by others.
- light_link means I only skimmed the paper at the concept (abstract) level.
- Since I am currently unable to publish the reviews themselves, I am organizing this list with paper links only.
Virtual Try On [Link]
Asymmetric Image Retrieval [Link]
Deep Learning
Multi-Label Image Recognition
- Learning Discriminative Representations for Multi-Label Image Recognition : [paper]
Knowledge distillation
- Knowledge distillation: A good teacher is patient and consistent : [paper]
- Hierarchical Self-supervised Augmented Knowledge Distillation : [paper]
- Text is Text, No Matter What: Unifying Text Recognition using Knowledge Distillation : [paper]
Vision and Language Pre-training [Link]
CLIP & joint multi-modal
Efficient training & tricks
Imbalanced Datasets
Self Supervised Learning & unsupervised learning & semi/weakly supervised learning
- Unsupervised Representation Learning by Predicting Image Rotations : [paper]
- Unsupervised Visual Representation Learning by Context Prediction : [paper]
- Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles : [paper]
- Discriminative Unsupervised Feature Learning with Exemplar Convolutional Neural Networks : [paper]
- Rethinking Pre-training and Self-training : [paper]
- Selfie: Self-supervised Pretraining for Image Embedding : [paper] [light_review]
- Self-training with Noisy Student improves ImageNet classification : [paper] [review]
- SimCLR : A Simple Framework for Contrastive Learning of Visual Representations : [paper] [link_review] [link_review] [link_review] [link_review] [link_review] [link_review] [link_review] [link_review]
- SimCLR V2: Big Self-Supervised Models are Strong Semi-Supervised Learners : [paper]
- MoCo : Momentum Contrast for Unsupervised Visual Representation Learning : [paper]
- MoCo V2 : Improved Baselines with Momentum Contrastive Learning : [paper] [link_review] [link_review]
- MoCo V3 : An Empirical Study of Training Self-Supervised Vision Transformers: [paper] [link_review] [link_review]
- BYOL : Bootstrap your own latent: A new approach to self-supervised Learning: [paper]
- Exploring the limits of weakly supervised pretraining : [paper]
- Triplet is All You Need with Random Mappings for Unsupervised Visual Representation Learning : [paper]
- ScatSimCLR: self-supervised contrastive learning with pretext task regularization for small-scale datasets : [paper]
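Most of the contrastive papers above (the SimCLR/MoCo family) build on the same normalized temperature-scaled cross-entropy (NT-Xent) objective. A minimal numpy sketch of that loss, assuming two already-computed embedding views; names, shapes, and the temperature value are illustrative, not taken from any one paper's code:

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent (normalized temperature-scaled cross-entropy) loss.
    z1, z2: (N, D) embeddings of two augmented views of the same N images."""
    z = np.concatenate([z1, z2], axis=0)              # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # L2-normalize rows
    sim = z @ z.T / temperature                       # scaled cosine similarity
    np.fill_diagonal(sim, -np.inf)                    # exclude self-similarity
    n = z1.shape[0]
    # the positive for index i is its other view: i+n (or i-n in the 2nd half)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()
```

Identical views give a small loss; mismatched pairings give a larger one, which is the behavior the contrastive objective rewards.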
Self Supervised Training + Mask based Token + Transformer
- MST: Masked Self-Supervised Transformer for Visual Representation : [paper]
- Masked Autoencoders Are Scalable Vision Learners : [paper]
- SimMIM: A Simple Framework for Masked Image Modeling : [paper]
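The masked-modeling papers above (MAE, SimMIM) start from the same preprocessing step: randomly mask most patch tokens and keep only the visible ones. A hedged numpy sketch of that masking step; the 75% ratio follows MAE's description, everything else is illustrative:

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, rng=None):
    """Keep a random subset of patch tokens; return kept tokens and a
    binary mask (1 = masked) over the original token order."""
    rng = np.random.default_rng(rng)
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    order = rng.permutation(n)
    keep = np.sort(order[:n_keep])   # preserve original token order
    mask = np.ones(n, dtype=int)
    mask[keep] = 0
    return patches[keep], mask
```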
Self Supervised Training + Instance Image Retrieval
- InsCLR: Improving Instance Retrieval with Self-Supervision : [paper]
Vision Transformers classification
- Stand-Alone Self-Attention in Vision Models : [paper][review] [link_review] [link_review] [link_review] [link_review] [link_review] [link_review] [link_review]
- Selfie: Self-supervised Pretraining for Image Embedding : [paper] [light_review] [link_review] [link_review]
- ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: [paper] [link_review] [link_review] [link_review] [link_review] [link_review] [link_review] [link_review] [link_review]
- DeiT:Training data-efficient image transformers & distillation through attention : [paper] [link_review] [link_review] [link_review] [link_review]
- Bottleneck Transformers for Visual Recognition: [paper] [link_review]
- Going deeper with Image Transformers: [paper]
- Rethinking Spatial Dimensions of Vision Transformers : [paper]
- On the Adversarial Robustness of Visual Transformers: [paper]
- TransFG: A Transformer Architecture for Fine-grained Recognition : [paper]
- Understanding Robustness of Transformers for Image Classification : [paper]
- DeepViT: Towards Deeper Vision Transformer : [paper]
- CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification : [paper]
- CvT: Introducing Convolutions to Vision Transformers: [paper] [link_review]
- Efficient Feature Transformations for Discriminative and Generative Continual Learning : [paper]
- Swin Transformer: Hierarchical Vision Transformer using Shifted Windows : [paper] [link_review] [link_review] [link_review]
- Can Vision Transformers Learn without Natural Images?: [paper]
- Scaling Local Self-Attention for Parameter Efficient Visual Backbones: [paper]
- Incorporating Convolution Designs into Visual Transformers : [paper]
- ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases : [paper]
- Explicitly Modeled Attention Maps for Image Classification : [paper]
- Conditional Positional Encodings for Vision Transformers : [paper]
- Transformer in Transformer: [paper] [link_review]
- A Survey on Visual Transformer: [paper]
- Co-Scale Conv-Attentional Image Transformers: [paper]
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity : [paper] [link_review]
- LocalViT: Bringing Locality to Vision Transformers : [paper]
- Visformer: The Vision-friendly Transformer : [paper]
- Multiscale Vision Transformers : [paper] [link_review] [link_review]
- So-ViT: Mind Visual Tokens for Vision Transformer: [paper]
- Token Labeling: Training a 85.4% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet (later renamed "All Tokens Matter: Token Labeling for Training Better Vision Transformers"): [paper]
- Fourier Image Transformer: [paper]
- Emerging Properties in Self-Supervised Vision Transformers: [paper]
- ConTNet: Why not use convolution and transformer at the same time?: [paper]
- Twins: Revisiting Spatial Attention Design in Vision Transformers: [paper]
- MoCo V3 :An Empirical Study of Training Self-Supervised Vision Transformers: [paper] [link_review] [link_review]
- Conformer: Local Features Coupling Global Representations for Visual Recognition: [paper]
- Self-Supervised Learning with Swin Transformers: [paper]
- Are Pre-trained Convolutions Better than Pre-trained Transformers?: [paper]
- LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference: [paper]
- Are Convolutional Neural Networks or Transformers more like human vision?: [paper]
- Rethinking Skip Connection with Layer Normalization in Transformers and ResNets: [paper]
- Rethinking the Design Principles of Robust Vision Transformer (Towards Robust Vision Transformer): [paper]
- Longformer: The Long-Document Transformer : [paper] [link_review] [link_review] [link_review] [link_review]
- Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding: [paper]
- On the Robustness of Vision Transformers to Adversarial Examples: [paper]
- Refiner: Refining Self-attention for Vision Transformers: [paper]
- Patch Slimming for Efficient Vision Transformers: [paper]
- RegionViT: Regional-to-Local Attention for Vision Transformers: [paper]
- X-volution: On the unification of convolution and self-attention: [paper]
- The Image Local Autoregressive Transformer: [paper]
- Glance-and-Gaze Vision Transformer: [paper]
- Semantic Correspondence with Transformers: [paper]
- DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification: [paper]
- When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations: [paper] [link_review]
- KVT: k-NN Attention for Boosting Vision Transformers: [paper]
- Less is More: Pay Less Attention in Vision Transformers: [paper]
- FoveaTer: Foveated Transformer for Image Classification: [paper]
- An Attention Free Transformer: [paper]
- Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length: [paper]
- Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks: [paper] [link_review]
- Pre-Trained Image Processing Transformer: [paper] [link_review]
- ResT: An Efficient Transformer for Visual Recognition: [paper]
- Towards Robust Vision Transformer: [paper]
- Aggregating Nested Transformers: [paper]
- GasHis-Transformer: A Multi-scale Visual Transformer Approach for Gastric Histopathology Image Classification: [paper]
- Intriguing Properties of Vision Transformers: [paper] [link_review] [link_review] [link_review]
- Vision Transformers are Robust Learners: [paper]
- Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer: [paper]
- A Survey of Transformers: [paper]
- Armour: Generalizable Compact Self-Attention for Vision Transformers : [paper]
- Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer : [paper]
- Dual-stream Network for Visual Recognition : [paper]
- BEiT: BERT Pre-Training of Image Transformers : [paper]
- Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions : [paper]
- PVTv2: Improved Baselines with Pyramid Vision Transformer : [paper]
- Thinking Like Transformers : [paper]
- CMT: Convolutional Neural Networks Meet Vision Transformers : [paper] [link_review] [link_review]
- Transformer with Peak Suppression and Knowledge Guidance for Fine-grained Image Recognition : [paper]
- ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias : [paper]
- Visual Transformer Pruning : [paper]
- Local-to-Global Self-Attention in Vision Transformers : [paper]
- Feature Fusion Vision Transformer for Fine-Grained Visual Categorization : [paper]
- Vision Xformers: Efficient Attention for Image Classification : [paper]
- EsViT : Efficient Self-supervised Vision Transformers for Representation Learning : [paper]
- GLiT: Neural Architecture Search for Global and Local Image Transformer : [paper]
- Efficient Vision Transformers via Fine-Grained Manifold Distillation : [paper]
- What Makes for Hierarchical Vision Transformer? : [paper]
- AutoFormer: Searching Transformers for Visual Recognition : [paper]
- Focal Self-attention for Local-Global Interactions in Vision Transformers : [paper] [link_review]
- ConvNets vs. Transformers: Whose Visual Representations are More Transferable? : [paper]
- Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight : [paper]
- Mobile-Former: Bridging MobileNet and Transformer : [paper]
- Image Fusion Transformer : [paper]
- PSViT: Better Vision Transformer via Token Pooling and Attention Sharing : [paper]
- Do Vision Transformers See Like Convolutional Neural Networks? : [paper]
- Linformer: Self-Attention with Linear Complexity : [paper] [link_review]
- CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows : [paper]
- How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers : [paper]
- Searching for Efficient Multi-Stage Vision Transformers : [paper]
- Exploring and Improving Mobile Level Vision Transformers : [paper]
- XCiT: Cross-Covariance Image Transformers : [paper]
- Scaled ReLU Matters for Training Vision Transformers : [paper]
- VOLO: Vision Outlooker for Visual Recognition : [paper]
- CoAtNet: Marrying Convolution and Attention for All Data Sizes : [paper] [link_review] [link_review]
- MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer : [paper]
- A free lunch from ViT: Adaptive Attention Multi-scale Fusion Transformer for Fine-grained Visual Recognition : [paper]
- Improved Multiscale Vision Transformers for Classification and Detection : [paper]
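Nearly every entry in this section tokenizes the image the way ViT does: cut it into non-overlapping 16x16 patches and flatten each one. A minimal numpy sketch of just that step (the learned linear projection and position embeddings are omitted):

```python
import numpy as np

def patchify(img, patch=16):
    """Split an (H, W, C) image into non-overlapping flattened patches:
    returns (num_patches, patch*patch*C)."""
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0
    img = img.reshape(h // patch, patch, w // patch, patch, c)
    img = img.transpose(0, 2, 1, 3, 4)         # (H/p, W/p, p, p, C)
    return img.reshape(-1, patch * patch * c)
```

For a 224x224x3 input this yields the familiar 196 tokens of dimension 768.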
Vision Transformers positional embedding
- Self-Attention with Relative Position Representations : [paper] [link_review]
- Vision Transformer with Progressive Sampling : [paper]
- DPT: Deformable Patch-based Transformer for Visual Recognition : [paper]
- CAPE: Encoding Relative Positions with Continuous Augmented Positional Embeddings : [paper]
- Rethinking and Improving Relative Position Encoding for Vision Transformer : [paper]
- Rethinking Positional Encoding : [paper]
- Relative Positional Encoding for Transformers with Linear Complexity : [paper]
- Conditional Positional Encodings for Vision Transformers : [paper]
- Pyramid Adversarial Training Improves ViT Performance : [paper]
- Shunted Self-Attention via Multi-Scale Token Aggregation : [paper]
- AdaViT: Adaptive Vision Transformers for Efficient Image Recognition : [paper]
- ATS: Adaptive Token Sampling For Efficient Vision Transformers : [paper]
- Global Interaction Modelling in Vision Transformer via Super Tokens : [paper]
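The relative and conditional encodings studied above are usually compared against the fixed sinusoidal absolute encoding of the original Transformer. A minimal numpy sketch of that baseline:

```python
import numpy as np

def sinusoidal_positions(n_tokens, dim):
    """Fixed sinusoidal position encoding from "Attention Is All You Need":
    even channels get sin, odd channels get cos, at geometric frequencies."""
    pos = np.arange(n_tokens)[:, None]           # (N, 1)
    i = np.arange(dim // 2)[None, :]             # (1, D/2)
    angles = pos / (10000 ** (2 * i / dim))
    enc = np.zeros((n_tokens, dim))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc
```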
Vision Transformers vs MLP (or Others)
- AS-MLP: An Axial Shifted MLP Architecture for Vision : [paper]
- S2-MLPv2: Improved Spatial-Shift MLP Architecture for Vision : [paper]
- ResMLP: Feedforward networks for image classification with data-efficient training: [paper]
- Pay Attention to MLPs: [paper]
- Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet: [paper]
- MLP-Mixer: An all-MLP Architecture for Vision : [paper]
- Sparse-MLP: A Fully-MLP Architecture with Conditional Computation : [paper]
- ConvMLP: Hierarchical Convolutional MLPs for Vision : [paper]
- Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition : [paper]
- MetaFormer is Actually What You Need for Vision : [paper]
- Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers : [paper] *
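The MLP papers above replace self-attention with MLPs applied along the token axis and the channel axis (MLP-Mixer's two mixing steps). A toy sketch with single-layer linear maps standing in for the paper's two-layer MLPs; layer norms are omitted and all names are illustrative:

```python
import numpy as np

def mixer_block(x, w_token, w_channel):
    """Skeleton of an MLP-Mixer block on tokens x of shape (N, C):
    one linear map mixes across tokens, another across channels,
    each wrapped in a residual connection."""
    x = x + (w_token @ x)    # token mixing:  (N, N) @ (N, C)
    x = x + (x @ w_channel)  # channel mixing: (N, C) @ (C, C)
    return x
```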
Vision Transformers retrieval
- Investigating the Vision Transformer Model for Image Retrieval Tasks: [paper]
- Training Vision Transformers for Image Retrieval: [paper]
- Instance-level Image Retrieval using Reranking Transformers: [paper]
- Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval: [paper]
- TransHash: Transformer-based Hamming Hashing for Efficient Image Retrieval : [paper]
- Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations : [paper]
- Vision Transformer Hashing for Image Retrieval : [paper]
Vision Transformers segmentation and detection
- CoSformer: Detecting Co-Salient Object with Transformers: [paper]
- MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding: [paper]
- Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks: [paper]
- Medical Image Segmentation Using Squeeze-and-Expansion Transformers: [paper]
- SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers: [paper]
- Visual Transformers: Token-based Image Representation and Processing for Computer Vision : [paper]
- DETR:End-to-End Object Detection with Transformers : [paper] [link_review] [link_review] [link_review] [link_review] [link_review]
- Unifying Global-Local Representations in Salient Object Detection with Transformer : [paper]
- A Unified Efficient Pyramid Transformer for Semantic Segmentation : [paper]
- Dual-stream Network for Visual Recognition : [paper]
- MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers : [paper]
- Vision Transformers with Patch Diversification : [paper]
- Improve Vision Transformers Training by Suppressing Over-smoothing : [paper]
- SOTR: Segmenting Objects with Transformers : [paper]
- Simpler is Better: Few-shot Semantic Segmentation with Classifier Weight Transformer : [paper]
- Polyp-PVT: Polyp Segmentation with Pyramid Vision Transformers : [paper]
- Conditional DETR for Fast Training Convergence : [paper]
- Fully Transformer Networks for Semantic Image Segmentation : [paper]
- Segmenter: Transformer for Semantic Segmentation : [paper]
- nnFormer: Interleaved Transformer for Volumetric Segmentation : [paper]
- Benchmarking Detection Transfer Learning with Vision Transformers : [paper]
Vision Transformers video
- An Image is Worth 16x16 Words, What is a Video Worth?: [paper]
- Token Shift Transformer for Video Classification : [paper]
Vision Transformers face
- Robust Facial Expression Recognition with Convolutional Visual Transformers : [paper]
- Learning Vision Transformer with Squeeze and Excitation for Facial Expression Recognition : [paper]
Vision Transformers OCR
- NRTR: A No-Recurrence Sequence-to-Sequence Model For Scene Text Recognition : [paper]
- On Recognizing Texts of Arbitrary Shapes with 2D Self-Attention : [paper]
- 2D Attentional Irregular Scene Text Recognizer : [paper]
Vision Transformers multi-modal
- ReFormer: The Relational Transformer for Image Captioning : [paper]
- Long-Short Transformer: Efficient Transformers for Language and Vision : [paper]
Vision Transformers GAN
- A Hierarchical Transformation-Discriminating Generative Model for Few Shot Anomaly Detection : [paper]
- ViTGAN: Training GANs with Vision Transformers : [paper]
- Styleformer: Transformer based Generative Adversarial Networks with Style Vector : [paper]
- Combining Transformer Generators with Convolutional Discriminators : [paper]
Facebook AI Image Similarity Challenge
- 3rd Place: A Global and Local Dual Retrieval Solution to Facebook AI Image Similarity Challenge : [paper]
Google Landmark Challenge
Image Retrieval (Instance level Image Retrieval) & Deep Feature
- (My paper) All the attention you need: Global-local, spatial-channel attention for image retrieval : [paper]
- Large-Scale Image Retrieval with Attentive Deep Local Features : [paper] [review]
- NetVLAD: CNN architecture for weakly supervised place recognition : [paper][review]
- Learning visual similarity for product design with convolutional neural networks : [paper][review]
- Bags of Local Convolutional Features for Scalable Instance Search : [paper][review]
- Neural Codes for Image Retrieval : [paper][review]
- Conditional Similarity Networks : [paper][review]
- End-to-end Learning of Deep Visual Representations for Image Retrieval : [paper][review]
- CNN Image Retrieval Learns from BoW: Unsupervised Fine-Tuning with Hard Examples : [paper][review]
- Image similarity using Deep CNN and Curriculum Learning : [paper][review]
- Faster R-CNN Features for Instance Search : [paper][review]
- Regional Attention Based Deep Feature for Image Retrieval : [paper][review]
- Unsupervised Feature Learning via Non-Parametric Instance-level Discrimination : [paper][review]
- Object retrieval with deep convolutional features : [paper][review]
- Cross-dimensional Weighting for Aggregated Deep Convolutional Features : [paper][review]
- Learning Embeddings for Product Visual Search with Triplet Loss and Online Sampling : [paper][review]
- Saliency Weighted Convolutional Features for Instance Search : [paper][review]
- 2018 Google Landmark Retrieval Challenge review : [review]
- 2019 Google Landmark Retrieval Challenge review : [review]
- REMAP: Multi-layer entropy-guided pooling of dense CNN features for image retrieval : [paper][review]
- Large-scale Landmark Retrieval/Recognition under a Noisy and Diverse Dataset : [paper][review]
- Fine-tuning CNN Image Retrieval with No Human Annotation : [paper][review]
- Large Scale Landmark Recognition via Deep Metric Learning : [paper][review]
- Deep Aggregation of Regional Convolutional Activations for Content Based Image Retrieval : [paper][review]
- Challenging deep image descriptors for retrieval in heterogeneous iconographic collections : [paper][review]
- A Benchmark on Tricks for Large-scale Image Retrieval : [paper][review]
- Attention-Aware Generalized Mean Pooling for Image Retrieval : [paper][review]
- Class-Weighted Convolutional Features for Image Retrieval : [paper][review] # 100th
- deep image retrieval loss (continuously updated) : [paper][review]
- Matchable Image Retrieval by Learning from Surface Reconstruction:[paper][review]
- Combination of Multiple Global Descriptors for Image Retrieval:[paper][review]
- Unifying Deep Local and Global Features for Efficient Image Search:[paper][review]
- ACTNET: end-to-end learning of feature activations and multi-stream aggregation for effective instance image retrieval:[paper][review]
- Google Landmarks Dataset v2 A Large-Scale Benchmark for Instance-Level Recognition and Retrieval:[paper][review]
- Detect-to-Retrieve: Efficient Regional Aggregation for Image Search:[paper][review]
- Local Features and Visual Words Emerge in Activations:[paper][review]
- Image Retrieval using Multi-scale CNN Features Pooling: [paper][review]
- MultiGrain: a unified image embedding for classes and instances: [paper][link_review] [link_review]
- Divide and Conquer the Embedding Space for Metric Learning: [paper][link_review]
- An Effective Pipeline for a Real-world Clothes Retrieval System: [paper][light_review]
- Instance Similarity Learning for Unsupervised Feature Representation : [paper]
- Towards Accurate Localization by Instance Search : [paper]
- The 2021 Image Similarity Dataset and Challenge : [paper]
- DOLG:Single-Stage Image Retrieval with Deep Orthogonal Fusion of Local and Global Features : [paper]
- Towards A Fairer Landmark Recognition Dataset : [paper]
- Recall@k Surrogate Loss with Large Batches and Similarity Mixup : [paper]
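Several of the retrieval papers above ("Fine-tuning CNN Image Retrieval with No Human Annotation", "Attention-Aware Generalized Mean Pooling") pool the convolutional feature map with generalized-mean (GeM) pooling. A minimal numpy sketch; the default p=3 follows the common practice reported in those papers:

```python
import numpy as np

def gem_pool(features, p=3.0, eps=1e-6):
    """Generalized-mean (GeM) pooling over an (H, W, C) feature map.
    p=1 recovers average pooling; large p approaches max pooling."""
    x = np.clip(features, eps, None)   # GeM assumes non-negative activations
    return (x ** p).mean(axis=(0, 1)) ** (1.0 / p)
```

The single parameter p interpolates between average and max pooling, and in the papers above it is typically learned end-to-end.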
Metric Learning
Fashion Image Retrieval
- Learning Embeddings for Product Visual Search with Triplet Loss and Online Sampling : [paper][review]
- Conditional Similarity Networks : [paper][review]
- Semi-supervised Feature-Level Attribute Manipulation for Fashion Image Retrieval : [paper][link_review]
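The triplet loss named in several retrieval entries above can be sketched in a few lines of numpy (the margin value is illustrative):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet margin loss: push the anchor-negative distance to exceed
    the anchor-positive distance by at least `margin`. Inputs: (N, D)."""
    d_pos = np.linalg.norm(anchor - positive, axis=-1)
    d_neg = np.linalg.norm(anchor - negative, axis=-1)
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()
```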
Fashion Compatibility & Outfit Recommendation
Personalized Outfit Recommendation & fashion outfit
- FashionNet: Personalized Outfit Recommendation with Deep Neural Network: [paper][review]
- Self-supervised Visual Attribute Learning for Fashion Compatibility : [paper]
- Personalized Outfit Recommendation with Learnable Anchors : [paper]
- PAI-BPR: Personalized Outfit Recommendation Scheme with Attribute-wise Interpretability : [paper]
- Hierarchical Fashion Graph Network for Personalized Outfit Recommendation : [paper]
Fashion multi-modal
- Kaleido-BERT: Vision-Language Pre-training on Fashion Domain : [paper]
- Fashion IQ: A New Dataset Towards Retrieving Images by Natural Language Feedback : [paper]
Fashion DataSets
- SHIFT15M: Multiobjective Large-Scale Fashion Dataset with Distributional Shifts : [paper]
Retail & Product & Instance
- Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-modal Pretraining : [paper]
- RP2K: A Large-Scale Retail Product Dataset for Fine-Grained Image Classification : [paper]
- eProduct: A Million-Scale Visual Search Benchmark to Address Product Recognition Challenges : [paper]
- Regional Maximum Activations of Convolutions with Attention for Cross-domain Beauty and Personal Care Product Retrieval:[paper][review]
- Learning visual similarity for product design with convolutional neural networks : [paper][review]
- The Met Dataset: Instance-level Recognition for Artworks : [paper]
Image Retrieval using Deep Hash
- Deep Learning of Binary Hash Codes for Fast Image Retrieval : [paper][review]
- Feature Learning based Deep Supervised Hashing with Pairwise Labels : [paper][review]
- Deep Supervised Hashing with Triplet Labels : [paper][review]
- Online Hashing with Similarity Learning : [paper]
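The deep-hashing papers above all end the same way: binarize the learned embedding with a sign threshold and rank the database by Hamming distance. A minimal numpy sketch of that retrieval step (function names are illustrative):

```python
import numpy as np

def binarize(embeddings):
    """Sign-threshold real-valued embeddings into binary hash codes."""
    return (np.asarray(embeddings) > 0).astype(np.uint8)

def hamming_rank(query_code, db_codes):
    """Rank database items by Hamming distance to the query code;
    returns (ranking indices, distances in database order)."""
    dists = (db_codes != query_code).sum(axis=1)
    return np.argsort(dists, kind="stable"), dists
```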
Video Classification
- NetVLAD: CNN architecture for weakly supervised place recognition : [paper][review]
- Learnable pooling with Context Gating for video classification : [paper][review]
- Less is More: Learning Highlight Detection from Video Duration : [paper][review]
- Efficient Video Classification Using Fewer Frames : [paper][review]
OCR - Recognition
- Synthetically Supervised Feature Learning for Scene Text Recognition : [paper][review]
- FOTS: Fast Oriented Text Spotting with a Unified Network : [paper][review]
- Robust Scene Text Recognition with Automatic Rectification : [paper][review]
- Joint Visual Semantic Reasoning: Multi-Stage Decoder for Text Recognition : [paper]
OCR - Detection
Attention & Deformation
Visual & Textual Embedding
CNN
Transfer Learning
Generative Adversarial Nets
- Generative Adversarial Nets : [paper][review]
- Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks : [paper][review]
- Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks : [paper][review]
- Progressive Growing of GANs for Improved Quality, Stability, and Variation : [paper][review]
- Beholder-GAN: Generation and Beautification of Facial Images with Conditioning on Their Beauty Level : [paper][review]
- Synthetically Supervised Feature Learning for Scene Text Recognition : [paper][review]
- A Style-Based Generator Architecture for Generative Adversarial Networks : [paper][review]
- High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs : [paper][review]
- Everybody Dance Now : [paper][review]
- Be Your Own Prada: Fashion Synthesis with Structural Coherence : [paper][review]
- Fashion-Gen: The Generative Fashion Dataset and Challenge : [paper][review]
- StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks : [paper][review]
- DwNet: Dense warp-based network for pose-guided human video generation: [paper][review]
Face
- FaceNet: A Unified Embedding for Face Recognition and Clustering : [paper][review]
- The Devil of Face Recognition is in the Noise : [paper][link_review]
- Revisiting a single-stage method for face detection : [paper][review]
- MixFaceNets: Extremely Efficient Face Recognition Networks : [paper]
Pose Estimation
NLP/NLU
- Efficient Estimation of Word Representations in Vector Space : [paper][review]
- node2vec: Scalable Feature Learning for Networks : [paper][review]
- Understanding Transformer (self-attention) basics : PPT summary
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding : [paper][review] (in progress)
- DeepRank: A New Deep Architecture for Relevance Ranking in Information Retrieval : [paper][review]
- SNRM: From Neural Re-Ranking to Neural Ranking: Learning a Sparse Representation for Inverted Indexing : [paper][review]
- TF-Ranking: Scalable TensorFlow Library for Learning-to-Rank : [paper][review]
- ConvRankNet: Deep Neural Network for Learning to Rank Query-Text Pairs : [paper][review]
- KNRM: End-to-End Neural Ad-hoc Ranking with Kernel Pooling : [paper][review]
- Conv-KNRM: Convolutional Neural Networks for Soft-Matching N-Grams in Ad-hoc Search : [paper][review]
- PACRR: A position-aware neural IR model for relevance matching : [paper][link_review]
- CEDR: Contextualized Embeddings for Document Ranking : [paper][link]
- Deeper Text Understanding for IR with Contextual Neural Language Modeling : [paper][link]
- Simple Applications of BERT for Ad Hoc Document Retrieval : [paper][link]
- Document Expansion by Query Prediction : [paper][link]
- Passage Re-ranking with BERT : [paper][link]
Domain Adaptation
Curriculum Learning
- CurriculumNet: Weakly Supervised Learning from Large-Scale Web Images : [paper][review]
Image Segmentation
Localization
AutoML
Image Quality
- Learning to Compose with Professional Photographs on the Web : [paper][review]
- Photo Aesthetics Ranking Network with Attributes and Content Adaptation : [paper][review]
- Composition-preserving Deep Photo Aesthetics Assessment : [paper][review]
- Deep Image Aesthetics Classification using Inception Modules and Fine-tuning Connected Layer : [paper][review]
- NIMA: Neural Image Assessment : [paper][review]
Others