chullhwan-song opened 5 years ago
Don't confuse this with the Google Landmark Classification Challenge.
Structure
Put simply:
A hint at the identity of the REMAP feature used by the 1st-place entry:
https://twitter.com/ducha_aiki/status/1008833406036107270
It seems to build features by aggregating per-RoI descriptors from multiple conv layers (one per layer).
Practically, the KL-divergence Weighting (KLW) block in the REMAP architecture is implemented
using a convolutional layer with weights initialized by the KL-divergence values and optimized
using Stochastic Gradient Descent (SGD) on the triplet loss function.
...
Our novel component, KL-divergence weighting (KLW), can be implemented using a 1D convolutional
layer, with weights that can be optimized.
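A minimal numpy sketch of this idea, under the assumption stated in the quotes: the KLW block reduces to an elementwise weighting of RoI-aggregated features, with weights initialized from KL-divergence values and then left free to be tuned by SGD on a triplet loss. The KL values, feature shapes, and function names here are invented for illustration, not taken from the REMAP code.

```python
import numpy as np

def klw_init(kl_divergences):
    """Initialize KLW weights from per-dimension KL-divergence values.

    In REMAP these seed a 1D convolutional layer that is then fine-tuned
    with SGD on a triplet loss; here we only show the initialization and
    the forward weighting step.
    """
    return np.asarray(kl_divergences, dtype=np.float64)

def klw_forward(region_features, weights):
    """Weight each feature dimension by its (learnable) KL-based weight.

    region_features: (num_regions, dim) array of RoI-aggregated features.
    """
    weighted = region_features * weights               # broadcast over regions
    pooled = weighted.sum(axis=0)                      # aggregate the regions
    return pooled / (np.linalg.norm(pooled) + 1e-12)   # L2-normalize

# Hypothetical KL-divergence values for a 4-D feature (illustration only)
w = klw_init([0.9, 0.1, 0.5, 0.3])
feats = np.array([[1.0, 2.0, 0.0, 1.0],
                  [0.5, 1.0, 1.0, 0.0]])
desc = klw_forward(feats, w)
print(desc.shape)  # (4,)
```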
1. Extract features from hidden layer
2. For each image from test set find K closest images from train set (K=5)
3. For each class_id we computed: scores[class_id] = sum(cos(query_image, index) for index in K_closest_images)
4. For each class_id we normalized its score: scores[class_id] /= min(K, number of samples in train dataset with class=class_id)
5. label = argmax(scores), confidence = scores[label]
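The five steps above can be sketched in Python as follows. The toy features and labels are invented; in the competition the features came from a CNN hidden layer.

```python
import numpy as np
from collections import Counter

def classify(query, train_feats, train_labels, k=5):
    """Steps 1-5: score classes by summed cosine similarity over the
    K closest train images, normalized by min(K, class frequency)."""
    # cosine similarity of the query to every train image
    q = query / np.linalg.norm(query)
    t = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    sims = t @ q
    nearest = np.argsort(-sims)[:k]            # step 2: K closest images
    class_counts = Counter(train_labels)
    scores = {}
    for idx in nearest:                        # step 3: sum similarities
        c = train_labels[idx]
        scores[c] = scores.get(c, 0.0) + sims[idx]
    for c in scores:                           # step 4: normalize
        scores[c] /= min(k, class_counts[c])
    label = max(scores, key=scores.get)        # step 5: argmax
    return label, scores[label]

# Toy data: two classes in 2-D feature space
train_feats = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]])
train_labels = [0, 0, 1, 1]
label, conf = classify(np.array([1.0, 0.0]), train_feats, train_labels, k=3)
print(label)  # 0
```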
1. Compute a confidence score for each label, using the predictions from steps 1-4, as follows: score[label] = label_count / models_count + sum(label_confidence for each model) / models_count, where label_count is the number of models whose max-confidence prediction equals the label.
2. We also used each prediction from step 5, with confidence = 1 + confidence_from_step_5 / 100.
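A sketch of the ensemble rule in step 1, assuming each model contributes one (label, confidence) pair (its top prediction); the per-model outputs below are invented. Models that did not vote for a label contribute zero confidence for it, which is one reading of the formula above.

```python
from collections import Counter

def ensemble(predictions):
    """Combine per-model (label, confidence) predictions:
    score[label] = label_count / models_count
                 + sum(confidences for that label) / models_count
    where label_count is the number of models voting for the label.
    """
    models_count = len(predictions)
    votes = Counter(label for label, _ in predictions)
    conf_sum = {}
    for label, conf in predictions:
        conf_sum[label] = conf_sum.get(label, 0.0) + conf
    scores = {label: votes[label] / models_count
                     + conf_sum[label] / models_count
              for label in votes}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Hypothetical outputs from three models
best, score = ensemble([(0, 0.9), (0, 0.8), (1, 0.95)])
print(best)  # 0
```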
2-stage approach
Source: https://github.com/jandaldrop/landmark-recognition-challenge/
https://www.kaggle.com/c/landmark-retrieval-challenge