Image Retrieval using Multi-scale CNN Features Pooling

Abstract

Is this the return of NetVLAD (see review, paper)? Heh.
- For reference, Detect-to-Retrieve: Efficient Regional Aggregation for Image Search (see review, paper) reached SOTA by combining deep features, following the NetVLAD concept, with classical(?) VLAD.
(Note) In the comparison tables of earlier studies, it was not particularly(?) close to SOTA.
- For the experimental results of these studies, see > https://github.com/chullhwan-song/Reading-Paper/issues/153, https://github.com/chullhwan-song/Reading-Paper/issues/19
Key features
- end-to-end trainable network architecture
- NetVLAD + multi-scale local pooling
- triplet loss
With these features, the method reports SOTA on three datasets (Holidays, Oxford5k, Paris6k).
The code has not been released.

The Proposed Method

the main differences
- Max-pooled local features at two scales are combined > NetVLAD
  - 2x2 & 3x3 max-pooling
    - Note: the two scales do not seem to be multi-layer features; judging from Fig. 1, a single layer's output is taken as input and 2x2 & 3x3 pooling is applied to it.
- Triplet loss based on hard + semi-hard sampling > the aim is to exclude extremely hard samples
Architecture: see Fig. 1 of the paper.

Pooling of local CNN features

Max-pooling is applied with kernel sizes 2x2 & 3x3 (condition: stride=1).
- The goal is a more refined, finer-grained representation (so to obtain representations at finer and larger detail.)
- In other words, pooling operations with different kernel sizes are applied to the output of a single layer.
The stated purpose is to end up with features in 1 × 1 × f "column feature" form; this seems to mean converting local descriptors into a global descriptor. > The paper refers to this process as "aggregated ..."
The two features above are concatenated and expressed as a multi-scale descriptor.
- A change in kernel size is usually described in terms of context; is it different here because the operation is max-pooling rather than convolution?? (Needs more thought.)
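The pooling step described above can be sketched in plain Python. This is only a minimal illustration, not the authors' code (which is not released): the function names `max_pool2d` and `multi_scale_local_descriptors` are hypothetical, no padding is assumed, and since the two pooled maps have different spatial sizes, "concat" is read here as collecting the column features of both scales into one set of local descriptors. That reading is an assumption.

```python
def max_pool2d(fmap, k):
    """Max-pool an H x W x C feature map with a k x k window, stride 1, no padding."""
    h, w, c = len(fmap), len(fmap[0]), len(fmap[0][0])
    out = []
    for i in range(h - k + 1):
        row = []
        for j in range(w - k + 1):
            # Channel-wise maximum over the k x k window anchored at (i, j).
            row.append([
                max(fmap[i + di][j + dj][ch] for di in range(k) for dj in range(k))
                for ch in range(c)
            ])
        out.append(row)
    return out


def multi_scale_local_descriptors(fmap, kernels=(2, 3)):
    """Collect the 1 x 1 x C column features of every pooled map into one list.

    Assumption: the two scales are merged as a set of local descriptors.
    """
    descriptors = []
    for k in kernels:
        for row in max_pool2d(fmap, k):
            descriptors.extend(row)
    return descriptors


# Toy 4 x 4 feature map with 2 channels (the VGG16 map in the notes has 512).
fmap = [[[float(i * 4 + j), float(-(i * 4 + j))] for j in range(4)] for i in range(4)]
descs = multi_scale_local_descriptors(fmap)
print(len(descs), len(descs[0]))
```

With stride 1 and no padding, the 2x2 kernel yields a 3 x 3 map and the 3x3 kernel a 2 x 2 map on this toy input, so the two scales contribute different numbers of column features.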
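aggregated: multiple local descriptors > one global descriptor
- Built with NetVLAD, the DL-based layer that borrows from the classical (old??) VLAD concept.
  - For NetVLAD, see the paper and the (rather messy) review referenced above.
  - To build the bag of visual words, K-Means with K=64 is used, giving a 32k-D descriptor (this part needs more digging).
    - 32768 (32k-D) = 512 (number of channels of the last feature map used) x 64
    - It was built on the MIRFLICKR dataset, but as far as I know NetVLAD does not use K-Means and only borrows the concept... so is this just plain VLAD?? Needs a closer look.
  - backbone - VGG16
  - Explains the notion of local descriptors over the conv feature map
  - The overall VLAD process: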
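As a rough illustration of the aggregation step, here is a minimal sketch of classical hard-assignment VLAD in plain Python. It is not the NetVLAD layer itself (NetVLAD replaces the hard assignment with a learned soft assignment), the function name `vlad` is hypothetical, and intra-normalization is left out for brevity.

```python
import math


def vlad(descriptors, centroids):
    """Hard-assignment VLAD: sum residuals per centroid, then L2-normalize."""
    k, d = len(centroids), len(centroids[0])
    residuals = [[0.0] * d for _ in range(k)]
    for x in descriptors:
        # Index of the nearest centroid (squared Euclidean distance).
        nearest = min(
            range(k),
            key=lambda c: sum((xi - ci) ** 2 for xi, ci in zip(x, centroids[c])),
        )
        for i in range(d):
            residuals[nearest][i] += x[i] - centroids[nearest][i]
    # Flatten the K x D residual matrix into one K*D vector.
    flat = [v for row in residuals for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat


# Toy example: K = 2 centroids, D = 2 dimensions.
centroids = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 1.0], [9.0, 10.0]]
v = vlad(descriptors, centroids)
print(len(v))    # K * D for the toy example
print(512 * 64)  # K * D for the setting in the notes (512 channels, K = 64)
```

The output length is always K x D, which is where the 32768-D figure for 512 channels and K=64 comes from.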

Training and Triplet Mining

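First, see Fig. 1.
Sampling strategy
- Typically three kinds
  - easy triplets > judged to contribute little to performance.
  - semi-hard triplets
  - hard triplets
- So the latter two, semi-hard & hard triplets, are combined and used depending on the situation.
  - The choice between the two works as follows:
    - Images are ranked by their distance to the query. If the first negative image appears at rank j and the image just before it, at rank j-1, is not a positive image, semi-hard triplets are used; otherwise, hard triplets.
    - If the positive image is too far from the query, overfitting or poor generalization may occur, so such samples are excluded.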
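For orientation, the three triplet categories can be written down with the commonly used definitions based on anchor-positive and anchor-negative distances. This is a minimal sketch of those general definitions, not the paper's rank-based selection rule described above; the function names and the margin value `0.1` are assumptions made for illustration.

```python
def triplet_loss(d_ap, d_an, margin=0.1):
    """Standard triplet margin loss on anchor-positive / anchor-negative distances."""
    return max(0.0, d_ap - d_an + margin)


def triplet_type(d_ap, d_an, margin=0.1):
    """Classify a triplet as 'easy', 'semi-hard' or 'hard' (margin is an assumed value)."""
    if d_an < d_ap:
        return "hard"       # negative is closer to the anchor than the positive
    if d_an < d_ap + margin:
        return "semi-hard"  # correct order, but the negative is inside the margin
    return "easy"           # margin already satisfied, loss is zero


for d_ap, d_an in [(0.5, 0.3), (0.5, 0.55), (0.5, 0.9)]:
    print(triplet_type(d_ap, d_an), round(triplet_loss(d_ap, d_an), 2))
```

Easy triplets give zero loss and therefore no gradient, which matches the note that they do little for performance.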
trainset: Google Landmark V2 dataset (data link)
- Cleaned version used: 1,580,470 images & 81,313 labels.
optimizer: Adam
lr: 10^-5 > decreased to 10^-6
image size > 336×336

Experiments

evaluation set
- As mentioned above, Oxford5k, Paris6k, INRIA Holidays dataset
  - Recent work usually also evaluates on the sets from Revisiting Oxford and Paris: Large-Scale Image Retrieval Benchmarking, but those are missing here.
Results
- 224 + 336 + 504 > multi-resolution over multi-scale input images
  - Note) this is a different thing from the multi-scale pooling discussed earlier.
- mAP
- All of these tests are results for VGG only!!!
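Since the results are reported as mAP, here is a minimal sketch of how average precision over a ranked result list is usually computed. It is a simplified version: it assumes every relevant image appears in the ranked list, and it ignores the "junk" images and interpolation details of the official Oxford/Paris evaluation protocol. The function names are hypothetical.

```python
def average_precision(ranked_relevance):
    """AP for one query; ranked_relevance[i] is True if the i-th result is relevant."""
    hits, precision_sum = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(all_rankings):
    """Mean of the per-query AP values."""
    return sum(average_precision(r) for r in all_rankings) / len(all_rankings)


rankings = [
    [True, False, True, False],   # AP = (1/1 + 2/3) / 2
    [False, True, False, False],  # AP = (1/2) / 1
]
print(mean_average_precision(rankings))
```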
