NetVLAD: CNN architecture for weakly supervised place recognition

What?

image feature, global descriptor
using weakly labelled data - only geo tagging
learnable VLAD = NetVLAD: turns VLAD, a hand-crafted feature algorithm, into a learnable feature trained with deep learning
triplet ranking loss adapted to weakly labelled data

NetVLAD

Topic: can landmarks (places) be recognized using only geo-tagged data?
NetVLAD is the existing VLAD made trainable
- i.e. the centroid vectors, assignments, etc. are learned
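VLAD:
- a bag-of-words style representation: k-means produces K centroids.
- for each centroid k, sum all residual vectors (descriptor minus centroid k, with the same D dimensions as the descriptor) of the descriptors that fall into its cell, i.e. the descriptors whose nearest centroid is k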
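
The hard-assignment aggregation described above can be sketched in a few lines of NumPy. This is only an illustration: the function name `vlad` and the toy centroids are made up here, the centroids would normally come from k-means, and any normalization of the result is left out.

```python
import numpy as np

def vlad(descriptors, centroids):
    """descriptors: (N, D), centroids: (K, D) -> unnormalized VLAD matrix (K, D)."""
    # Squared Euclidean distance from every descriptor to every centroid: (N, K)
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)  # hard assignment: index of the closest centroid
    K, D = centroids.shape
    V = np.zeros((K, D))
    for k in range(K):
        members = descriptors[nearest == k]
        if len(members):
            # Sum of residuals (descriptor - centroid) inside cell k
            V[k] = (members - centroids[k]).sum(axis=0)
    return V

# Toy data (hypothetical): two centroids, three 2-D descriptors
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
descriptors = np.array([[1.0, 0.0], [0.0, 2.0], [9.0, 10.0]])
V = vlad(descriptors, centroids)
```

The first two descriptors fall into the cell of the first centroid and the third into the cell of the second, so each row of `V` is the sum of the residuals in that cell.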
NetVLAD == the above, arranged so that it can be trained inside a network
- extract CNN local descriptors (conv feature map)
- VLAD uses hard assignment; NetVLAD replaces it with soft assignment.
  - CNN local descriptors: WxHxD, where D is the number of channels
  - so there are WxH features, each D-dimensional.
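- the soft assignment takes these as input and applies a conv layer with K filters of size 1x1xD (weights of shape 1x1xDxK), mapping WxHxD > WxHxK.
  - the filters play the role of the centroids: K centroids of dimension D (think of the K clusters of k-means)
- strictly speaking, the weight should be computed from the distance between each centroid and the descriptor; with soft assignment, the weight for each centroid is obtained in softmax form.
  - wx+b means conv(w, b)
  - VLAD uses hard assignment, so this weight was either 0 or 1.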
- VLAD core (c)
  - in the end, just like VLAD, NetVLAD outputs K vectors of dimension D (KxDx1).
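
The steps above (1x1 conv, softmax, residual aggregation) can be sketched as a NumPy forward pass. This is a minimal sketch, not the authors' implementation: the names `soft_assign` and `netvlad` are made up here, the 1x1 conv is written as a per-location matrix product, and the intra-normalization plus final L2 normalization follow the paper's description rather than these notes. The initialization `w = 2*alpha*c`, `b = -alpha*||c||^2` is the one that makes the softmax equal to a soft assignment based on squared distances; `alpha = 1.0` is an arbitrary value.

```python
import numpy as np

def soft_assign(X, weights, bias):
    """X: (N, D) local descriptors -> (N, K) softmax assignment weights."""
    logits = X @ weights + bias                      # 1x1 conv == per-location w.x + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    A = np.exp(logits)
    return A / A.sum(axis=1, keepdims=True)

def netvlad(fmap, weights, bias, centroids):
    """fmap: (H, W, D), weights: (D, K), bias: (K,), centroids: (K, D) -> (K*D,)."""
    D = fmap.shape[-1]
    X = fmap.reshape(-1, D)                          # W*H descriptors of dimension D
    A = soft_assign(X, weights, bias)                # (N, K)
    # V[k] = sum_i A[i, k] * (X[i] - centroids[k])
    V = A.T @ X - A.sum(axis=0)[:, None] * centroids
    V = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-12)  # intra-normalization
    v = V.reshape(-1)
    return v / (np.linalg.norm(v) + 1e-12)           # final L2 normalization

rng = np.random.default_rng(0)
H, W, D, K = 3, 4, 8, 5                              # toy sizes (hypothetical)
fmap = rng.normal(size=(H, W, D))
centroids = rng.normal(size=(K, D))
alpha = 1.0                                          # arbitrary value for illustration
weights = 2 * alpha * centroids.T                    # (D, K)
bias = -alpha * (centroids ** 2).sum(axis=1)         # (K,)
v = netvlad(fmap, weights, bias, centroids)
```

In training, `weights`, `bias` and `centroids` would all be free parameters updated by backpropagation; with a very large `alpha` the softmax approaches the 0/1 hard assignment of the original VLAD.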

triplet ranking loss

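triplet ranking loss adapted to weakly labelled data
close by location
That is, images taken close to each other should be more similar (in Euclidean distance) than images taken far apart. < This seems to be the paper's assumption. In practice it is likely to fail often, so one probably has to find images that are truly close (the same building?).
So the ranking loss used for training is defined over triplets:
- triplet: q (the query), a nearby image, and a far-away image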
From Google Street View Time Machine data,
- we obtain, for each query q:
  - {p_i^q}: a set of potential positives
    - the set of potential positives contains at least one positive image that should match the query, but we do not know which one. > If only "at least one" is guaranteed, isn't this a very noisy set? It might even contain none?
  - {n_j^q}: a set of definite negatives
best matching potential positive image
- the descriptor distance from the query to the best matching potential positive image must be smaller than the distance to every image in the "set of definite negatives".
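
The equation images are missing from these notes, so the following is a reconstruction in the paper's notation as I understand it: $d_\theta$ is the Euclidean distance between learned descriptors, $p^q_i$ are the potential positives and $n^q_j$ the definite negatives of query $q$.

```latex
% Best matching potential positive for query q
p^q_{i*} = \operatorname*{argmin}_{p^q_i} \; d_\theta(q, p^q_i)

% Required ordering against every definite negative
d_\theta(q, p^q_{i*}) < d_\theta(q, n^q_j), \qquad \forall j
```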
weakly supervised ranking loss
- sum over j: i.e. the sum of individual losses for the negative images (indexed by j)
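
A minimal NumPy sketch of such a loss for a single query, assuming the usual form: take the smallest squared distance over the potential positives, add a margin, subtract the squared distance to each negative, apply a hinge, and sum over the negatives. The function name and the margin value used in the example are illustrative choices, not taken from these notes.

```python
import numpy as np

def weak_triplet_loss(q, positives, negatives, margin=0.1):
    """q: (D,), positives: (P, D), negatives: (Nn, D) -> scalar loss."""
    d_pos = ((positives - q) ** 2).sum(axis=1)   # squared distance to each potential positive
    d_neg = ((negatives - q) ** 2).sum(axis=1)   # squared distance to each definite negative
    best_pos = d_pos.min()                       # best matching potential positive
    # Hinge loss per negative, summed over j
    return np.maximum(best_pos + margin - d_neg, 0.0).sum()

# Toy descriptors (hypothetical values)
q = np.array([0.0, 0.0])
positives = np.array([[3.0, 0.0], [1.0, 0.0]])   # squared distances 9 and 1
negatives = np.array([[1.0, 0.5], [5.0, 0.0]])   # squared distances 1.25 and 25
loss = weak_triplet_loss(q, positives, negatives, margin=0.5)
```

Only the first negative violates the margin here (1 + 0.5 - 1.25 = 0.25); the second is far enough away and contributes zero.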

Experiments
Datasets: Pittsburgh (Pitts250k), Tokyo 24/7, TokyoTM, etc.
Evaluation metric: a query counts as correctly localized if at least one of the top N retrieved database images is within d = 25 meters from the ground truth position of the query.
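
The metric above can be sketched as a recall@N computation. This is an illustrative sketch, assuming image positions are given in a metric coordinate system (e.g. UTM, in meters) and retrieval ranks database images by Euclidean descriptor distance; `recall_at_n` and its parameters are names made up for this example.

```python
import numpy as np

def recall_at_n(query_desc, db_desc, query_pos, db_pos, n_values=(1, 5, 10), dist_thresh=25.0):
    """Fraction of queries with at least one top-N result within dist_thresh meters."""
    # Pairwise squared descriptor distances: (Q, M)
    d2 = ((query_desc[:, None, :] - db_desc[None, :, :]) ** 2).sum(axis=2)
    ranked = np.argsort(d2, axis=1)              # database indices, best match first
    recalls = {}
    for n in n_values:
        top = ranked[:, :n]                      # (Q, n)
        # Geographic distance from each query to its top-n retrieved images
        geo = np.linalg.norm(db_pos[top] - query_pos[:, None, :], axis=2)
        recalls[n] = float((geo <= dist_thresh).any(axis=1).mean())
    return recalls

# Toy example (hypothetical): 2 queries, 3 database images
db_desc = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
db_pos = np.array([[0.0, 0.0], [100.0, 0.0], [10.0, 0.0]])
query_desc = np.array([[0.1, 0.0], [0.9, 0.0]])
query_pos = np.array([[5.0, 0.0], [5.0, 0.0]])
recalls = recall_at_n(query_desc, db_desc, query_pos, db_pos, n_values=(1, 2))
```

In the toy data the first query retrieves a nearby image at rank 1, while the second only finds one at rank 2, so recall@1 is 0.5 and recall@2 is 1.0.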
Implementation details
- base architectures: VGG-16, AlexNet
- K = 64, resulting in 32k-D (VGG-16) and 16k-D (AlexNet) image representations
- Dimensionality reduction: PCA to 4096-D
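
The numbers above can be checked with a little arithmetic, and the reduction step sketched with a plain SVD-based PCA. Assumptions to note: the channel counts (512 for the last conv layer of VGG-16, 256 for AlexNet) are the standard ones for those networks, the whitening and final L2 normalization follow the paper's description rather than these notes, and the toy sizes (64-D reduced to 16-D) stand in for the real 32k-D to 4096-D case. The function names are made up.

```python
import numpy as np

K = 64
# NetVLAD output size is K * D, where D is the channel count of the last conv layer
dims = {"VGG-16 (D=512)": K * 512, "AlexNet (D=256)": K * 256}

def fit_pca_whiten(X, out_dim, eps=1e-8):
    """X: (n_samples, in_dim) -> (mean, components, per-component std)."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal directions, ordered by decreasing variance
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    comps = Vt[:out_dim]
    scale = S[:out_dim] / np.sqrt(len(X) - 1) + eps  # std along each component
    return mean, comps, scale

def project(x, mean, comps, scale):
    """Project, whiten, then L2-normalize one descriptor or a batch."""
    y = (x - mean) @ comps.T / scale
    return y / np.linalg.norm(y, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))                   # toy stand-in for image descriptors
mean, comps, scale = fit_pca_whiten(X, out_dim=16)
Y = project(X, mean, comps, scale)               # (200, 16), unit-length rows
```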

chullhwan-song / Reading-Paper

NetVLAD: CNN architecture for weakly supervised place recognition #3
