Context-Aware Visual Compatibility Prediction

Abstract

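Fashion recommendation.
The problem: given two or more fashion items, how do we decide whether they are compatible, i.e. whether they go well together and form a visually appealing outfit?
Personal preferences, shaped by time, place, and social attitudes, act as a bias.
This work predicts the compatibility between two items based not only on their visual attributes but also on their context. > Do the two items go well together or not?
To that end, the context of a product is defined as the set of products that are known to be compatible with it.
The proposed model is not metric learning that relies on pairwise comparisons between item features alone.
Instead, it solves the problem with a graph neural network that learns to generate "product embeddings" conditioned on their context.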

Proposed Method

Using the graph as structural information yields better "product embeddings".
The proposed model is based on the "graph auto-encoder (GAE) framework".
- paper: "Variational graph auto-encoders"
The encoder takes an incomplete graph as input and computes an embedding for each node.
The decoder then uses these node embeddings to predict the missing edges in the graph.
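Undirected graph: $G = (V, E)$
- N vertices/nodes: $i \in V$
- Edges: $(i, j) \in E$ > an edge connecting vertices i and j
Feature vectors of all nodes in the graph: $X = \{\vec{x}_0, \vec{x}_1, \dots, \vec{x}_{N-1}\}$, with $X \in \mathbb{R}^{N \times F}$
- feature vector of the i-th node: $\vec{x}_i \in \mathbb{R}^{F}$ (the i-th row of X)
The graph is represented by an adjacency matrix: $A \in \mathbb{R}^{N \times N}$
- $A_{i,j} = 1$ if an edge exists between nodes i and j, otherwise $A_{i,j} = 0$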
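The adjacency matrix defined above can be built from an edge list in a few lines. This is only a minimal sketch: the function name `build_adjacency` and the four-item outfit graph are made up for illustration and are not taken from the paper or its code.

```python
def build_adjacency(num_nodes, edges):
    """Return the N x N adjacency matrix of an undirected graph as nested lists."""
    adjacency = [[0] * num_nodes for _ in range(num_nodes)]
    for i, j in edges:
        # Undirected graph: an edge (i, j) is also an edge (j, i).
        adjacency[i][j] = 1
        adjacency[j][i] = 1
    return adjacency


# Hypothetical outfit graph: 0 = shirt, 1 = trousers, 2 = shoes, 3 = hat.
edges = [(0, 1), (1, 2), (0, 3)]
A = build_adjacency(4, edges)
for row in A:
    print(row)
```

Because the graph is undirected, the resulting matrix is symmetric, and the diagonal stays 0 until self-connections are added explicitly.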
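The model has an encoder-decoder structure.
- encoder: $H = f_{enc}(X, A)$
- decoder: $A = f_{dec}(H)$
The encoder takes the features X and transforms them into a new (latent) representation H.
- H has the same layout as X, i.e. the i-th row $\vec{h}_i$ is the encoded feature vector of node i
  - So the node layout of the input graph appears to be preserved: one row per node.
- the distance between two points can be mapped to the probability of whether or not an edge exists between them. > the probability that an edge exists
The decoder maps H to an adjacency matrix A that defines the new structure. This is not the same as the A given to the encoder: the encoder receives the incomplete adjacency matrix defined above, while the decoder outputs the reconstructed one, including the predicted edges.
- to compute this probability using the features of each node > the probability is computed from the (embedded) features of each node
- that is, $p = f_{dec}(\vec{h}_i, \vec{h}_j)$
  - which represents the compatibility between nodes (items) i and j (the end goal)
To summarize:
- The encoder is a Graph Convolutional Network (GCN).
- The decoder uses metric learning to predict a compatibility score for a pair of products (i, j).
Figure: overview of the full algorithm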

Encoder

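From the point of view of a single node i, the encoder transforms the initial features x_i into a new representation h_i.
The initial features x_i are CNN features > they only carry information about the product itself.
But we want the representation to capture structural information as well as these product attributes, by means of the encoder.
In other words, we want the new representation (embedding) to include information about the neighbouring nodes $\mathcal{N}_i$ as well as about node i itself.
The encoder is therefore a function that aggregates the local neighbourhood around a node.
This function is implemented as a Graph Convolutional Network (GCN).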
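A single layer: $\vec{z}_i^{(l+1)} = \mathrm{ReLU}\left(\vec{z}_i^{(l)} \Theta_0^{(l)} + \sum_{j \in \mathcal{N}_i} \frac{1}{|\mathcal{N}_i|} \vec{z}_j^{(l)} \Theta_1^{(l)}\right)$
- z is the hidden activation
- $\vec{z}_i^{(l)}$ is the input of node i at layer l, so the output is $\vec{z}_i^{(l+1)}$
In matrix form, operating on all nodes of the graph at once: $Z^{(l+1)} = \mathrm{ReLU}\left(\sum_{s=0}^{S} \tilde{A}_s Z^{(l)} \Theta_s^{(l)}\right)$
- $Z^{(0)} = X$: input of the first layer
- $\tilde{A}_s$: the normalized s-th step adjacency matrix
  - $A_0 = I_N$ contains the self-connections
  - $A_1 = A + I_N$ contains the first-step neighbours together with the self-connections
  - $\tilde{A} = D^{-1} A$, i.e. A is normalized row-wise using the diagonal degree matrix $D_{ii} = \sum_j A_{i,j}$ (need to check the source code for the details)
  - Context information is controlled by the parameter S, which represents the depth of the neighbourhood considered during training.
    - the neighbourhood at depth s of node i is the set of all nodes that are at distance (number of edges traveled) at most s from i.
    - In the experiments S = 1 > only depth-1 neighbours are used in every layer.
  - $\Theta_s^{(l)}$ is the matrix containing the trainable parameters of layer l
  - Applied in each layer: batch normalization, dropout, weight regularization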
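The matrix form of the layer with S = 1 can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the names `row_normalize` and `gcn_layer`, the toy graph, and the random weights are hypothetical, and batch normalization, dropout, and weight regularization are left out.

```python
import numpy as np


def row_normalize(adjacency):
    """Compute D^{-1} A, where D is the diagonal degree matrix of A."""
    degree = adjacency.sum(axis=1, keepdims=True)
    # Guard against isolated nodes (degree 0) to avoid division by zero.
    return adjacency / np.maximum(degree, 1.0)


def gcn_layer(Z, A, thetas):
    """One layer Z' = ReLU(sum_s A~_s Z Theta_s) for S = 1 (two weight matrices)."""
    n = A.shape[0]
    steps = [np.eye(n), A + np.eye(n)]  # A_0 = I_N, A_1 = A + I_N
    out = np.zeros((n, thetas[0].shape[1]))
    for A_s, theta in zip(steps, thetas):
        out += row_normalize(A_s) @ Z @ theta
    return np.maximum(out, 0.0)  # ReLU


rng = np.random.default_rng(0)
# Toy chain graph 0 - 1 - 2 (assumption, for illustration only).
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = rng.normal(size=(3, 4))  # N = 3 nodes, F = 4; stand-in for CNN features
thetas = [rng.normal(size=(4, 2)) for _ in range(2)]  # Theta_0, Theta_1, F' = 2
H = gcn_layer(X, A, thetas)
print(H.shape)  # (3, 2)
```

Each row of `H` mixes the node's own features with the mean over itself and its depth-1 neighbours, which is how the context enters the embedding. Stacking several such layers widens the effective neighbourhood even with S = 1.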

Decoder

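The decoder is a function that computes the probability that two nodes are connected.
For this, metric learning is applied.
- Note that similarity and compatibility are different things!
  - similarity: how alike two shirts are, e.g. in color or shape
  - compatibility: how well two items go together
Metric function
- its output: a value between 0 and 1 = probability p
The probability p output by the decoder function: $p = \sigma\left(|\vec{h}_i - \vec{h}_j| \, \vec{\omega}^T + b\right)$
- $|\cdot|$ is the (element-wise) absolute value.
- $\vec{\omega}$ and b are learnable parameters
- $\sigma$ is the sigmoid function, so the output lies between 0 and 1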
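The decoder formula is small enough to write out directly. The sketch below assumes plain Python lists as embeddings; the function name `compatibility`, the embedding values, and the parameters `omega` and `b` are hand-picked for illustration and are not learned or taken from the paper.

```python
import math


def compatibility(h_i, h_j, omega, b):
    """Return p = sigmoid(|h_i - h_j| . omega + b), a value in (0, 1)."""
    # Element-wise absolute difference, weighted by omega, plus the bias b.
    score = sum(abs(a - c) * w for a, c, w in zip(h_i, h_j, omega)) + b
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical 3-dimensional embeddings and hand-picked (not learned) parameters.
h_shirt = [0.2, 0.9, 0.1]
h_trousers = [0.3, 0.1, 0.4]
omega = [1.0, -2.0, 0.5]
b = 0.5

p = compatibility(h_shirt, h_trousers, omega, b)
print(round(p, 3))
```

The absolute difference makes the score symmetric in i and j, which fits an undirected graph. Since the weights in omega may be negative, a large difference along one dimension can raise or lower the score, so the function can express compatibility rather than only similarity.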
