TF-Ranking provides the following components: (1) a data reader, (2) a transform function, (3) a scoring function, (4) ranking loss functions, (5) evaluation metrics, (6) a ranking head, and (7) a model_fn builder.
For example:
```
1 qid:10 32:0.14 48:0.97 51:0.45
0 qid:10 1:0.15 31:0.75 32:0.24 49:0.6
2 qid:10 1:0.71 2:0.36 31:0.58 51:0.12
0 qid:20 4:0.79 31:0.01 33:0.05 35:0.27
3 qid:20 1:0.42 28:0.79 35:0.30 42:0.76
```
In the above example, the dataset contains two queries. Query "10" has 3
documents, two of which are relevant, with grades 1 and 2. Similarly, query "20"
has 1 relevant document. Note that query-document pairs may have different
sets of zero-valued features, and as such their feature vectors may only
partly overlap or not at all.
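To make the format concrete, here is a minimal reader sketch in plain Python (the function `read_libsvm_ranking` and its grouping-by-qid behavior are my own illustration, not TF-Ranking's bundled data reader):

```python
from collections import defaultdict

def read_libsvm_ranking(path):
    """Parse a LibSVM-style ranking file into {qid: [(label, {index: value})]}.

    Illustrative sketch only; TF-Ranking ships its own data reader.
    """
    queries = defaultdict(list)
    with open(path) as f:
        for line in f:
            line = line.split("#")[0].strip()  # drop any info-string comment
            if not line:
                continue
            tokens = line.split()
            label = int(tokens[0])             # 1st field: relevance grade
            qid = tokens[1].split(":")[1]      # 2nd field: qid:<id>
            features = {}
            for tok in tokens[2:]:             # remaining fields: index:value
                index, value = tok.split(":")
                features[int(index)] = float(value)  # zero features are absent
            queries[qid].append((label, features))
    return dict(queries)
```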
Another example, with three queries:

```
3 qid:1 1:1 2:1 3:0 4:0.2 5:0 # 1A
2 qid:1 1:0 2:0 3:1 4:0.1 5:1 # 1B
1 qid:1 1:0 2:1 3:0 4:0.4 5:0 # 1C
1 qid:1 1:0 2:0 3:1 4:0.3 5:0 # 1D
1 qid:2 1:0 2:0 3:1 4:0.2 5:0 # 2A
2 qid:2 1:1 2:0 3:1 4:0.4 5:0 # 2B
1 qid:2 1:0 2:0 3:1 4:0.1 5:0 # 2C
1 qid:2 1:0 2:0 3:1 4:0.2 5:0 # 2D
2 qid:3 1:0 2:0 3:1 4:0.1 5:1 # 3A
3 qid:3 1:1 2:1 3:0 4:0.3 5:0 # 3B
4 qid:3 1:1 2:0 3:0 4:0.4 5:1 # 3C
1 qid:3 1:0 2:1 3:1 4:0.5 5:0 # 3D
```
From a training file like the one above, the following set of pairwise constraints is generated (examples are referred to by the info-string after the # character; a sketch of this pair-generation step follows the field description below):
1A>1B, 1A>1C, 1A>1D, 1B>1C, 1B>1D, 2B>2A, 2B>2C, 2B>2D, 3C>3A, 3C>3B, 3C>3D, 3B>3A, 3B>3D, 3A>3D
* 1st field: the relevance label; 2nd field: the query information (qid)
* A single query can have multiple entries; the ranking within a query can be read off the labels, and, as described above, each line represents a single document.
* 3rd field onward: the features
* Features are written as index:value pairs; zero-valued features can be omitted.
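As a sketch of how the pairwise constraints above follow from the labels, the helper below (my own illustration; `pairwise_constraints` is not a TF-Ranking function) emits a winner/loser pair for every two documents of the same query whose labels differ:

```python
from itertools import combinations

def pairwise_constraints(query_docs):
    """Yield (winner, loser) name pairs for the documents of one query.

    query_docs: list of (name, label), e.g. [("3A", 2), ("3B", 3), ...].
    Illustrative only; pairwise ranking losses do this implicitly.
    """
    for (name_i, label_i), (name_j, label_j) in combinations(query_docs, 2):
        if label_i > label_j:
            yield name_i, name_j
        elif label_j > label_i:
            yield name_j, name_i
        # equal labels generate no constraint

# For qid:3 this yields the six constraints
# 3B>3A, 3C>3A, 3A>3D, 3C>3B, 3B>3D, 3C>3D (in combination order):
print(list(pairwise_constraints([("3A", 2), ("3B", 3), ("3C", 4), ("3D", 1)])))
```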
### Feature Transformation with transform_fn
* Fig. 1: transform_fn
* Sparse features (i.e., words or n-grams) are converted into dense features (embedding features such as word2vec).
* 2-D dense tensors: context features
* 3-D tensors: per-item features
![image](https://user-images.githubusercontent.com/40360823/60855833-27273b80-a240-11e9-9067-28db31472624.png)
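The pattern in the figure looks roughly like the sketch below, based on TF-Ranking's estimator-era API; the feature-column arguments and `input_size` are placeholders to be supplied by the caller:

```python
import tensorflow_ranking as tfr

def make_transform_fn(context_feature_columns, example_feature_columns, input_size):
    """Build a transform_fn that turns raw (sparse) features into dense tensors."""
    def _transform_fn(features, mode):
        # Returns 2-D dense tensors for context features and
        # 3-D tensors for per-item (example) features.
        context_features, example_features = tfr.feature.encode_listwise_features(
            features=features,
            input_size=input_size,
            context_feature_columns=context_feature_columns,
            example_feature_columns=example_feature_columns,
            mode=mode)
        return context_features, example_features
    return _transform_fn
```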
### Feature Interactions using scoring_fn
* This is the actual network (scoring) part.
* The example here is a 3-layer feedforward neural network with ReLUs.
![image](https://user-images.githubusercontent.com/40360823/60933668-45537100-a2fe-11e9-9b48-1a90992519ee.png)
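A sketch of such a scoring function follows; the hidden-layer widths [64, 32, 16] are my assumption for the 3-layer ReLU network:

```python
import tensorflow as tf

def make_score_fn(example_feature_columns):
    """Build a scoring function: a 3-layer ReLU feedforward net over item features."""
    def _score_fn(context_features, group_features, mode, params, config):
        # Concatenate the per-item dense features into one input vector.
        input_layer = tf.concat([
            tf.compat.v1.layers.flatten(group_features[name])
            for name in sorted(example_feature_columns)
        ], axis=1)
        cur_layer = input_layer
        for layer_width in [64, 32, 16]:   # assumed widths for the 3 hidden layers
            cur_layer = tf.compat.v1.layers.dense(
                cur_layer, units=layer_width, activation=tf.nn.relu)
        # One score (logit) per item in the group.
        return tf.compat.v1.layers.dense(cur_layer, units=1)
    return _score_fn
```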
### Ranking Losses
* The loss key is an **enum** over supported loss functions
![image](https://user-images.githubusercontent.com/40360823/60935605-64a1cc80-a305-11e9-9c24-397f22ea9fba.png)
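For example, a loss function can be built from one of those enum keys (the choice of softmax loss here is arbitrary):

```python
import tensorflow_ranking as tfr

# Build a loss function from an enum key; other keys include
# PAIRWISE_LOGISTIC_LOSS, LIST_MLE_LOSS, etc.
loss_fn = tfr.losses.make_loss_fn(tfr.losses.RankingLossKey.SOFTMAX_LOSS)
```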
### Ranking Metrics
* Evaluation metrics, e.g., NDCG.
![image](https://user-images.githubusercontent.com/40360823/60942380-64153000-a31d-11e9-8c94-ac9a66e44b21.png)
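For example, NDCG at several cutoffs can be built with `make_ranking_metric_fn` (the cutoff list here is my choice):

```python
import tensorflow_ranking as tfr

def eval_metric_fns():
    """NDCG at several cutoffs, keyed by display name."""
    return {
        "metric/ndcg@%d" % topn: tfr.metrics.make_ranking_metric_fn(
            tfr.metrics.RankingMetricKey.NDCG, topn=topn)
        for topn in [1, 3, 5, 10]
    }
```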
### Ranking Head
* This appears to be a wrapper around the losses & metrics described above.
![image](https://user-images.githubusercontent.com/40360823/60942456-a3dc1780-a31d-11e9-9dcc-1a7144c7f590.png)
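A sketch of building the head from the loss and metrics above; the Adagrad optimizer and learning rate are my assumptions:

```python
import tensorflow as tf
import tensorflow_ranking as tfr

# The head bundles the loss, metrics, and a train op behind one interface.
ranking_head = tfr.head.create_ranking_head(
    loss_fn=tfr.losses.make_loss_fn(tfr.losses.RankingLossKey.SOFTMAX_LOSS),
    eval_metric_fns=eval_metric_fns(),  # from the metrics sketch above
    train_op_fn=lambda loss: tf.compat.v1.train.AdagradOptimizer(0.1).minimize(
        loss, global_step=tf.compat.v1.train.get_global_step()))
```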
### Model Builder
* Roughly the main entry point; it assembles the pieces above into a model_fn.
![image](https://user-images.githubusercontent.com/40360823/60942487-c5d59a00-a31d-11e9-9dfa-1c78389a8d48.png)
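A sketch that ties the earlier pieces together into a model_fn and an Estimator, reusing the hypothetical helpers from the sketches above:

```python
import tensorflow as tf
import tensorflow_ranking as tfr

# Combine transform_fn, score_fn, and the ranking head into a model_fn.
model_fn = tfr.model.make_groupwise_ranking_fn(
    group_score_fn=make_score_fn(example_feature_columns),
    group_size=1,                      # score documents individually
    transform_fn=make_transform_fn(
        context_feature_columns, example_feature_columns, input_size),
    ranking_head=ranking_head)
estimator = tf.estimator.Estimator(model_fn=model_fn)
```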
## USE CASES
* Google services where TF-Ranking is currently applied:
  * Gmail search
  * Document recommendation in Google Drive
* These services are trained on massive click-log data.
* Reported to perform better than RankLib.
* Moreover, in the Gmail service, the model was built so that "sparse textual features", which rankers usually cannot handle well, are applied effectively.
* Sparse textual features: my personal guess is words from a very large and sparse dictionary.
### Gmail Search
* Gmail is trained on its search logs, i.e., clicks.
* Data is collected anonymously.
* Two kinds of features are constructed: dense and sparse.
  * dense features
  * sparse features: word- and character-level n-grams
* 250M queries
* Losses & metrics are "weighted by Inverse Propensity Weighting".
### Document Recommendation in Drive
* Trained on user click data.
* Paper: https://arxiv.org/abs/1812.00073
* Code: https://github.com/tensorflow/ranking