ChuXiaokai / baidu_ultr_dataset

an unbiased-learning-to-rank dataset from Baidu

How to use other features? #14

Open we1559 opened 1 year ago

we1559 commented 1 year ago

Hi, I see that only the query, title, and abstract are used to train the model, but there are many other features in the dataset, including Continuous Value and Discrete Number fields.

How could I use these features to finetune an unbiased LTR model such as DLA?

It seems strange to simply concatenate the fc3 output of Transformer4Ranking/model.py with the original feature values.

Hope to get your reply.

zoulixin93 commented 1 year ago
  1. The other features, mainly user behaviors and display features [1], can serve as strong signals for predicting a user's click, but they may also contain inherent bias. As a result, some studies have utilized these features to estimate relevance [2,3] with careful design. However, due to the long-tail distribution of user queries, it may be challenging to incorporate these features into the evaluation set (we have tried joining query-document pairs across the training and evaluation sets and using the joined pairs, but this changes the distribution of the dataset).
  2. For extracting features for DLA, please refer to the GitHub repository: https://github.com/xuanyuan14/THUIR_WSDM_Cup. This winning solution provides code for extracting DLA-like features.
  3. I do not quite understand the problem. The embedding generated by the Transformer fc3 can be viewed as "Continuous Value". By the way, you can modify the structure as needed.

[1] A Large Scale Search Dataset for Unbiased Learning to Rank. https://arxiv.org/abs/2207.03051
[2] Can Clicks Be Both Labels and Features? Unbiased Behavior Feature Collection and Uncertainty-Aware Learning to Rank.
[3] Approximated Doubly Robust Search Relevance Estimation.
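Point 3 can be sketched roughly as follows, in PyTorch for illustration (the repo's baseline may use a different framework, and all class, layer, and parameter names here are hypothetical, not from the repo): a scoring head that treats the fc3 embedding as continuous values and concatenates it with extra continuous features and embedded discrete features.

```python
import torch
import torch.nn as nn

class FeatureConcatRanker(nn.Module):
    """Hypothetical scoring head: concatenates a transformer embedding
    (e.g. the fc3 output) with extra continuous and discrete features."""

    def __init__(self, emb_dim, num_continuous, discrete_cardinality, discrete_emb_dim=8):
        super().__init__()
        # Discrete features are embedded rather than fed in as raw integer IDs.
        self.discrete_emb = nn.Embedding(discrete_cardinality, discrete_emb_dim)
        self.scorer = nn.Sequential(
            nn.Linear(emb_dim + num_continuous + discrete_emb_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, emb, continuous, discrete):
        # Concatenate all feature groups and score each query-document pair.
        x = torch.cat([emb, continuous, self.discrete_emb(discrete)], dim=-1)
        return self.scorer(x).squeeze(-1)

# Toy usage with random tensors standing in for a batch of 4 pairs.
model = FeatureConcatRanker(emb_dim=768, num_continuous=5, discrete_cardinality=10)
scores = model(torch.randn(4, 768), torch.randn(4, 5), torch.randint(0, 10, (4,)))
print(scores.shape)  # torch.Size([4])
```

Whether to feed the raw feature values or normalized versions is a design choice; continuous features with very different scales usually benefit from normalization before the concat.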


we1559 commented 1 year ago

Thanks for your reply, I will try the THUIR_WSDM_Cup.

Actually, I'm confused about how to use other features such as tf-idf with the query-title-abstract transformer model.

The only design I can think of is a model that combines the embedding of the transformer with the original values of the other features, but that seems strange.

Could you give me some advice?

rowedenny commented 1 year ago

I think you may want to look into "hybrid retrieval" models, which combine sparse retrieval signals, such as BM25, with dense retrieval signals, such as the feature output from this repo.
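A rough illustration of that fusion idea (not code from this repo; the function names and the toy scores are hypothetical): normalize the sparse and dense scores to a comparable range, then take a weighted sum.

```python
def min_max_normalize(scores):
    """Rescale a list of scores to [0, 1] so sparse and dense scores are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_score(sparse_score, dense_score, alpha=0.5):
    """Weighted-sum fusion of a lexical score (e.g. BM25 or tf-idf)
    and a dense model score; alpha controls the sparse weight."""
    return alpha * sparse_score + (1 - alpha) * dense_score

# Toy example: fuse BM25 and dense scores for three candidate documents.
bm25 = min_max_normalize([12.3, 4.1, 7.8])
dense = min_max_normalize([0.91, 0.33, 0.85])
fused = [hybrid_score(s, d, alpha=0.3) for s, d in zip(bm25, dense)]
print(fused)
```

Other common fusion schemes (e.g. reciprocal rank fusion) combine ranks instead of scores and avoid the normalization step entirely.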