This is the PyTorch implementation of our SIGIR 2022 long paper:

> Xinyan Fan, Jianxun Lian, Wayne Xin Zhao, Zheng Liu, Chaozhuo Li and Xing Xie (2022). "Ada-Ranker: A Data Distribution Adaptive Ranking Paradigm for Sequential Recommendation." In SIGIR 2022. PDF
Environments:

```
python==3.8
pytorch==1.11.0
cudatoolkit==10.1
```

Install the dependencies with:

```
pip install -r requirements.txt
```
You can download the processed ML10M data from this link, and put it in your dataset path.
### data_process/

You can also use the pipeline in `data_process/` to generate the processed ML10M dataset automatically. Quick start:

```
sh run_dataprocess.sh
```

In general, preparing the final datasets that Ada-Ranker uses takes two steps: producing an input file with the 4 fields described below, and letting the pipeline convert it into the final training files. In this project, we provide an example of processing the original ML10M dataset; see more details in `data_process/ml10m_prepare.py`.
For other datasets, you can use your own script to produce the input data, whose format should look like this:

```
user_id  item_id  cate_id  timestamp
1        122      [5, 15]  838985046
139      122      [5, 15]  974302621
149      122      [5, 15]  1112342322
182      122      [5, 15]  943458784
215      122      [5, 15]  1102493547
217      122      [5, 15]  844429650
```

The input data should include 4 fields: `user_id`, `item_id`, `cate_id`, and `timestamp`, where each element in `cate_id` is a list containing the categories of the target item. You only need to provide a `.tsv` file containing the input data with the above 4 fields, and the program will automatically process it into the final training sets that Ada-Ranker needs. See more details in `data_process/preprocess.py`.
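For reference, here is a minimal sketch (not part of the repo) of loading such an input file, assuming pandas, a header row, and that `cate_id` is stored as a stringified list; the file name `my_dataset.tsv` is just a placeholder:

```python
import ast

import pandas as pd

# Assumption: tab-separated file with a header row of the 4 required fields.
df = pd.read_csv("my_dataset.tsv", sep="\t")

# "[5, 15]" (string) -> [5, 15] (list of category ids)
df["cate_id"] = df["cate_id"].apply(ast.literal_eval)

print(df[["user_id", "item_id", "cate_id", "timestamp"]].head())
```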
The main output files include:

`user_item_cate_time.tsv` is the user-item interaction file with each item's categories and action timestamp (after ID hashing). This file can be used to pre-train item embeddings with word2vec.

`item_emb_64.txt` is optional and initializes the item embedding table from a pre-trained embedding table (produced by word2vec; see more details in `Ada-Ranker/data_process/helper/word2vec.py`).
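As a rough illustration of that pre-training step (the authoritative version is `Ada-Ranker/data_process/helper/word2vec.py`), here is a hedged sketch using gensim; the hyper-parameters are assumptions, and only the 64-dim size mirrors `item_emb_64.txt`:

```python
import pandas as pd
from gensim.models import Word2Vec

# Treat each user's chronological item sequence as one "sentence".
df = pd.read_csv("user_item_cate_time.tsv", sep="\t")
sentences = (
    df.sort_values(["user_id", "timestamp"])
      .groupby("user_id")["item_id"]
      .apply(lambda items: [str(i) for i in items])  # word2vec expects string tokens
      .tolist()
)

# 64-dim vectors to match item_emb_64.txt; the other settings are illustrative.
model = Word2Vec(sentences, vector_size=64, window=5, min_count=1, workers=4)
model.wv.save_word2vec_format("item_emb_64.txt")
```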
`train.pkl`, `valid.pkl`, and `test.pkl` are needed to train the model, and their structures are identical. Each `.pkl` file is converted from the DataFrame in the corresponding tsv file; see `Ada-Ranker/data_process/helper/datasaver.py` for how the `.pkl` files are produced.

`train.pkl` contains 6 fields produced by the previous steps: ['user_id', 'item_id', 'cate_id', 'item_seq', 'item_seq_len', 'neg_items']. The DataFrame looks like:

```
user_id  item_id  cate_id  item_seq                  item_seq_len  neg_items
2        36       [19]     [32, 33, 34, 35]          4             [13816, 30633, 29780, 39149, 20546, 46865, 13353, 45664, 49311, 14805, 28765, 7435, 6579, 33844, 43311, 30097, 42826, 23042, 1624]
2        37       [19]     [32, 33, 34, 35, 36]      5             [41, 12854, 13815, 20934, 3494, 21349, 17290, 12898, 26532, 1942, 3544, 7712, 26479, 1740, 46791, 13696, 3316, 15662, 30455]
2        38       [19]     [32, 33, 34, 35, 36, 37]  6             [1360, 39105, 29735, 15763, 7595, 2777, 48139, 5405, 5317, 33184, 11442, 13402, 8480, 9657, 15475, 24955, 4643, 7752, 19465]
```
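Assuming each `.pkl` file is a pickled pandas DataFrame (as the conversion above suggests), a quick way to inspect one looks like this sketch:

```python
import pandas as pd

train = pd.read_pickle("train.pkl")  # assumption: pandas-pickled DataFrame

print(train.columns.tolist())
# ['user_id', 'item_id', 'cate_id', 'item_seq', 'item_seq_len', 'neg_items']

row = train.iloc[0]
print(row["item_seq"], row["item_seq_len"], len(row["neg_items"]))
```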
You can use the shell commands below to train the model (you only need to change `MY_DIR` and `ALL_DATA_ROOT`).

Train the base model:

```
sh run_train_base.sh
```

Train Ada-Ranker in an end-to-end way:

```
sh run_train_adaranker.sh
```

Load a pre-trained base model and finetune all parameters in Ada-Ranker (set `freeze=0` and set `SAVED_MODEL_PATH` to the path of the pre-trained base model):

```
sh run_finetune.sh
```

Load a pre-trained base model and only finetune the adaptation parameters in Ada-Ranker (set `freeze=1` and set `SAVED_MODEL_PATH` to the path of the pre-trained base model):

```
sh run_finetune.sh
```

Provide a trained model and run inference on a specific test set:

```
sh run_inference.sh
```
See more details of the main files in `Main/`.
The output path will look like this:

```
AdaRanker/result/
- Ada-Ranker/
  - GRU4Rec_ML10M_train/
    - saved/
    - timestamp.log
  - GRU4Rec_ML10M_finetune/
    - saved/
    - timestamp.log
- Base/
  - GRU4Rec_ML10M_train/
    - saved/
    - timestamp.log
```
### Data/get_data.py

`train.pkl`, `valid.pkl`, and `test.pkl` are needed to train the model, and their structures are identical. Each `.pkl` file is converted from the DataFrame in the corresponding tsv file. (See `data_process/` for how to prepare them.)
### config/

All configuration files are in `config/`. `overall.yaml` contains the basic training settings. Parameter settings of all models are in `config/model_config/`. In `config/dataset_config/`, you need to set `user_num` and `item_num` in the corresponding yaml files.

You can also set parameters directly on the command line, for example:

```
python Main/main_train.py --batch_size=1024
```

See `Utils/init_config.py` for how all configurations are loaded.
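The yaml-plus-command-line override pattern described above might look roughly like this hedged sketch (the real logic lives in `Utils/init_config.py`; the flag handling here is illustrative):

```python
import argparse

import yaml

# Load the basic training settings first.
with open("config/overall.yaml") as f:
    config = yaml.safe_load(f)

# Command-line arguments override the yaml defaults.
parser = argparse.ArgumentParser()
parser.add_argument("--batch_size", type=int)
args, _ = parser.parse_known_args()

for key, value in vars(args).items():
    if value is not None:
        config[key] = value

print(config.get("batch_size"))
```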
### Trainer/train.py

A batch of data is organized in a dictionary called `interaction`. If you change the fields in the dataset, you also need to change this part.
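For orientation, a batch might look like the hedged sketch below, mirroring the six dataset fields; the exact keys, padding scheme, and shapes are assumptions to be checked against `Trainer/train.py`:

```python
import torch

# One batch of two samples; shapes are (batch,) or (batch, length).
interaction = {
    "user_id":      torch.tensor([2, 2]),
    "item_id":      torch.tensor([36, 37]),                # target items
    "cate_id":      [[19], [19]],                          # category lists per target
    "item_seq":     torch.tensor([[32, 33, 34, 35, 0],
                                  [32, 33, 34, 35, 36]]),  # zero-padded histories
    "item_seq_len": torch.tensor([4, 5]),                  # true sequence lengths
    "neg_items":    torch.randint(1, 50000, (2, 19)),      # sampled negatives
}
```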
### Model/model.py

This framework includes 7 basic sequential recommender models: MF, GRU4Rec, SASRec, NARM, NextItNet, SRGNN, and SHAN. The loss type is BCE loss, and the prediction layer uses a 2-layer MLP to predict scores.

To implement a new model, you need to complete the functions `_define_model_layers()` and `forward()` in its class.
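A hedged sketch of such a model is shown below; the two method names come from `Model/model.py`, while the base class (`nn.Module` here instead of the repo's actual base class), constructor arguments, and return convention are assumptions:

```python
import torch
import torch.nn as nn

class MySeqModel(nn.Module):
    """Toy GRU-based sequence encoder following the two-method pattern."""

    def __init__(self, item_num: int, emb_size: int = 64, hidden_size: int = 64):
        super().__init__()
        self.item_emb = nn.Embedding(item_num, emb_size, padding_idx=0)
        self.emb_size, self.hidden_size = emb_size, hidden_size
        self._define_model_layers()

    def _define_model_layers(self):
        # Define the sequence-encoding layers of the new model here.
        self.gru = nn.GRU(self.emb_size, self.hidden_size, batch_first=True)

    def forward(self, item_seq: torch.Tensor, item_seq_len: torch.Tensor):
        seq_emb = self.item_emb(item_seq)        # (batch, max_len, emb_size)
        output, _ = self.gru(seq_emb)            # (batch, max_len, hidden_size)
        # Take the hidden state at each sequence's last valid position.
        idx = (item_seq_len - 1).view(-1, 1, 1).expand(-1, 1, output.size(-1))
        return output.gather(dim=1, index=idx).squeeze(1)  # (batch, hidden_size)
```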
### Evaluator/valid.py

Includes basic ranking metrics: group_auc, ndcg, hit, and mean_mrr.
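As a rough reference for the single-positive setting used here, two of these metrics can be sketched as below (hit and ndcg; `Evaluator/valid.py` is the authoritative implementation):

```python
import numpy as np

def hit_at_k(rank: int, k: int) -> float:
    """1.0 if the positive item is ranked within the top-k, else 0.0."""
    return 1.0 if rank < k else 0.0

def ndcg_at_k(rank: int, k: int) -> float:
    """With a single positive item, NDCG reduces to 1 / log2(rank + 2)."""
    return 1.0 / np.log2(rank + 2) if rank < k else 0.0

# Example: the positive item is ranked 3rd (0-based rank 2) among candidates.
print(hit_at_k(2, 10))   # 1.0
print(ndcg_at_k(2, 10))  # 0.5
```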
Any scientific publication that uses our code or datasets should cite the following paper as the reference:

```
@inproceedings{Fan-SIGIR-2022,
  title     = {Ada-Ranker: A Data Distribution Adaptive Ranking Paradigm for Sequential Recommendation},
  author    = {Xinyan Fan and Jianxun Lian and Wayne Xin Zhao and Zheng Liu and Chaozhuo Li and Xing Xie},
  booktitle = {{SIGIR} '22: The 45th International {ACM} {SIGIR} Conference on Research and Development in Information Retrieval, Madrid, Spain, July 11--15, 2022},
  year      = {2022},
  publisher = {{ACM}},
  doi       = {10.1145/3477495.3531931}
}
```
If you have any questions about our paper or code, please send an email to xinyanruc@126.com.