RL4RS is a real-world deep reinforcement learning recommender system dataset for practitioners and researchers.
```python
import gym
from rl4rs.env.slate import SlateRecEnv, SlateState

# build the slate recommendation simulator and wrap it as a gym environment
sim = SlateRecEnv(config, state_cls=SlateState)
env = gym.make('SlateRecEnv-v0', recsim=sim)

for i in range(epoch):
    obs = env.reset()
    for j in range(config["max_steps"]):
        # replay the logged (offline) action at each step
        action = env.offline_action
        next_obs, reward, done, info = env.step(action)
        if done[0]:
            break
```
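The snippet above assumes a `config` dict and an `epoch` count defined elsewhere (see the configuration used by the scripts under `script/` and `reproductions/`). Below is a minimal, purely illustrative sketch of how they might look; apart from `max_steps`, which the loop reads, the keys and values are assumptions rather than the repository's canonical settings.

```python
# Illustrative only: `max_steps` is the one key the rollout loop above relies on;
# the remaining keys are placeholders for whatever SlateRecEnv actually expects
# (see the config files shipped with the repository for the real settings).
epoch = 2                     # number of passes over the logged data in this sketch
config = {
    "max_steps": 9,           # steps per episode read by the loop above (assumed value)
    # "sample_file": "...",   # path to a rl4rs_dataset_*_shuf.csv slice (assumed key)
    # "batch_size": 64,       # assumed key
}
```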
Dataset Download (data only): https://zenodo.org/record/6622390#.YqBBpRNBxQK
Dataset Download (for reproduction): https://drive.google.com/file/d/1YbPtPyYrMvMGOuqD4oHvK0epDtEhEb9v/view?usp=sharing
Paper: https://arxiv.org/pdf/2110.11073.pdf
Appendix: https://github.com/fuxiAIlab/RL4RS/blob/main/RL4RS_appendix.pdf
Kaggle Competition (old version): https://www.kaggle.com/c/bigdata2021-rl-recsys/overview
Resource Page: https://fuxi-up-research.gitbook.io/fuxi-up-challenges/
Tutorial: https://github.com/fuxiAIlab/RL4RS/blob/main/tutorial.ipynb
04/20/2023: RL4RS is accepted by the SIGIR 2023 Resource Track.
09/02/2022: We release RL4RS v1.1.0: 1) two additional RS datasets for comparison, Last.fm and CIKMCup2016; 2) two additional model-free baselines, TD3 and Rainbow, plus two additional model-based batch RL baselines, MOPO (Model-based Offline Policy Optimization) and COMBO (Conservative Offline Model-Based Policy Optimization); 3) BCQ and CQL now support continuous action spaces.
09/17/2022: A hands-on invited talk at the DRL4IR Workshop, SIGIR 2022.
12/17/2021: Hosting the IEEE BigData 2021 Cup Challenges: Track I for supervised learning and Track II for reinforcement learning.
RL4RS supports Linux and requires at least 64 GB of memory.
```bash
$ git clone https://github.com/fuxiAIlab/RL4RS
$ export PYTHONPATH=$PYTHONPATH:`pwd`/rl4rs
$ conda env create -f environment.yml
$ conda activate rl4rs
```
Dataset Download: https://drive.google.com/file/d/1YbPtPyYrMvMGOuqD4oHvK0epDtEhEb9v/view?usp=sharing
```
.
|-- batchrl
|   |-- BCQ_SeqSlateRecEnv-v0_b_all.h5
|   |-- BCQ_SlateRecEnv-v0_a_all.h5
|   |-- BC_SeqSlateRecEnv-v0_b_all.h5
|   |-- BC_SlateRecEnv-v0_a_all.h5
|   |-- CQL_SeqSlateRecEnv-v0_b_all.h5
|   `-- CQL_SlateRecEnv-v0_a_all.h5
|-- data_understanding_tool
|   |-- dataset
|   |   |-- ml-25m.zip
|   |   `-- yoochoose-clicks.dat.zip
|   `-- finetuned
|       |-- movielens.csv
|       |-- movielens.h5
|       |-- recsys15.csv
|       |-- recsys15.h5
|       |-- rl4rs.csv
|       `-- rl4rs.h5
|-- exactk
|   |-- exact_k.ckpt.10000.data-00000-of-00001
|   |-- exact_k.ckpt.10000.index
|   `-- exact_k.ckpt.10000.meta
|-- ope
|   `-- logged_policy.h5
|-- raw_data
|   |-- item_info.csv
|   |-- rl4rs_dataset_a_rl.csv
|   |-- rl4rs_dataset_a_sl.csv
|   |-- rl4rs_dataset_b_rl.csv
|   `-- rl4rs_dataset_b_sl.csv
`-- simulator
    |-- finetuned
    |   |-- simulator_a_dien
    |   |   |-- checkpoint
    |   |   |-- model.data-00000-of-00001
    |   |   |-- model.index
    |   |   `-- model.meta
    |   `-- simulator_b2_dien
    |       |-- checkpoint
    |       |-- model.data-00000-of-00001
    |       |-- model.index
    |       `-- model.meta
    |-- rl4rs_dataset_a_shuf.csv
    `-- rl4rs_dataset_b3_shuf.csv
```
```bash
# move simulator/*.csv to rl4rs/dataset
# move simulator/finetuned/* to rl4rs/output
cd reproductions/
# run exact-k
bash run_exact_k.sh
# start http-based Env, then run RLlib library
nohup python -u rl4rs/server/gymHttpServer.py &
bash run_modelfree_rl.sh DQN/PPO/DDPG/PG/PG_conti/etc.
```
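For the model-free baselines, the scripts above start the HTTP-based environment server and then launch RLlib via `run_modelfree_rl.sh`. As a rough illustration of what that entails, the sketch below instead registers the simulator with RLlib in-process and trains PPO on it. It assumes Ray/RLlib 1.x (where `PPOTrainer` still exists), reuses the `config` dict assumed in the quickstart snippet, and is not the exact wiring used by the repository's scripts; the env name `rl4rs_slate` is also only used here.

```python
# Minimal in-process sketch (assumes ray/rllib 1.x; the repo itself drives RLlib
# through the HTTP env server started by gymHttpServer.py).
import gym
import ray
from ray import tune
from ray.rllib.agents.ppo import PPOTrainer
from rl4rs.env.slate import SlateRecEnv, SlateState

# `config` is the same rl4rs config dict assumed in the quickstart above.
tune.register_env(
    "rl4rs_slate",
    lambda _: gym.make("SlateRecEnv-v0",
                       recsim=SlateRecEnv(config, state_cls=SlateState)),
)

ray.init()
trainer = PPOTrainer(env="rl4rs_slate", config={"num_workers": 0, "framework": "tf"})
for _ in range(5):
    result = trainer.train()
    print(result["episode_reward_mean"])
```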
```bash
cd reproductions/
# first step: generate tfrecords for supervised learning (environment simulation).
# this step is time-consuming; you can comment it out at first if you only need the raw splits.
bash run_split.sh

# environment simulation part (needs the tfrecords)
# run these scripts to compare different SL methods
bash run_supervised_item.sh dnn/widedeep/dien/lstm
bash run_supervised_slate.sh dnn_slate/adversarial_slate/etc.
# or you can directly train the DIEN-based simulator used as the RL Env
bash run_simulator_train.sh dien

# model-free part (needs run_simulator_train.sh)
# run exact-k
bash run_exact_k.sh
# start http-based Env, then run RLlib library
nohup python -u rl4rs/server/gymHttpServer.py &
bash run_modelfree_rl.sh DQN/PPO/DDPG/PG/PG_conti/etc.

# offline RL part (needs run_simulator_train.sh)
# first generate the offline dataset for offline RL (dataset_generate stage),
# then train the offline RL algorithms (train stage)
bash run_batch_rl.sh BC/BCQ/CQL
```
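`run_batch_rl.sh` wraps the offline RL baselines, which build on d3rlpy (see the d3rlpy link below). For orientation, here is a minimal, self-contained d3rlpy sketch of the same idea, training discrete CQL on a toy logged dataset; the array shapes and hyperparameters are illustrative assumptions, not the settings used by the repository's scripts.

```python
# Illustrative d3rlpy (v1.x API) sketch of offline RL on a logged dataset.
# The synthetic arrays below stand in for the transitions that the repo's
# dataset_generate stage would export from the RL4RS logs.
import numpy as np
from d3rlpy.dataset import MDPDataset
from d3rlpy.algos import DiscreteCQL

n = 1000
observations = np.random.random((n, 32)).astype(np.float32)  # toy state features
actions = np.random.randint(0, 10, size=n)                    # toy discrete item ids
rewards = np.random.random(n).astype(np.float32)
terminals = np.zeros(n)
terminals[np.arange(9, n, 10)] = 1.0                          # end an episode every 10 steps

dataset = MDPDataset(observations, actions, rewards, terminals)

cql = DiscreteCQL(use_gpu=False)
cql.fit(dataset, n_epochs=1)
```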
algorithm | category | support mode
---|---|---
Wide&Deep | supervised learning | item-wise classification / slate-wise classification / item ranking
GRU4Rec | supervised learning | item-wise classification / slate-wise classification / item ranking
DIEN | supervised learning | item-wise classification / slate-wise classification / item ranking
Adversarial User Model | supervised learning | item-wise classification / slate-wise classification / item ranking
Exact-K | model-free RL | discrete env & hidden state as observation
Policy Gradient (PG) | model-free RL | discrete/continuous env & raw feature/hidden state as observation
Deep Q-Network (DQN) | model-free RL | discrete env & raw feature/hidden state as observation
Deep Deterministic Policy Gradients (DDPG) | model-free RL | continuous env & raw feature/hidden state as observation
Advantage Actor-Critic (A2C) | model-free RL | discrete/continuous env & raw feature/hidden state as observation
Proximal Policy Optimization (PPO) | model-free RL | discrete/continuous env & raw feature/hidden state as observation
Behavior Cloning | supervised learning / offline RL | discrete env & hidden state as observation
Batch Constrained Q-learning (BCQ) | offline RL | discrete env & hidden state as observation
Conservative Q-Learning (CQL) | offline RL | discrete env & hidden state as observation
algorithm | discrete control | continuous control | offline RL? |
---|---|---|---|
Behavior Cloning (supervised learning) | :white_check_mark: | :white_check_mark: | |
Deep Q-Network (DQN) | :white_check_mark: | :no_entry: | |
Double DQN | :white_check_mark: | :no_entry: | |
Rainbow | :white_check_mark: | :no_entry: | |
PPO | :white_check_mark: | :white_check_mark: | |
A2C / A3C | :white_check_mark: | :white_check_mark: | |
IMPALA | :white_check_mark: | :white_check_mark: | |
Deep Deterministic Policy Gradients (DDPG) | :no_entry: | :white_check_mark: | |
Twin Delayed Deep Deterministic Policy Gradients (TD3) | :no_entry: | :white_check_mark: | |
Soft Actor-Critic (SAC) | :white_check_mark: | :white_check_mark: | |
Batch Constrained Q-learning (BCQ) | :white_check_mark: | :white_check_mark: | :white_check_mark: |
Bootstrapping Error Accumulation Reduction (BEAR) | :no_entry: | :white_check_mark: | :white_check_mark: |
Advantage-Weighted Regression (AWR) | :white_check_mark: | :white_check_mark: | :white_check_mark: |
Conservative Q-Learning (CQL) | :white_check_mark: | :white_check_mark: | :white_check_mark: |
Advantage Weighted Actor-Critic (AWAC) | :no_entry: | :white_check_mark: | :white_check_mark: |
Critic Regularized Regression (CRR) | :no_entry: | :white_check_mark: | :white_check_mark: |
Policy in Latent Action Space (PLAS) | :no_entry: | :white_check_mark: | :white_check_mark: |
TD3+BC | :no_entry: | :white_check_mark: | :white_check_mark: |
See script/ and reproductions/.
RLlib examples: https://docs.ray.io/en/latest/rllib-examples.html
d3rlpy examples: https://d3rlpy.readthedocs.io/en/v1.0.0/
See reproductions/.
```bash
bash run_xx.sh ${param}
```
experiment in the paper | shell script | optional param. | description |
---|---|---|---|
Sec.3 | run_split.sh | - | dataset split/shuffle/align(for datasetB)/to tfrecord |
Sec.4 | run_mdp_checker.sh | recsys15/movielens/rl4rs | unzip ml-25m.zip and yoochoose-clicks.dat.zip into dataset/ |
Sec.5.1 | run_supervised_item.sh | dnn/widedeep/lstm/dien | Table 5. Item-wise classification |
Sec.5.1 | run_supervised_slate.sh | dnn_slate/widedeep_slate/lstm_slate/dien_slate/adversarial_slate | Table 5. Item-wise rank |
Sec.5.1 | run_supervised_slate.sh | dnn_slate_multiclass/widedeep_slate_multiclass/lstm_slate_multiclass/dien_slate_multiclass | Table 5. Slate-wise classification |
Sec.5.1 & Sec.6 | run_simulator_train.sh | dien | dien-based simulator for different trainsets |
Sec.5.1 & Sec.6 | run_simulator_eval.sh | dien | Table 6. |
Sec.5.1 & Sec.6 | run_modelfree_rl.sh | PG/DQN/A2C/PPO/IMPALA/DDPG/*_conti | Table 7. |
Sec.5.2 & Sec.6 | run_batch_rl.sh | BC/BCQ/CQL | Table 8. |
Sec.5.1 | run_exact_k.sh | - | Exact-k |
- | run_simulator_env_test.sh | - | examining the consistency of features (observations) between RL env and supervised simulator |
Any kind of contribution to RL4RS would be highly appreciated! Please contact us by email.
Channel | Link |
---|---|
Materials | Google Drive |
Issues | GitHub Issues |
Fuxi Team | Fuxi HomePage |
Our Team | Open-project |
```
@article{2021RL4RS,
  title={RL4RS: A Real-World Benchmark for Reinforcement Learning based Recommender System},
  author={Kai Wang and Zhene Zou and Yue Shang and Qilin Deng and Minghao Zhao and Runze Wu and Xudong Shen and Tangjie Lyu and Changjie Fan},
  journal={ArXiv},
  year={2021},
  volume={abs/2110.11073}
}
```