Quick-start for ranking with Merlin Models

gabrielspmoreira commented 1 year ago

This PR is a port of the #988 PR that was originally in models repo, then ported here as the quick-start involves different Merlin libraries: NVTabular, models, and systems

Fixes #916 , fixes #986 , fixes #918, fixes #680, fixes #681, fixes #666

Goals :soccer:

This PR introduces a quick-start example for preprocessing, training, evaluating and deploying ranking models. It is composed by a set of scripts and markdown documents. We use in the example the TenRec dataset, but the scripts are generic and can be used with customer own data, provided that they have the right shape: positive and potentially negative user-item events with tabular features.

Implementation Details :construction:

preprocessing.py - Generic script for preprocessing with CLI arguments for preprocessing a raw dataset (CSV or parquet) with NVTabular. It contains arguments to configure input path and format, categorical and continuous features, configuring the features tagging (user_id, item_id, ...), to filter interactions by using min/max frequency for users or items and dataset split.
Example command line for TenRec dataset:

python preprocessing.py --input_data_format=csv --csv_na_values=\\N --input_data_path /data/QK-video.csv --output_path=$OUT_DATASET_PATH --categorical_features=user_id,item_id,video_category,gender,age --binary_classif_targets=click,follow,like,share --regression_targets=watching_times --to_int32=user_id,item_id --to_int16=watching_times --to_int8=gender,age,video_category,click,follow,like,share --user_id_feature=user_id --item_id_feature=item_id --min_user_freq 5 --persist_intermediate_files --dataset_split_strategy=random --random_split_eval_perc=0.2

ranking_train_eval.py - Generic script for training and evaluation of ranking models. It takes the preprocessed dataset from preprocessing.py and schema as input. You can set many different training and model hparams for train both single-task learning (MLP, DCN, DLRM, Wide&Deep, DeepFM) and multi-task learning specific models (e.g. MMOE, CGC, PLE).

python  ranking_train_eval.py --train_path $OUT_DATASET_PATH/final_dataset/train --eval_path $OUT_DATASET_PATH/final_dataset/eval --output_path ./outputs/ --tasks=click --stl_positive_class_weight 4 --model dlrm --embeddings_dim 64 --l2_reg 1e-5 --embeddings_l2_reg 1e-6 --dropout 0.05 --mlp_layers 64,32  --lr 1e-4 --lr_decay_rate 0.99 --lr_decay_steps 100 --train_batch_size 4096 --eval_batch_size 4096 --epochs 1 --train_steps_per_epoch 10

Testing Details :mag:

The preprocessing and training ranking scripts are going to be added as integration tests.

Tasks

Implementation

[x] #916
[x] #917
[x] #918
[x] Refine preprocessing.py to provide additional dataset split strategies (e.g. random_by_user, temporal).
[x] Adapt preprocessing.py to use Dask Distributed client for preprocessing larger/full dataset (single or multiple GPU)

Experimentation

Documentation

[x] https://github.com/NVIDIA-Merlin/Merlin/issues/666 - Create a markdown document providing best practices on setting hyperparameters for ranking models based on the empirical results from our research experimentation (e.g. hparam optimization search space, best hparams found, comparison of the accuracy of STL and MTL models for each task)

You can check the Quick-start for ranking documentation starting from this main page

github-actions[bot] commented 1 year ago

Documentation preview

https://nvidia-merlin.github.io/Merlin/review/pr-915

rnyak commented 1 year ago

@gabrielspmoreira One think I think we can improve is the prediction step. I tested the script you shared with me for prediction but it retrains the model.. but is there a prediction script that user can feed the saved model path and then do the batch predict automatically without training again? It'd be better if we can provide an example code snippet how one can do the prediction.

gabrielspmoreira commented 1 year ago

@gabrielspmoreira One think I think we can improve is the prediction step. I tested the script you shared with me for prediction but it retrains the model.. but is there a prediction script that user can feed the saved model path and then do the batch predict automatically without training again? It'd be better if we can provide an example code snippet how one can do the prediction.

Indeed. Following your suggestion, I made it possible to save the trained model with --save_model_path, then run the script again providing --load_model_path, in this case not providing train_data_path but just --predict_data_path, so that the script loads the trained model and just perform the batch predict, saving them to --predict_output_path

NVIDIA-Merlin / Merlin