NVIDIA Merlin is an open source library providing end-to-end GPU-accelerated recommender systems, from feature engineering and preprocessing to training deep learning models and running inference in production.
Apache License 2.0
Creating integration tests for quick-start for ranking #1015
Closes #667
This PR adds integration tests for the quick-start for ranking scripts. The tests cover preprocessing the TenRec dataset with different options and training ranking models on the preprocessed data.
Preprocessing tests
Checks basic preprocessing with target encoding, verifying proper tagging, dtypes, number of rows and max values
Tests the available data split strategies: random, random_by_user, temporal
Tests the available filtering strategies: query string and min/max frequency for users and items
Tests frequency capping
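To make the three split strategy names concrete, here is a minimal standard-library sketch of what each one could mean. It is an illustration under assumptions, not the quick-start implementation: the function `split_rows`, the row keys `user_id` and `ts`, and the parameters `eval_ratio` and `ts_cutoff` are all hypothetical, and the actual preprocessing script may define the strategies differently.

```python
import random
from collections import defaultdict


def split_rows(rows, strategy, eval_ratio=0.2, seed=42, ts_cutoff=None):
    """Split interaction rows into (train, eval) lists.

    rows: list of dicts with hypothetical keys "user_id" and "ts".
    The semantics below are assumptions made for illustration only.
    """
    rng = random.Random(seed)

    if strategy == "random":
        # Shuffle all rows and hold out a fixed fraction for evaluation.
        shuffled = list(rows)
        rng.shuffle(shuffled)
        n_eval = int(len(shuffled) * eval_ratio)
        return shuffled[n_eval:], shuffled[:n_eval]

    if strategy == "random_by_user":
        # Hold out a fraction of each user's rows, so every user
        # with enough interactions appears in both sets.
        by_user = defaultdict(list)
        for row in rows:
            by_user[row["user_id"]].append(row)
        train, evaluation = [], []
        for user_rows in by_user.values():
            rng.shuffle(user_rows)
            n_eval = int(len(user_rows) * eval_ratio)
            evaluation.extend(user_rows[:n_eval])
            train.extend(user_rows[n_eval:])
        return train, evaluation

    if strategy == "temporal":
        # Rows at or after the cutoff timestamp go to evaluation.
        if ts_cutoff is None:
            raise ValueError("temporal split needs ts_cutoff")
        train = [r for r in rows if r["ts"] < ts_cutoff]
        evaluation = [r for r in rows if r["ts"] >= ts_cutoff]
        return train, evaluation

    raise ValueError(f"unknown strategy: {strategy}")
```

A test for a split strategy can then assert on set sizes and, for the temporal case, that no evaluation row predates the cutoff.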
Model building, training and evaluation tests
Trains single-task learning models with the model-specific options: MLP, DLRM, DCN-v2, Wide&Deep, DeepFM
Trains multi-task learning models with the model-specific options: DLRM, MMOE, PLE
Data setup
These integration tests require a 10M-row sample of the TenRec dataset, which is available in this internal Google Drive (tenrec_ci.zip).
The data needs to be downloaded to the CI machine and uncompressed to /raid/data/tenrec_ci/, which is the standard location for our other CI datasets (e.g. /raid/data/lastfm/preprocessed).
P.S. If needed, the path to the TenRec sample data can be overridden with the CI_TENREC_DATA_PATH environment variable.
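As a rough sketch of how a test could resolve the data location described above, the snippet below reads CI_TENREC_DATA_PATH and falls back to the default CI path. The helper name `tenrec_data_path` and its `environ` parameter are hypothetical and used only for illustration; the tests in this PR may resolve the path differently.

```python
import os
from pathlib import Path

# Default CI location mentioned in the PR description.
DEFAULT_TENREC_PATH = "/raid/data/tenrec_ci/"


def tenrec_data_path(environ=None):
    """Return the TenRec sample directory.

    Uses CI_TENREC_DATA_PATH when it is set and non-empty,
    otherwise the default CI path. `environ` is a hypothetical
    hook that lets tests pass a plain dict instead of os.environ.
    """
    environ = os.environ if environ is None else environ
    return Path(environ.get("CI_TENREC_DATA_PATH") or DEFAULT_TENREC_PATH)
```

Accepting the environment as an argument keeps the helper easy to test without mutating the process environment.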