NVIDIA-Merlin / NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.

[FEA] Negative sampling for positive-only datasets #356

Open gabrielspmoreira opened 3 years ago

gabrielspmoreira commented 3 years ago

Motivation

Public datasets are generally provided with negative samples to make it easier to train and compare different algorithms. However, the most common industry use case is a dataset containing only the users' interactions (positive-only), since items the user might have seen but did not interact with are usually not logged. Most modern neural architectures need negative candidates for efficient training, because the catalogs of large-scale recommender systems contain millions of items, making it infeasible to score every item for every positive example.

Requirements

RQ01 - Be Available in both NVT Pre-processing and Data Loading

The candidate sampling should be primarily performed by the NVT Data Loader, so that different epochs can draw different negative samples for each positive sample. But it should also be available during pre-processing, for cases where you would like to persist fixed negative samples in order to compare training algorithms that might not use the NVT Data Loader.
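For illustration, a minimal sketch (plain NumPy, not the NVT Data Loader API; the function name and arguments are hypothetical) of drawing fresh uniform negatives at batch-construction time, so each epoch pairs the same positives with different negatives:

```python
import numpy as np

def add_uniform_negatives(positive_item_ids, n_items, num_negatives, rng):
    """Draw `num_negatives` random item ids for each positive in the batch.

    Because the draw happens when the batch is produced, repeated passes
    over the data (epochs) pair the same positives with different negatives.
    """
    negatives = rng.integers(0, n_items, size=(len(positive_item_ids), num_negatives))
    # Resample any accidental collision with the positive item.
    collisions = negatives == positive_item_ids[:, None]
    while collisions.any():
        negatives[collisions] = rng.integers(0, n_items, size=int(collisions.sum()))
        collisions = negatives == positive_item_ids[:, None]
    return negatives

rng = np.random.default_rng(42)
batch_positives = np.array([3, 7, 7, 1])
print(add_uniform_negatives(batch_positives, n_items=10, num_negatives=4, rng=rng))
```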

RQ02 - Feature Sets config

Provide a configuration of feature sets that maps a RecSys taxonomy onto the important features. That configuration will be used during NVT pre-processing, and should be persisted so that it is also available to the NVT Data Loaders and to custom training/eval scripts. The minimum feature sets needed for NVT-managed candidate sampling and temporal dataset splits are:
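As a rough illustration, such a configuration could be a persisted mapping from RecSys roles to dataset columns (the names below are hypothetical, not an NVTabular API):

```python
# Hypothetical feature-set configuration; it tags the columns that candidate
# sampling and temporal splitting need, and would be persisted alongside the
# workflow so data loaders and custom scripts can read it back.
feature_sets = {
    "user_id": "user_id",            # who interacted
    "item_id": "item_id",            # what was interacted with (sampling target)
    "timestamp": "event_timestamp",  # when (needed for temporal split / availability)
    "item_features": ["category", "brand"],  # optional item metadata
}
```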

RQ03 - Recommendable items set

Provide the following methods to form the recommendable items set, composed of items that were available to users at a given point in time, to be considered valid negative samples:
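For example, one such method could be a simple recency-window heuristic (a sketch, assuming a pandas-style interactions frame with `item_id` and `timestamp` columns; `recommendable_items` is a hypothetical name, and cuDF mirrors this pandas API):

```python
import pandas as pd

def recommendable_items(interactions, as_of, window="7D"):
    """Treat items seen in the `window` before `as_of` as available,
    so stale catalog items are not drawn as negatives."""
    as_of = pd.Timestamp(as_of)
    start = as_of - pd.Timedelta(window)
    mask = (interactions["timestamp"] >= start) & (interactions["timestamp"] < as_of)
    return interactions.loc[mask, "item_id"].unique()
```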

RQ04 - Sampling methods

Provide the following methods for negative sampling from the recommendable items set:

References: Doc - NVTabular - Requirements on pre-processing for session-based recommendation and candidate sampling

gabrielspmoreira commented 3 years ago

As a side note, I recently read the paper "Are We Evaluating Rigorously? Benchmarking Recommendation for Reproducible Evaluation and Fair Comparison" from RecSys 2020, where the authors perform a rigorous evaluation of many algorithms, datasets, preprocessing strategies, loss functions and negative sampling strategies. In Section 3.3, they show that uniform sampling, although simple, usually produced models with better accuracy than popularity-based negative sampling.

It is important to note that their models use only user and item ids (collaborative filtering). But for models leveraging additional features (e.g. item popularity, target encoding of the item id), such features could leak which items are the positives (usually popular items) and which are the negatives (usually unpopular items if uniformly sampled).

Thus, it is also important to provide popularity-based negative sampling in this feature too, and perhaps a setting to control the percentage of negative items sampled from the uniform and from the popularity distribution, as in this paper (Section 4.1).
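One way such a setting could work (a hypothetical sketch, not an NVTabular API) is to split the requested negatives between the two distributions according to a `pop_fraction` parameter:

```python
import numpy as np

def mixed_negatives(item_counts, n_samples, pop_fraction=0.5, rng=None):
    """Draw a `pop_fraction` share of negatives from the empirical popularity
    distribution and the remainder uniformly over the catalog."""
    rng = rng or np.random.default_rng()
    n_items = len(item_counts)
    n_pop = int(round(n_samples * pop_fraction))
    pop_probs = np.asarray(item_counts, dtype=np.float64)
    pop_probs /= pop_probs.sum()
    from_popularity = rng.choice(n_items, size=n_pop, p=pop_probs)
    from_uniform = rng.integers(0, n_items, size=n_samples - n_pop)
    return np.concatenate([from_popularity, from_uniform])
```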

gabrielspmoreira commented 3 years ago

I have implemented an example of sampling with cuDF, where you can set a continuous parameter that ranges between 0.0 (uniform sampling) and 1.0 (popularity sampling). This gives the user more flexibility, and the parameter could be tuned as a hyperparameter in the training pipeline.
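A sketch of one way such an interpolation could work (CuPy stand-in, since the cuDF code is not shown in this thread): raising item counts to the power `alpha` yields uniform sampling at alpha=0.0 and popularity-proportional sampling at alpha=1.0.

```python
import cupy as cp

def sample_negatives(item_counts, n_samples, alpha=0.0):
    """alpha=0.0 -> uniform over items; alpha=1.0 -> proportional to
    popularity; intermediate values interpolate via counts ** alpha."""
    probs = cp.asarray(item_counts, dtype=cp.float64) ** alpha
    probs /= probs.sum()
    return cp.random.choice(len(item_counts), size=n_samples, replace=True, p=probs)

counts = cp.array([1000, 100, 10, 1])  # interaction counts per item id
print(sample_negatives(counts, n_samples=8, alpha=0.5))
```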

karlhigley commented 3 years ago

Side note: I have a lot of question marks about negative sampling strategies, loss functions, and offline evaluation after reading "How Sensitive is Recommendation Systems’ Offline Evaluation to Popularity?" Although the paper is framed as being about evaluation, I think it's also revealing about the impact of different sampling strategies (e.g. BPR vs. WARP) on popularity-related biases. This is an area I'd love to explore and understand better.