Add Encoder & Predictor

marcromeyn commented 1 year ago

Goals :soccer:

This PR introduces the Encoder and Predictor classes, which add batch-prediction capabilities to the PyTorch backend.

Implementation Details :construction:

Encoder

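The Encoder is intended for tasks such as embedding extraction: it runs a model, or a selected part of it, over a dataset in batches and returns the outputs as a DataFrame.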

>>> dataset = Dataset(...)
>>> model = mm.TwoTowerModel(dataset.schema)
# `selection=Tags.USER` ensures that only the sub-module(s) of the model
# that process features tagged as user are used during encoding.
# It also filters out all input features that aren't tagged as user.
>>> user_encoder = Encoder(model[0], selection=Tags.USER)
# `index` sets the column used as the index of the resulting DataFrame.
# With unique=True (the default), rows with a duplicate index value
# are dropped from the DataFrame, leaving only the first occurrence
# of each.
>>> user_embs = user_encoder(dataset, batch_size=128, index=Tags.USER_ID)
>>> print(user_embs.compute())
user_id    0         1         2    ...   37        38        39        40
0       ...  0.1231     0.4132    0.5123  ...  0.9132    0.8123    0.1123
1       ...  0.1521     0.5123    0.6312  ...  0.7321    0.6123    0.2213
...     ...  ...        ...       ...     ...  ...       ...       ...
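
The effect of `unique=True` described in the comments above can be illustrated with plain pandas. This is only a sketch of the "keep the first row per index value" behaviour, not the Encoder's actual implementation, which this PR description does not show. The `embs` frame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical encoder output: one row per input row, so a user that
# appears several times in the dataset produces several rows with the
# same index value.
embs = pd.DataFrame(
    {"0": [0.1, 0.1, 0.3], "1": [0.2, 0.2, 0.4]},
    index=pd.Index([7, 7, 9], name="user_id"),
)

# Keep only the first row for each index value.
unique_embs = embs[~embs.index.duplicated(keep="first")]
print(unique_embs)
```

Deduplicating on the index rather than on the full row means two rows for the same user are collapsed even if their embedding values differ slightly.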

Predictor

The Predictor class, by contrast, returns both the original input data and the corresponding predictions in the output DataFrame.

>>> dataset = Dataset(...)
>>> model = mm.TwoTowerModel(dataset.schema)
>>> predictor = Predictor(model)
>>> predictions = predictor(dataset, batch_size=128)
>>> print(predictions.compute())
user_id  user_age  item_id  item_category  click  click_prediction
0        24        101      1             1      0.6312
1        35        102      2             0      0.7321
...      ...       ...      ...           ...    ...
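
The output shape above (input columns plus a prediction column) can be sketched without the Merlin classes. The helper `predict_in_batches` and the scoring function `toy_model` below are hypothetical names used only for illustration; they are not part of this PR, and the real Predictor operates on a Merlin `Dataset` and a PyTorch model.

```python
import numpy as np
import pandas as pd


def predict_in_batches(df, predict_fn, batch_size, output_name):
    """Run predict_fn over df in row batches and append the results as a column."""
    outputs = []
    for start in range(0, len(df), batch_size):
        batch = df.iloc[start:start + batch_size]
        outputs.append(np.asarray(predict_fn(batch)))
    result = df.copy()  # leave the caller's frame untouched
    result[output_name] = (
        np.concatenate(outputs) if outputs else np.array([], dtype=float)
    )
    return result


def toy_model(batch):
    # Hypothetical stand-in for a trained model: a sigmoid over user_age.
    return 1.0 / (1.0 + np.exp(-(batch["user_age"].to_numpy() - 30) / 10.0))


inputs = pd.DataFrame(
    {"user_id": [0, 1, 2], "user_age": [24, 35, 30], "item_id": [101, 102, 103]}
)
predictions = predict_in_batches(
    inputs, toy_model, batch_size=2, output_name="click_prediction"
)
print(predictions)
```

Keeping the inputs next to the predictions makes the result easy to join or filter downstream, at the cost of a larger output than the Encoder's embeddings-only frame.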

github-actions[bot] commented 1 year ago

Documentation preview

https://nvidia-merlin.github.io/models/review/pr-1112

edknv commented 1 year ago

Some of the CPU tests seem to be failing (sample run) with errors related to Dask and/or DLPack, e.g.:

E           ValueError: Metadata inference failed in `encode_df`.
E           
E           You have supplied a custom function and Dask is unable to 
E           determine the type of output that that function returns. 
E           
E           To resolve this please provide a meta= keyword.
E           The docstring of the Dask function you ran should have more information.
E           
E           Original error is below:
E           ------------------------
E           BufferError('DLPack only supports signed/unsigned integers, float and complex dtypes.')
