NVIDIA-Merlin / NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.
Apache License 2.0

[FEA] Add window functions #740

Open karlhigley opened 3 years ago

benfred commented 3 years ago

We have first and last support already in v0.5 -

rjzamora commented 3 years ago

As in #734, the list.take method could possibly be used for a first-order solution if something beyond first/last is needed.

gabrielspmoreira commented 3 years ago

Indeed, we got the aggregation functions we needed for session-based recommendation in the closed issue #641, which introduced the nvt.ops.Groupby() op. It aggregates interactions by a column (e.g., session or user id), sorts them by another column (e.g., timestamp), and then provides either a "list" or the "first" or "last" element of that list.

gabrielspmoreira commented 3 years ago

For session-based recommendation, when the session id is not provided in the dataset, we use the idle time between user interactions to split the sessions (usually a maximum of 30 min between two consecutive interactions within a session). I understand that I could use ops.DifferenceLag() partitioned by user id to get the elapsed time between a user's interaction timestamps. But I am not sure how I could use this new "delta time" feature to assign the same session id to interactions whose delta time is below the threshold, or to split the sessions into lists as we do with nvt.ops.Groupby(). I don't know whether this use case fits an aggregation or a window function; if not, I can open a separate issue for it.
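For reference, one common way to do this split outside NVTabular is the "flag and cumulative sum" pattern. The sketch below uses plain pandas (not an NVTabular op); the column names, the toy data and the MAX_IDLE threshold are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy interactions; column names are assumptions for illustration.
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "timestamp": pd.to_datetime([
        "2021-01-01 10:00", "2021-01-01 10:10", "2021-01-01 11:00",
        "2021-01-01 09:00", "2021-01-01 09:45",
    ]),
    "item_id": [10, 11, 12, 20, 21],
})

# Assumed idle-time threshold that closes a session.
MAX_IDLE = pd.Timedelta(minutes=30)

df = df.sort_values(["user_id", "timestamp"]).reset_index(drop=True)

# Elapsed time since the same user's previous interaction (NaT for the first one).
delta = df.groupby("user_id")["timestamp"].diff()

# A session starts at a user's first row or after a gap above the threshold.
new_session = delta.isna() | (delta > MAX_IDLE)

# The running count of session starts is an id shared by all rows of a session.
df["session_id"] = new_session.cumsum()

# Collect each session's items into a list, similar in spirit to a "list" aggregation.
sessions = df.groupby("session_id")["item_id"].agg(list)
```

The generated session_id could then be used as the grouping column for a list aggregation; whether the same pattern can be expressed with existing NVTabular ops is the open question here.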

gabrielspmoreira commented 3 years ago

Regarding window functions (not specific to session-based recommendation), it is a common feature engineering practice to use lead and lag features for time series and recommender systems in general. We have the ops.DifferenceLag() op to compute the difference between the current and the previous value of a feature for a user. But it would be very useful to have Lag() and Lead() ops that return the actual "past" and "future" values of a given feature, partitioned by a column (e.g., user). In Pandas this is possible with shift(1) or shift(-1) partitioned by a column (e.g., user). This sliding window feature is a FEA on the cuDF repo, which is being addressed by this PR, so hopefully that will make it easier to integrate it into NVTabular.

As an example, KGMON used this feature in the Booking.com challenge to get, for each training row, the last 5 cities in a sequence (e.g., shift(5), shift(4), shift(3), shift(2), shift(1), partitioned by trip), as in this example on cuDF:

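def shift_feature(df, groupby_col, col, offset, nan=-1, colname=''):
    # Assumes df is already sorted so that rows of the same group are contiguous.
    df[colname] = df[col].shift(offset)
    # Rows whose shifted neighbor belongs to another group get the fill value.
    df.loc[df[groupby_col] != df[groupby_col].shift(offset), colname] = nan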

shift_feature(raw, 'utrip_id_', 'city_id_', 1, NUM_CITIES, f'city_id_lag{1}')
shift_feature(raw, 'utrip_id_', 'city_id_', 2, NUM_CITIES, f'city_id_lag{2}')
...

I have also used this feature with cuDF to remove consecutive repeated user interactions with the same item, as in the following example:

# Sort the dataframe by session and timestamp so that repetitions are adjacent
interactions_df = interactions_df.sort_values(['session_id', 'timestamp'])
interactions_df['item_id_past'] = interactions_df['item_id'].shift(1)
interactions_df['session_id_past'] = interactions_df['session_id'].shift(1)
# Keep only interactions that do not repeat the previous item within the same session
interactions_df = interactions_df[~((interactions_df['session_id'] == interactions_df['session_id_past']) &
                                    (interactions_df['item_id'] == interactions_df['item_id_past']))]

In both cases, we used a workaround on cuDF, in contrast to the shift() available in Pandas, which supports partitioning by a column, as in the example in this FEA.
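For comparison, a minimal sketch of the partitioned shift in plain pandas, which is roughly what Lag() and Lead() ops would compute. The column names, the toy data and the PAD value are assumptions for illustration; PAD must not collide with a real item id.

```python
import pandas as pd

# Hypothetical toy interactions; user 1 repeats item 10 twice in a row.
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "timestamp": [1, 2, 3, 1, 2],
    "item_id": [10, 10, 12, 20, 21],
})
df = df.sort_values(["user_id", "timestamp"]).reset_index(drop=True)

# Assumed padding value for rows that have no past or future value in their group.
PAD = -1

by_user = df.groupby("user_id")["item_id"]

# Lag: the previous item of the same user. Lead: the next item of the same user.
df["item_id_lag1"] = by_user.shift(1).fillna(PAD).astype("int64")
df["item_id_lead1"] = by_user.shift(-1).fillna(PAD).astype("int64")

# Drop consecutive repetitions: keep rows whose item differs from the user's previous one.
deduped = df[df["item_id"] != df["item_id_lag1"]]
```

Because the shift is computed per group, no extra comparison of group ids is needed to mask values that leak across users.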