NVIDIA-Merlin / NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.
Apache License 2.0

[FEA] Support multi-dimensional list columns (use case: session-based recs with multi-hot and embeddings features) #895

Open gabrielspmoreira opened 3 years ago

gabrielspmoreira commented 3 years ago

Is your feature request related to a problem? Please describe.

Currently NVTabular only supports 1-dimensional list columns. That is sufficient for lists of categorical values (e.g. multi-hot features) or numerical values (e.g. pre-trained embedding features).

For session-based or sequential recommendation, simple (non-list) features become 1D list features that represent the sequence of user interactions (e.g. item ids, product category, product price). By the same logic, 1D list features (e.g. multi-hot or embeddings) should become 2D list features, which NVTabular currently does not support.
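To make the shape change concrete, here is a minimal sketch in plain Python (no NVTabular calls). The interaction log, the feature names and the `group_by_session` helper are all hypothetical and only illustrate how grouping by session adds one list dimension to every feature.

```python
from collections import defaultdict

# Hypothetical interaction log: one row per (session, item) event.
# "categories" is a multi-hot feature, i.e. already a 1D list per interaction.
interactions = [
    {"session_id": 1, "item_id": 10, "price": 9.5, "categories": [3, 7]},
    {"session_id": 1, "item_id": 11, "price": 4.0, "categories": [2]},
    {"session_id": 2, "item_id": 12, "price": 1.5, "categories": [7, 8, 9]},
]


def group_by_session(rows):
    """Collect each feature into a per-session list, preserving event order."""
    sessions = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for name, value in row.items():
            if name != "session_id":
                sessions[row["session_id"]][name].append(value)
    return {sid: dict(feats) for sid, feats in sessions.items()}


sessions = group_by_session(interactions)

# Scalar features become 1D lists per session.
print(sessions[1]["item_id"])     # [10, 11]
# The multi-hot feature becomes a ragged 2D list per session.
print(sessions[1]["categories"])  # [[3, 7], [2]]
```

The second printed value is the kind of nested (list of lists) column this request is about.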

Describe the solution you'd like

NVTabular should support processing, saving and data loading of multi-dimensional (sparse) list columns, so that multi-hot and embedding features can be used for session-based / sequential recommendation. The parquet format already supports storing such multi-dimensional list columns, so it is not a limitation here.

Describe alternatives you've considered

In some cases, to be able to use pre-trained embeddings with NVTabular (as in the SIGIR eCom 2021 Data Challenge, which provides product description, product image and search query embeddings), I flattened the 2D (session interactions x embedding dim) features into a 1D vector, saved it to parquet with NVTabular, and reshaped it back to 2D on the model side. But as product description/image vectors are not available for all products, I have to replace null vectors with zero vectors of the same size, so that when the 2D vectors are reconstructed on the model side their positions are consistent with the other interaction features.
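The workaround can be sketched roughly as follows. This is plain Python rather than the actual challenge code: the embedding size, the sample values and the helper names `flatten_session` / `unflatten_session` are assumptions made up for illustration.

```python
EMBEDDING_DIM = 4  # assumed embedding size, for illustration only


def flatten_session(embeddings, dim=EMBEDDING_DIM):
    """Flatten one session's per-item embeddings into a single 1D list.

    Missing embeddings (None) are replaced by zero vectors so that every
    interaction occupies exactly `dim` slots in the flattened list.
    """
    flat = []
    for emb in embeddings:
        if emb is None:
            emb = [0.0] * dim
        if len(emb) != dim:
            raise ValueError(f"expected {dim} values, got {len(emb)}")
        flat.extend(emb)
    return flat


def unflatten_session(flat, dim=EMBEDDING_DIM):
    """Rebuild the 2D (interactions x dim) structure on the model side."""
    if len(flat) % dim != 0:
        raise ValueError("flattened length is not a multiple of the embedding dim")
    return [flat[i:i + dim] for i in range(0, len(flat), dim)]


# Hypothetical session with three interactions; the second item has no embedding.
session = [[0.1, 0.2, 0.3, 0.4], None, [0.5, 0.6, 0.7, 0.8]]

flat = flatten_session(session)    # 1D list that fits a regular list column
restored = unflatten_session(flat) # back to 3 rows of 4 values
```

Because the zero vector keeps its slot, `restored[1]` still lines up with the second entry of the other per-interaction features such as the item id sequence.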

rnyak commented 3 years ago

@gabrielspmoreira I saw that this PR was merged: https://github.com/NVIDIA/NVTabular/pull/911#event-4956796978

Would that be a temporary solution for this FEA?

gabrielspmoreira commented 3 years ago

Thanks for the pointer, Sara. PR #911 might be a temporary solution for such use cases. We need to test it with the PyTorch data loader, as the provided test uses the TF data loader.

mats-claassen commented 1 year ago

Hi, is this still not supported? I tried to run Transformers4Rec with multi-hot columns and got errors when trying to apply the workflow. When removing the multi-hot columns everything works fine.