[FEA]Support both list and non-list format for multi-hot feature input

NVIDIA-Merlin / NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.

Apache License 2.0

Is your feature request related to a problem? Please describe.
Currently, multi-hot features must be converted to list format before NVT can process them. For example, the raw "genres" column in the MovieLens data looks like "drama|comedy". I have to first convert it to ["drama", "comedy"] using cudf or pandas and then apply label encoding with NVT. As a result, users who want to label-encode a multi-hot feature have to process and save it with cudf or pandas first, then read it back in with NVT to do the encoding.
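For reference, the manual conversion step described above might look roughly like this in pandas. This is a minimal sketch: the DataFrame contents and the `movieId` column are illustrative stand-ins, not the actual MovieLens file layout, and cudf is not used here.

```python
import pandas as pd

# Toy stand-in for the MovieLens "genres" column (values are illustrative).
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "genres": ["drama|comedy", "action", "comedy|romance|drama"],
})

# Split each "|"-delimited string into a Python list, one entry per genre.
# A single-character separator is treated as a literal string, not a regex.
movies["genres"] = movies["genres"].str.split("|")

print(movies["genres"].tolist())
```

The resulting list column is what then has to be saved and re-read before the encoding step, which is the extra round trip this request is about.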

Describe the solution you'd like
It would be good if NVT could treat both lists and delimited strings as multi-hot input. For example, if the input is a string like "drama|comedy" or "drama,comedy", NVT could automatically treat it as multi-hot input, the same way it treats ["drama", "comedy"].
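To make the requested behavior concrete, here is a rough sketch of the kind of normalization I have in mind, written in plain pandas. The helper `to_multihot_lists` and its `separators` parameter are hypothetical names made up for illustration; this is not an existing NVT API, and a real implementation would presumably need an explicit separator option rather than guessing.

```python
import pandas as pd

def to_multihot_lists(column, separators=("|", ",")):
    """Hypothetical helper: return a Series whose values are all lists."""
    def normalize(value):
        # Already list-like: keep the items as they are.
        if isinstance(value, (list, tuple)):
            return list(value)
        # Delimited string: split on the first separator found in it.
        if isinstance(value, str):
            for sep in separators:
                if sep in value:
                    return [item.strip() for item in value.split(sep)]
            return [value.strip()]  # single-valued string
        return []  # assumption: missing values (None/NaN) become an empty list
    return column.apply(normalize)

# Mixed input: pipe-delimited, comma-delimited, already a list, single value.
genres = pd.Series(["drama|comedy", "drama,comedy", ["drama", "comedy"], "action"])
normalized = to_multihot_lists(genres)
print(normalized.tolist())
```

With something like this applied inside NVT, all three spellings of the same genres would reach the label encoder in the same list form.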

[FEA]Support both list and non-list format for multi-hot feature input #1022