NVIDIA-Merlin / NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.
Apache License 2.0

[FEA]Support both list and non-list format for multi-hot feature input #1022

Open jershi425 opened 3 years ago

jershi425 commented 3 years ago

Is your feature request related to a problem? Please describe.
Currently, multi-hot features have to be converted into list format before NVT can process them. For example, the raw format of "genres" in the MovieLens data is "drama|comedy". I have to first convert it to ["drama", "comedy"] using cudf or pandas and then apply label encoding using NVT. So users who want to label-encode a multi-hot feature have to preprocess and save the data with cudf or pandas, then read it back in with NVT to do the encoding.
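To make the two manual steps concrete, here is a minimal plain-Python sketch of what they amount to: splitting the raw string into a list, then label-encoding the list elements. The sample rows are hypothetical, and the id scheme (ids assigned in first-seen order, starting at 1) is an assumption for illustration only, not necessarily how NVT's Categorify assigns ids.

```python
# Hypothetical sample rows in the raw "genres" format described above.
raw_genres = ["drama|comedy", "comedy", "action|drama"]

# Step 1: split each delimited string into a list
# (the part currently done by hand with cudf or pandas).
genre_lists = [row.split("|") for row in raw_genres]

# Step 2: label-encode every distinct genre.
# Assumption: ids start at 1 and follow first-seen order.
vocab = {}
for genres in genre_lists:
    for genre in genres:
        vocab.setdefault(genre, len(vocab) + 1)

# Replace each genre with its integer id, keeping the list structure.
encoded = [[vocab[genre] for genre in genres] for genres in genre_lists]
print(encoded)
```

Each row stays a variable-length list after encoding, which is the multi-hot shape the encoding step expects.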

Describe the solution you'd like
It would be good if NVT could treat both lists and strings as multi-hot input. For example, if the input is a string like "drama|comedy" or "drama,comedy", NVT could automatically treat it as multi-hot input.
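As a rough illustration of the requested behavior, the sketch below normalizes either input form to a list. The helper name `to_multi_hot_list` and its `separators` parameter are hypothetical and are not part of NVT; this is plain Python, not an NVT implementation.

```python
import re


def to_multi_hot_list(value, separators="|,"):
    """Hypothetical helper: return a list of tokens for list or delimited-string input."""
    # Already in list form: pass it through unchanged (as a new list).
    if isinstance(value, (list, tuple)):
        return list(value)
    # Build a character class such as [\|,] that matches any one separator.
    pattern = "[" + re.escape(separators) + "]"
    # Split, trim whitespace, and drop empty tokens (e.g. from "" or "a||b").
    return [tok.strip() for tok in re.split(pattern, value) if tok.strip()]


print(to_multi_hot_list("drama|comedy"))
print(to_multi_hot_list("drama,comedy"))
print(to_multi_hot_list(["drama", "comedy"]))
```

One caveat with automatic detection: a single-valued category that legitimately contains a comma would be split too, so an explicit separator option is probably safer than guessing.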

benfred commented 3 years ago

You can do this with a LambdaOp in NVTabular. For example, to create a list column from strings like "drama|comedy" and then categorically encode it in NVT, you would write something like:

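import nvtabular as nvt
# Split "drama|comedy" into ["drama", "comedy"], then label-encode the list elements.
genres = nvt.ColumnGroup(["genres"]) >> (lambda col: col.str.split("|")) >> nvt.ops.Categorify()
workflow = nvt.Workflow(genres)
# `dataset` is assumed to be an existing nvt.Dataset with a "genres" column.
workflow.fit_transform(dataset)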