NVIDIA-Merlin / NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data, designed to quickly and easily manipulate terabyte-scale datasets used to train deep-learning-based recommender systems.
Apache License 2.0
1.05k stars, 143 forks

[FEA] Allow for saving index in `Dataset.to_parquet` #909

Open kshitizgupta21 opened 3 years ago

kshitizgupta21 commented 3 years ago

Is your feature request related to a problem? Please describe.
I need to set a particular column as the index after preprocessing a Dataset through a workflow, and I would like to save that index when using Dataset.to_parquet.

Describe the solution you'd like
Have Dataset.to_parquet support saving the index of the dataset.

Describe alternatives you've considered
I am currently using dask-cudf's to_parquet, which allows saving the index.
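
A minimal sketch of that workaround, assuming the Dataset is converted with `to_ddf()` and written through the Dask DataFrame API. `write_index` is the Dask `to_parquet` option; the helper name `save_with_index` is hypothetical and not part of NVTabular.

```python
def save_with_index(dataset, index_col, output_path):
    """Write a Dataset to Parquet with `index_col` stored as the index.

    `dataset` is assumed to be an nvtabular.Dataset; for this sketch,
    any object exposing to_ddf() behaves the same way.
    """
    # Drop down to the underlying dask / dask_cudf DataFrame.
    ddf = dataset.to_ddf()
    # set_index shuffles and sorts the whole collection by index_col,
    # so this step can be expensive on large datasets.
    ddf = ddf.set_index(index_col)
    # Bypass Dataset.to_parquet and let Dask persist the index.
    ddf.to_parquet(output_path, write_index=True)
    return ddf
```

A call such as `save_with_index(dataset, "user_id", "out/")` (column name and path are placeholders) would then stand in for the `Dataset.to_parquet` call.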

rjzamora commented 3 years ago

It should be fine to allow the user to specify an index column in to_parquet. However, NVTabular will not treat that column as an index if/when you use that path to initialize a new Dataset (it will be treated as a typical column). Is this sufficient?

kshitizgupta21 commented 3 years ago

It would be nice if NVTabular could mimic dask_cudf's behavior and treat that column as an index when that path is used to initialize a new Dataset. Otherwise, the user has to convert the Dataset to a Dask DataFrame through to_ddf(), set the index, and then convert it back to a Dataset for downstream preprocessing and usage. The user already has to do these conversions once to set the index; it would be nice not to have to do them a second time.
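
The round trip described here might look like the following sketch. The helper name `with_index` and the `make_dataset` parameter are hypothetical; by default the sketch assumes the `nvtabular.Dataset` constructor accepts a Dask DataFrame. Whether the index survives later NVTabular transformations is not guaranteed.

```python
def with_index(dataset, index_col, make_dataset=None):
    """Return a new Dataset whose underlying DataFrame is indexed by `index_col`.

    `make_dataset` defaults to nvtabular.Dataset; it is a parameter only so
    that the sketch can be exercised without a GPU environment.
    """
    if make_dataset is None:
        # Imported lazily because nvtabular needs a GPU-enabled install.
        import nvtabular as nvt
        make_dataset = nvt.Dataset

    # Step 1: Dataset -> Dask DataFrame.
    ddf = dataset.to_ddf()
    # Step 2: set the index (this sorts the data globally by index_col).
    ddf = ddf.set_index(index_col)
    # Step 3: wrap the indexed DataFrame back into a Dataset.
    return make_dataset(ddf)
```

If a Dataset is later rebuilt from a Parquet path rather than from the DataFrame, these three steps would have to be repeated, which is the second conversion the request aims to avoid.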

rjzamora commented 3 years ago

> It would be nice if NVTabular could mimic dask_cudf's behavior and treat that column as an index when that path is used to initialize a new Dataset. Otherwise, the user has to convert the Dataset to a Dask DataFrame through to_ddf(), set the index, and then convert it back to a Dataset for downstream preprocessing and usage. The user already has to do these conversions once to set the index; it would be nice not to have to do them a second time.

I'm not really following the argument here, so I suspect that I am misunderstanding the use case.

NVTabular does not make any promises about recognizing and/or preserving an index during transformations, so the user should not rely on an index. Not supporting global indexing was an explicit design choice in NVTabular, so we typically ignore the index under the hood. Therefore, I am not following why a user would bother moving to dask_cudf to set some column as the index unless they are trying to sort the global dataset by that column (in which case they would still need to do this in dask_cudf, even if the NVTabular Dataset API could set an index column).