Open kshitizgupta21 opened 3 years ago
It should be fine to allow the user to specify an index column in to_parquet. However, NVTabular will not treat that column as an index if/when you use that path to initialize a new Dataset (it will be treated as a typical column). Is this sufficient?
It would be nice if NVTabular could mimic dask_cudf's behavior and treat that column as an index when using that path to initialize a new Dataset. Otherwise, the user has to first convert the Dataset to a Dask DataFrame through to_ddf(), set the index, and then convert it back to a Dataset for downstream preprocessing and usage. The user already has to do these conversions when setting the index in the first place; it would be nice if they didn't have to do them a second time.
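For readers following along, the round trip described above might look roughly like the sketch below. It is not an NVTabular API: save_with_index is a hypothetical helper, the column name "user_id" and the paths are placeholders, and it only assumes that the dataset object exposes to_ddf() and that the resulting Dask collection supports set_index and to_parquet(..., write_index=True).

```python
def save_with_index(dataset, index_col, out_path):
    """Hypothetical helper: write `dataset` to parquet with `index_col` as the index.

    Assumes `dataset` exposes `to_ddf()` (as an NVTabular Dataset does) and
    that the returned Dask collection supports `set_index` and `to_parquet`.
    """
    ddf = dataset.to_ddf()                      # Dataset -> Dask DataFrame
    ddf = ddf.set_index(index_col)              # reorganizes the data by index_col
    ddf.to_parquet(out_path, write_index=True)  # Dask's writer stores the index
    return ddf


# Hypothetical usage (needs a GPU environment with NVTabular installed;
# the column name and paths are placeholders):
#   import nvtabular as nvt
#   processed = workflow.transform(nvt.Dataset("input/", engine="parquet"))
#   ddf = save_with_index(processed, "user_id", "output/")
#   reloaded = nvt.Dataset(ddf)  # back to a Dataset for downstream use
```

Note that, per the reply earlier in this thread, a Dataset initialized from the written path would still see that column as a typical column rather than as an index.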
I'm not really following the argument here, so I suspect that I am misunderstanding the use case.
NVTabular does not make any promises about recognizing and/or preserving an index during transformations, so the user should not be relying on an index. Not supporting global indexing was an explicit design choice in NVTabular, so we typically ignore the index under the hood. Therefore, I am not exactly following why a user would bother moving to dask_cudf to set some column as the index unless they are trying to sort the global dataset by that column (in which case they would still need to do this in dask_cudf, even if the NVTabular Dataset API could set an index column).
Is your feature request related to a problem? Please describe.
Need to set a particular column as the index after preprocessing a Dataset through a workflow and would like to save the index when using Dataset.to_parquet.

Describe the solution you'd like
Have Dataset.to_parquet support saving the index of the dataset.

Describe alternatives you've considered
Currently using dask-cudf's to_parquet, which allows for saving the index.