NVIDIA-Merlin / NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.
Apache License 2.0

[BUG] _file_list.txt is not written out in the validation output folder #524

rnyak closed this issue 3 years ago

rnyak commented 3 years ago

Describe the bug After we perform workflow.transform() for both the train and validation sets, _file_list.txt is generated in the training output folder, but it is missing from the validation output folder.

_file_list.txt is required for HugeCTR training. There should be one file under the train directory and another under the valid directory, and the file paths are provided in the JSON file:

https://github.com/NVIDIA/NVTabular/blob/main/examples/hugectr/dlrm_fp32_64k.json#L35

Expected behavior _file_list.txt should be written out for validation also.

Environment details (please complete the following information):
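A quick way to confirm the symptom is to check each output directory for the file after running the workflow. This is a minimal standalone sketch, not part of NVTabular: the helper name and the throwaway directory layout are hypothetical, chosen only to mimic the train/valid folders described above.

```python
from pathlib import Path

def dirs_missing_file_list(*output_dirs):
    """Return the output directories that lack a HugeCTR _file_list.txt."""
    return [d for d in output_dirs if not (Path(d) / "_file_list.txt").exists()]

# Demo against throwaway directories that mimic the reported layout.
import os
import tempfile

root = tempfile.mkdtemp()
train_dir = os.path.join(root, "train")
valid_dir = os.path.join(root, "valid")
os.makedirs(train_dir)
os.makedirs(valid_dir)

# The reported bug: only the train directory gets the file.
open(os.path.join(train_dir, "_file_list.txt"), "w").close()

missing = dirs_missing_file_list(train_dir, valid_dir)
```

With the bug present, `missing` contains only the validation directory; once the fix lands, it should be empty after transforming both datasets.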

benfred commented 3 years ago

This seems to be caused by the shuffling behaviour: when we are not shuffling, we default to the dask-cudf to_parquet functionality: https://github.com/NVIDIA/NVTabular/blob/09f6afc05ab8872c96af8e5d91634b53ac1077b2/nvtabular/io/dataset.py#L406-L408 - which doesn't write out the HugeCTR _file_list.txt file.

@rjzamora should we just always use the NVT _ddf_to_dataset code to write out parquet datasets? Is there any advantage to the dask_cudf.to_parquet code?

rjzamora commented 3 years ago

> @rjzamora should we just always use the NVT _ddf_to_dataset code to write out parquet datasets? Is there any advantage to the dask_cudf.to_parquet code?

It's probably fine to use out_files_per_proc=None. I don't recall any real advantages to using dask_cudf.to_parquet. I'm not sure if it is still the case, but this code path was not originally the default, even with shuffle=None, because it required the user to explicitly specify out_files_per_proc=None.

albert17 commented 3 years ago

@rnyak @rjzamora @benfred I get all the output files (file list, metadata, and parquet files) with the updated notebook: https://github.com/NVIDIA/NVTabular/blob/main/examples/hugectr/criteo-hugectr.ipynb

vinhngx commented 3 years ago

+1, I observed the same issue. After adding shuffle=nvt.io.Shuffle.PER_PARTITION I finally get the metadata files:

```python
proc.transform(train_dataset).to_parquet(
    output_path=output_train_dir,
    dtypes=dict_dtypes,
    shuffle=nvt.io.Shuffle.PER_PARTITION,
    cats=cat_feats.columns,
    conts=cont_feats.columns,
    labels=["target"],
)
```