Closed rnyak closed 3 years ago
This seems to be because of the shuffling behaviour - we default to using the dask-cudf `to_parquet` functionality when we are not shuffling: https://github.com/NVIDIA/NVTabular/blob/09f6afc05ab8872c96af8e5d91634b53ac1077b2/nvtabular/io/dataset.py#L406-L408 - which doesn't write out the HugeCTR `_file_list.txt` file.
@rjzamora should we just always use the NVT `_ddf_to_dataset` code to write out parquet datasets? Is there any advantage to the `dask_cudf.to_parquet` code?
It's probably fine to use `out_files_per_proc=None`. I don't recall if there are any real advantages to using `dask_cudf.to_parquet` - I'm not sure if it is still the case, but this code path was not originally the default, even with `shuffle=None`, because it required the user to explicitly specify `out_files_per_proc=None`.
@rnyak @rjzamora @benfred I get all the output files (file_list metadata, and parquet) with the updated Notebook: https://github.com/NVIDIA/NVTabular/blob/main/examples/hugectr/criteo-hugectr.ipynb
+1 I observed the same issue. Upon adding `shuffle=nvt.io.Shuffle.PER_PARTITION` I finally get the metadata files:
```python
proc.transform(train_dataset).to_parquet(
    output_path=output_train_dir,
    dtypes=dict_dtypes,
    shuffle=nvt.io.Shuffle.PER_PARTITION,
    cats=cat_feats.columns,
    conts=cont_feats.columns,
    labels=['target'],
)
```
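For cases where shuffling is not wanted, a possible stopgap is to generate the missing validation `_file_list.txt` by hand after `to_parquet` finishes. A minimal sketch (the helper name `write_file_list` is hypothetical, and the file format - a count line followed by one parquet path per line - is an assumption based on the HugeCTR file list referenced in this thread):

```python
import glob
import os

def write_file_list(data_dir, pattern="*.parquet"):
    """Write a HugeCTR-style _file_list.txt for the parquet files in data_dir."""
    files = sorted(glob.glob(os.path.join(data_dir, pattern)))
    # Assumed HugeCTR file-list convention: the first line holds the number
    # of data files, followed by one file path per line.
    with open(os.path.join(data_dir, "_file_list.txt"), "w") as f:
        f.write(f"{len(files)}\n")
        for path in files:
            f.write(f"{path}\n")
    return files
```

This only fills the gap for the file list itself; any other metadata NVTabular's writer produces would still be missing.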
**Describe the bug**
After we perform `workflow.transform()` for both the train and validation sets, we can see that `_file_list.txt` is generated in the training output folder, but it is missing in the validation output folder. `_file_list.txt` is required for HugeCTR training. There should be a file under the train directory and another under the valid directory, and the file path is provided in the json file: https://github.com/NVIDIA/NVTabular/blob/main/examples/hugectr/dlrm_fp32_64k.json#L35
**Expected behavior**
`_file_list.txt` should be written out for validation also.

**Environment details (please complete the following information):**