NVIDIA-Merlin / NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.

[QST] Compressed tsv data support in nvt.dataset #101

Open · lazykyama opened this issue 4 years ago

lazykyama commented 4 years ago

What is your question?

Simple question: does nvt.dataset support compressed TSV files, like the day_0.gz file from the Criteo dataset used in the example? Or must we decompress them in advance?

Right now, a UnicodeDecodeError is raised at the following line, because NVTabular tries to open the given file as plain text even though it is gzip-compressed: https://github.com/NVIDIA/NVTabular/blob/master/nvtabular/io.py#L182
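
A minimal sketch of the failure mode, assuming a local gzip-compressed file (the path `day_0.gz` here is just illustrative): opening the file in text mode chokes on the compressed bytes, while checking the two-byte gzip magic number first lets the caller pick the right opener.

```python
import gzip

path = "day_0.gz"  # illustrative path, not shipped with NVTabular

try:
    with open(path, "rt") as f:
        f.readline()  # fails: gzip bytes are not valid UTF-8 text
except UnicodeDecodeError as e:
    print("plain-text open failed:", e)

# Sniff the gzip magic number (0x1f 0x8b) to detect compression up front.
with open(path, "rb") as f:
    is_gzip = f.read(2) == b"\x1f\x8b"

opener = gzip.open if is_gzip else open
with opener(path, "rt") as f:
    print(f.readline())  # first line of the decompressed TSV
```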

EvenOldridge commented 4 years ago

We use the cuDF reader under the hood, and compression is supported there: https://docs.rapids.ai/api/cudf/stable/api.html?highlight=read_csv#cudf.io.csv.read_csv (see the delimiter and compression options)

We could expose these options, or even infer them automatically.
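
For reference, a short sketch of the cuDF options mentioned above: `read_csv` accepts `sep` for TSV data and `compression`, which defaults to `"infer"` and keys off the file extension. The path is illustrative.

```python
import cudf

# Explicit options: tab-separated, gzip-compressed, no header row.
df = cudf.read_csv("day_0.gz", sep="\t", compression="gzip", header=None)

# Or let cuDF infer the compression from the .gz extension.
df = cudf.read_csv("day_0.gz", sep="\t", header=None)
print(df.head())
```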

benfred commented 4 years ago

cuDF supports this, but we do a couple of things when inferring row sizes that break it. This should be relatively easy to support: add a unit test covering compressed datasets, and fix the few lines in the CSVFileReader that break by opening the file as gzip first.
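
A rough sketch of what "opening as gzip first" could look like for the row-size probe; the helper name `estimate_row_size` is hypothetical, not NVTabular's actual code.

```python
import gzip


def estimate_row_size(path, num_lines=10):
    """Average byte length of the first few rows, decompressing if needed."""
    # Hypothetical helper: detect gzip via its magic number, then sample rows.
    with open(path, "rb") as f:
        is_gzip = f.read(2) == b"\x1f\x8b"
    opener = gzip.open if is_gzip else open
    with opener(path, "rt") as f:
        lines = [f.readline() for _ in range(num_lines)]
    lines = [line for line in lines if line]
    return sum(len(line) for line in lines) / max(len(lines), 1)
```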

benfred commented 4 years ago

We're also using the byte_range feature to iterate over CSV files that don't fit into memory, and cuDF does not support byte_range with compressed files:

/home/ben/code/cudf/cpp/src/io/csv/reader_impl.cu:188: Reading compressed data using `byte range` is unsupported
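
A sketch of the iteration pattern in question: cuDF can read one slice of an uncompressed CSV given an `(offset, size)` byte range, but rejects the same call on compressed input, which is what the error above reports. The paths and the 256 MB chunk size are illustrative.

```python
import cudf

chunk_bytes = 256 * 1024 * 1024  # illustrative chunk size

# Works for an uncompressed file: read only one slice of it.
chunk = cudf.read_csv("day_0.tsv", sep="\t", header=None,
                      byte_range=(0, chunk_bytes))

# Fails for a compressed file: byte offsets into the .gz stream don't map
# to row boundaries, so cuDF rejects the combination (the error quoted above).
# cudf.read_csv("day_0.gz", sep="\t", header=None, compression="gzip",
#               byte_range=(0, chunk_bytes))
```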

Supporting gzipped CSV/TSV seems to mean that we'd need the entire CSV file to fit into GPU memory =(
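
One possible workaround, not NVTabular's implementation: decompress the file to disk on the host first, then iterate over the uncompressed file with byte_range as usual, trading disk space for GPU memory.

```python
import gzip
import shutil

# Decompress once on the host (paths are illustrative).
with gzip.open("day_0.gz", "rb") as src, open("day_0.tsv", "wb") as dst:
    shutil.copyfileobj(src, dst)
# "day_0.tsv" can now be read chunk by chunk with byte_range.
```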