lazykyama opened this issue 4 years ago
We use the cudf reader under the hood and it's supported there: https://docs.rapids.ai/api/cudf/stable/api.html?highlight=read_csv#cudf.io.csv.read_csv (delimiter + compression options)
We could expose these options, or even infer them automatically?
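A minimal sketch of what "infer them automatically" could look like, using only the stdlib. The helper name `infer_read_options` is hypothetical; the idea is to detect gzip from the file's magic bytes and guess the delimiter from the extension, then pass both through to cudf's `compression` and `delimiter` options.

```python
import gzip
import os

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def infer_read_options(path):
    """Hypothetical helper: guess compression and delimiter for a CSV/TSV.

    Compression is detected from the file's magic bytes rather than its
    extension; the delimiter is guessed from the extension (after stripping
    a trailing .gz). Both values could be forwarded to cudf.read_csv.
    """
    with open(path, "rb") as f:
        compression = "gzip" if f.read(2) == GZIP_MAGIC else None

    base = path[:-3] if path.endswith(".gz") else path
    ext = os.path.splitext(base)[1].lower()
    delimiter = "\t" if ext == ".tsv" else ","
    return {"compression": compression, "delimiter": delimiter}
```

Sniffing the magic bytes instead of trusting the `.gz` suffix means mislabeled files are still handled correctly.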
cudf supports this, but we are doing a couple of things when inferring the row sizes that break it. This should be relatively easy to support: add a unit test covering compressed datasets, and fix the few lines in the CSVFileReader that break, by opening the file as gzip first.
We're also using the byte_range feature to iterate over CSV files that don't fit into memory, and cudf doesn't support byte_range with compressed files:
/home/ben/code/cudf/cpp/src/io/csv/reader_impl.cu:188: Reading compressed data using `byte range` is unsupported
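For context, here is a rough stdlib sketch of what byte_range-style iteration relies on (the function name is made up, not cudf's API): seek to an offset and extend the chunk to the next newline so rows are never split. That random access is exactly what a gzip stream can't provide, since it must be decompressed from the start, hence the error above.

```python
import os


def iter_csv_byte_ranges(path, chunk_bytes):
    """Sketch of byte_range-style iteration over an *uncompressed* CSV.

    Each chunk starts where the previous one ended and is extended to the
    next newline, so no row straddles two chunks. Requires seekable input,
    which is why it can't work on a gzip stream.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        start = 0
        while start < size:
            f.seek(min(start + chunk_bytes, size))
            f.readline()  # advance to the next row boundary
            end = min(f.tell(), size)
            f.seek(start)
            yield f.read(end - start)
            start = end
```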
Supporting gzipped CSV/TSV therefore seems to mean that the entire CSV file must fit into GPU memory =(
What is your question?
Simple question: does nvt.dataset support compressed TSV files, like day_0.gz in the Criteo dataset used in the example? Or must we decompress them in advance? Right now, a UnicodeDecodeError happens at the following line because NVTabular tries to open the given file as a text file even though it is gzip-compressed: https://github.com/NVIDIA/NVTabular/blob/master/nvtabular/io.py#L182
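The failure mode can be reproduced in a few lines: reading gzip bytes through a text-mode `open` raises UnicodeDecodeError, while sniffing the gzip magic first and switching to `gzip.open` works. This is a sketch of the kind of guard that line of io.py could use (the helper `read_head` is hypothetical, not NVTabular code):

```python
import gzip


def read_head(path):
    """Return the first line of `path`, transparently handling gzip.

    Opens the file as gzip if it starts with the gzip magic bytes
    (b"\x1f\x8b"), otherwise as plain text.
    """
    with open(path, "rb") as f:
        is_gzip = f.read(2) == b"\x1f\x8b"
    opener = gzip.open if is_gzip else open
    with opener(path, "rt") as f:
        return f.readline()
```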