NVIDIA / NeMo-Curator

Scalable toolkit for data curation
Apache License 2.0

Better align DocumentDataset's `read_*` functions with Dask's `read_*` functions #50

Open sarahyurick opened 1 month ago

sarahyurick commented 1 month ago

Right now, DocumentDataset has three read_* functions: (1)

def read_json(
    cls,
    input_files,
    backend="pandas",
    files_per_partition=1,
    add_filename=False,
): ...

(2)

def read_parquet(
    cls,
    input_files,
    backend="pandas",
    files_per_partition=1,
    add_filename=False,
): ...

(3)

def read_pickle(
    cls,
    input_files,
    backend="pandas",
    files_per_partition=1,
    add_filename=False,
): ...

It would be good if these functions supported the same parameters as Dask's read_json and read_parquet (Dask has no read_pickle, but we can perhaps look to pandas' read_pickle for that one).
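One way to sketch this is to accept arbitrary keyword arguments and forward them to the underlying reader. Everything below is illustrative, not the real NeMo-Curator API: the stand-in class reads plain JSON Lines to stay self-contained, whereas the real implementation would forward `**kwargs` to `dask.dataframe.read_json` / `read_parquet`.

```python
import json
import os

# Hypothetical sketch of the proposed API shape: extra keyword arguments
# (**kwargs) pass straight through to the underlying reader. In NeMo-Curator
# the forwarding target would be dask.dataframe.read_json; a plain JSON
# Lines reader is used here so the example runs anywhere.
class DocumentDataset:
    def __init__(self, records):
        self.records = records

    @classmethod
    def read_json(cls, input_files, backend="pandas",
                  files_per_partition=1, add_filename=False, **kwargs):
        # **kwargs (e.g. blocksize=, dtype=) would be forwarded to the
        # Dask reader; this stand-in ignores them.
        records = []
        for path in input_files:
            with open(path) as f:
                for line in f:
                    rec = json.loads(line)
                    if add_filename:
                        rec["filename"] = os.path.basename(path)
                    records.append(rec)
        return cls(records)
```

With this shape, any parameter Dask's reader grows in the future is available without touching DocumentDataset's signature.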

In addition to this, we can restructure our to_* functions as well.
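The write side could mirror the same pattern. This is a hypothetical sketch (the function name and parameters are illustrative): the real to_* functions would forward to Dask's writers rather than the plain-Python JSON Lines writer below.

```python
import json

# Hypothetical to_json sketch mirroring the read-side change: forward extra
# keyword arguments to the underlying writer. The plain JSON Lines loop is a
# stand-in for a Dask DataFrame to_json call.
def to_json(records, output_file, **kwargs):
    # In the real implementation **kwargs (e.g. compression=) would be
    # forwarded to the Dask writer; this stand-in ignores them.
    with open(output_file, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
```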

ayushdg commented 1 month ago

I believe the reason we have a custom read_json implementation is the ability to specify files_per_partition and combine multiple files into a single cuDF read_json call, which isn't supported in Dask DataFrame. Since read_parquet and a few others support many parameters out of the box, it makes sense to mimic Dask in the Parquet case.
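The files_per_partition behavior described above can be sketched as grouping the input paths and issuing one read call per group, so each group becomes one partition. In NeMo-Curator the per-group call would be cudf.read_json (which accepts a list of files in a single call, and would be wrapped for Dask); the sketch below substitutes pandas to stay self-contained.

```python
import pandas as pd

# Sketch of the files_per_partition idea: chunk the input paths and issue
# one reader call per chunk, so each chunk becomes a single partition.
# The per-chunk reader here is pandas; in NeMo-Curator it would be
# cudf.read_json, which can take the whole chunk in one call -- the
# capability plain Dask DataFrame lacks.
def read_files_per_partition(paths, files_per_partition=2):
    chunks = [paths[i:i + files_per_partition]
              for i in range(0, len(paths), files_per_partition)]
    return [
        pd.concat([pd.read_json(p, lines=True) for p in chunk],
                  ignore_index=True)
        for chunk in chunks
    ]
```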