NVIDIA / NeMo-Curator

Scalable data preprocessing and curation toolkit for LLMs
Apache License 2.0

Make DocumentDataset's `read_*` functions better mimic Dask's `read_*` functions #50

Open sarahyurick opened 7 months ago

sarahyurick commented 7 months ago

Right now, DocumentDataset has a few `read_*` functions: (1)

def read_json(
    cls,
    input_files,
    backend="pandas",
    files_per_partition=1,
    add_filename=False,
):
    ...

(2)

def read_parquet(
    cls,
    input_files,
    backend="pandas",
    files_per_partition=1,
    add_filename=False,
):
    ...

(3)

def read_pickle(
    cls,
    input_files,
    backend="pandas",
    files_per_partition=1,
    add_filename=False,
):
    ...

It would be good if these functions supported the same parameters as Dask's read_json and read_parquet (Dask has no read_pickle, but we could perhaps look to Pandas' read_pickle for that one).

In addition to this, we could restructure our to_* functions along the same lines.
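To make the "mimic Dask via kwargs" idea concrete, here is a minimal sketch (not the actual NeMo-Curator implementation; the class body and backend handling are simplified assumptions) of a read_parquet classmethod that simply forwards extra keyword arguments to Dask:

```python
import dask.dataframe as dd


class DocumentDataset:
    """Simplified stand-in for NeMo-Curator's DocumentDataset (illustration only)."""

    def __init__(self, dataset_df):
        self.df = dataset_df

    @classmethod
    def read_parquet(cls, input_files, backend="pandas", **kwargs):
        # Forward any Dask-supported options (columns=, filters=, etc.)
        # straight through to the underlying read_parquet call.
        if backend == "cudf":
            import dask_cudf  # GPU path, shown only as an assumption

            return cls(dask_cudf.read_parquet(input_files, **kwargs))
        return cls(dd.read_parquet(input_files, **kwargs))
```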

ayushdg commented 7 months ago

I believe the reason we have a custom read_json implementation is the ability to specify files_per_partition and combine multiple files into a single cudf.read_json call, which isn't supported in Dask DataFrame. Since Parquet and a few other formats support many parameters out of the box, it makes sense to mimic Dask in the Parquet case.
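As an illustration of why the custom path exists, a files_per_partition-style reader can be sketched with dask.delayed, grouping several files into one cudf.read_json call per partition (a simplified sketch under the assumption above that cuDF accepts a list of files; the helper name and batching logic are not NeMo-Curator's actual code):

```python
import dask.dataframe as dd
from dask import delayed


def read_json_batched(input_files, files_per_partition=2):
    """Build one Dask partition from a batch of JSONL files (illustration only)."""
    import cudf

    @delayed
    def read_batch(file_batch):
        # Read several JSONL files in a single cuDF call, per the comment above.
        return cudf.read_json(file_batch, lines=True)

    batches = [
        input_files[i : i + files_per_partition]
        for i in range(0, len(input_files), files_per_partition)
    ]
    return dd.from_delayed([read_batch(batch) for batch in batches])
```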

sarahyurick commented 2 months ago

From #46: " https://github.com/NVIDIA/NeMo-Curator/blob/main/examples/fuzzy_deduplication.py#L53-L60

Is there a reason you didn't use DocumentDataset.read_parquet? I would prefer to use that or expand its flexibility such that you can do what you need to do.

Yeah, the DocumentDataset.read_parquet functionality is a bit lacking in column selection support and is missing a few other config options. I'd prefer the DocumentDataset.read_parquet method to mimic Dask's read_parquet for the time being.

I would be interested in that discussion as well. My intuition is that we should mimic the behavior of Dask as much as possible, but there might be good reasons to deviate.

Yeah, I agree that the goal should be to mimic Dask's read_* functions as closely as possible, probably with kwargs. "
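For comparison, this is what column selection looks like in plain Dask today, and what the equivalent DocumentDataset call could look like once kwargs are forwarded (the DocumentDataset line is hypothetical, not the current API):

```python
import dask.dataframe as dd

# Plain Dask today: select only the columns that are needed.
df = dd.read_parquet("data/*.parquet", columns=["id", "text"])

# Hypothetical DocumentDataset equivalent, assuming kwargs are forwarded:
# dataset = DocumentDataset.read_parquet("data/*.parquet", columns=["id", "text"])
```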

sarahyurick commented 2 months ago

From #130: " Couple of things here:

"

and

" I agree with the spirit of having consistent IO format but we wont be able to do it till we address https://github.com/NVIDIA/NeMo-Curator/issues/50, like

For now, I will link https://github.com/NVIDIA/NeMo-Curator/issues/50 here and merge get_remaining_files. I hope that's a good middle path. "

sarahyurick commented 2 months ago

From #77: " Do you think we should move away from input_meta in favor of a keyword like dtype (like Pandas' and cuDF's read_json) and having the user configure prune_columns themselves?

I'm generally in favor of overhauling the IO helpers in the current setup for something better when we tackle https://github.com/NVIDIA/NeMo-Curator/issues/50. I'll share more thoughts there, but moving toward encouraging users to use the read_xyz APIs is easier. We can then have a common helper that, based on the file type, directs to the relevant read_xyz API, rather than the other way around where read_json goes to a common read method that handles different formats.

Regarding prune_columns specifically: this change is important in newer versions of RAPIDS because many public datasets like rpv1 do not have consistent metadata across all of their files. If we do not prune columns to just the ID and text columns, cuDF will now fail with inconsistent-metadata errors. "
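A minimal sketch of the dtype-plus-pruning approach using plain Dask (the column names "id" and "text" are placeholders, not NeMo-Curator's actual schema or its input_meta machinery):

```python
import dask.dataframe as dd

# Read JSONL with explicit dtypes for the columns of interest, then prune
# to just the ID and text columns so inconsistent extra fields across
# files cannot trigger metadata mismatch errors downstream.
ddf = dd.read_json(
    "dataset/*.jsonl",
    lines=True,
    dtype={"id": "str", "text": "str"},
)[["id", "text"]]
```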

sarahyurick commented 1 month ago

Related PRs:

sarahyurick commented 1 month ago

Another TODO: Support for .json.gz.
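For reference, Dask's read_json can already read gzipped JSONL through its compression argument, so a sketch of the desired behavior (assuming line-delimited .json.gz inputs) looks like:

```python
import dask.dataframe as dd

# compression="infer" (the default) picks up the .gz suffix; with compressed
# inputs each file becomes a single partition rather than being split by blocksize.
ddf = dd.read_json("dataset/*.json.gz", lines=True, compression="infer")
```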