NVIDIA / NeMo-Curator

Scalable data preprocessing and curation toolkit for LLMs
Apache License 2.0

Make DocumentDataset's `read_*` functions better mimic Dask's `read_*` functions #50

Open sarahyurick opened 7 months ago

sarahyurick commented 7 months ago

Right now, DocumentDataset has a few `read_*` functions: (1)

def read_json(
    cls,
    input_files,
    backend="pandas",
    files_per_partition=1,
    add_filename=False,
):
    ...

(2)

def read_parquet(
    cls,
    input_files,
    backend="pandas",
    files_per_partition=1,
    add_filename=False,
):
    ...

(3)

def read_pickle(
    cls,
    input_files,
    backend="pandas",
    files_per_partition=1,
    add_filename=False,
):
    ...

It would be good if these functions supported the same parameters as Dask's read_json and read_parquet (Dask has no read_pickle, but we could perhaps look to Pandas' read_pickle for that one).

In addition to this, we could restructure our to_* functions along the same lines.
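To make the "mimic Dask via kwargs" idea concrete, here is a minimal sketch (not the actual NeMo-Curator implementation; the class body and backend handling are simplified assumptions) of a read_parquet classmethod that simply forwards extra keyword arguments to Dask:

```python
import dask.dataframe as dd


class DocumentDataset:
    """Simplified stand-in for NeMo-Curator's DocumentDataset (illustration only)."""

    def __init__(self, dataset_df):
        self.df = dataset_df

    @classmethod
    def read_parquet(cls, input_files, backend="pandas", **kwargs):
        # Forward any Dask-supported options (columns=, filters=, etc.)
        # straight through to the underlying read_parquet call.
        if backend == "cudf":
            import dask_cudf  # GPU path, shown only as an assumption

            return cls(dask_cudf.read_parquet(input_files, **kwargs))
        return cls(dd.read_parquet(input_files, **kwargs))
```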

ayushdg commented 7 months ago

I believe the reason we have a custom read_json implementation is the ability to specify files_per_partition and combine multiple files into a single cudf.read_json call, which isn't supported in Dask DataFrame. Since Parquet and a few other formats support many parameters out of the box, it makes sense to mimic Dask in the Parquet case.
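As an illustration of why the custom path exists, a files_per_partition-style reader can be sketched with dask.delayed, grouping several files into one cudf.read_json call per partition (a simplified sketch under the assumption above that cuDF accepts a list of files; the helper name and batching logic are not NeMo-Curator's actual code):

```python
import dask.dataframe as dd
from dask import delayed


def read_json_batched(input_files, files_per_partition=2):
    """Build one Dask partition from a batch of JSONL files (illustration only)."""
    import cudf

    @delayed
    def read_batch(file_batch):
        # Read several JSONL files in a single cuDF call, per the comment above.
        return cudf.read_json(file_batch, lines=True)

    batches = [
        input_files[i : i + files_per_partition]
        for i in range(0, len(input_files), files_per_partition)
    ]
    return dd.from_delayed([read_batch(batch) for batch in batches])
```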

sarahyurick commented 2 months ago

From #46: " https://github.com/NVIDIA/NeMo-Curator/blob/main/examples/fuzzy_deduplication.py#L53-L60

Is there a reason you didn't use DocumentDataset.read_parquet? I would prefer to use that or expand its flexibility such that you can do what you need to do.

Yeah, the DocumentDataset.read_parquet functionality is a bit lacking in column selection support and is missing a few other config options. I'd prefer the DocumentDataset.read_parquet method to mimic Dask's read_parquet for the time being.

I would be interested in that discussion as well. My intuition is that we should mimic the behavior of Dask as much as possible, but there might be good reasons to deviate.

Yeah, I agree that the goal should be to mimic Dask's read_* functions as closely as possible, probably with kwargs. "
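For comparison, this is what column selection looks like in plain Dask today, and what the equivalent DocumentDataset call could look like once kwargs are forwarded (the DocumentDataset line is hypothetical, not the current API):

```python
import dask.dataframe as dd

# Plain Dask today: select only the columns that are needed.
df = dd.read_parquet("data/*.parquet", columns=["id", "text"])

# Hypothetical DocumentDataset equivalent, assuming kwargs are forwarded:
# dataset = DocumentDataset.read_parquet("data/*.parquet", columns=["id", "text"])
```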

sarahyurick commented 2 months ago

From #130: " Couple of things here:

"

and

" I agree with the spirit of having consistent IO format but we wont be able to do it till we address https://github.com/NVIDIA/NeMo-Curator/issues/50, like

For now, I will link https://github.com/NVIDIA/NeMo-Curator/issues/50 here and merge get_remaining_files. I hope that's a good middle path. "

sarahyurick commented 2 months ago

From #77: " Do you think we should move away from input_meta in favor of a keyword like dtype (like Pandas' and cuDF's read_json) and having the user configure prune_columns themselves?

I'm generally in favor of overhauling the IO helpers in the current setup for something better when we tackle https://github.com/NVIDIA/NeMo-Curator/issues/50. I'll share more thoughts there, but moving toward encouraging users to use the read_xyz APIs is easier. We can then have a common helper that, based on the file type, directs to the relevant read_xyz API, rather than the other way around where read_json goes to a common read method that handles different formats.

Regarding prune_columns specifically: this change is important in newer versions of RAPIDS because many public datasets like rpv1 do not have consistent metadata across all of their files. If we do not prune columns to just the ID and text columns, cuDF will now fail with inconsistent-metadata errors. "
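A minimal sketch of the dtype-plus-pruning approach using plain Dask (the column names "id" and "text" are placeholders, not NeMo-Curator's actual schema or its input_meta machinery):

```python
import dask.dataframe as dd

# Read JSONL with explicit dtypes for the columns of interest, then prune
# to just the ID and text columns so inconsistent extra fields across
# files cannot trigger metadata mismatch errors downstream.
ddf = dd.read_json(
    "dataset/*.jsonl",
    lines=True,
    dtype={"id": "str", "text": "str"},
)[["id", "text"]]
```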

sarahyurick commented 1 month ago

Related PRs:

sarahyurick commented 1 month ago

Another TODO: Support for .json.gz.
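For reference, Dask's read_json can already read gzipped JSONL through its compression argument, so a sketch of the desired behavior (assuming line-delimited .json.gz inputs) looks like:

```python
import dask.dataframe as dd

# compression="infer" (the default) picks up the .gz suffix; with compressed
# inputs each file becomes a single partition rather than being split by blocksize.
ddf = dd.read_json("dataset/*.json.gz", lines=True, compression="infer")
```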