Open sarahyurick opened 7 months ago
I believe the reason we have a custom read_json implementation is the ability to specify files_per_partition and to combine multiple files into a single read_json call from cuDF, which isn't supported in Dask DataFrame. Since parquet and a few others have support for many params ootb, it makes sense to mimic Dask in the parquet case.
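For context, here is a rough sketch of what files_per_partition buys us. This is not NeMo-Curator's actual implementation, just an illustration of the idea: group the input files and hand each group to one cudf.read_json call per partition.

```python
import cudf
import dask.dataframe as dd
from dask import delayed

def read_json_files_per_partition(file_paths, files_per_partition=2):
    # Group the input files into chunks; each chunk becomes one Dask partition
    # backed by a single cudf.read_json call over several files at once.
    chunks = [
        file_paths[i : i + files_per_partition]
        for i in range(0, len(file_paths), files_per_partition)
    ]
    parts = [delayed(cudf.read_json)(chunk, lines=True) for chunk in chunks]
    # Without meta=, Dask infers the schema from the first partition.
    return dd.from_delayed(parts)
```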
From #46: "https://github.com/NVIDIA/NeMo-Curator/blob/main/examples/fuzzy_deduplication.py#L53-L60
Is there a reason you didn't use DocumentDataset.read_parquet? I would prefer to use that or expand its flexibility such that you can do what you need to do.
Yeah, the DocumentDataset.read_parquet functionality is a bit lacking in column support and a few other missing config options. I'd prefer the DocumentDataset.read_parquet method to mimic Dask's read_parquet for the time being.
I would be interested in that discussion as well. My intuition is that we should mimic the behavior of Dask as much as possible, but there might be good reasons to deviate.
Yeah, I agree that the goal should be to mimic Dask's read_* functions as best as possible, probably with kwargs.
"
From #130: "Couple of things here:
The num_samples parameter is misleading in its name. A sample typically refers to a single document, while in this case it appears to be referring to a file. Can this be renamed to num_files or num_shards?
I am not a fan of having another unique function for reading files/figuring out what files to be read. It makes code much more confusing and harder to maintain. I want to enforce some kind of consistency. Even within semantic dedup, each of the three CLI scripts has a different way of reading in files:
compute_embeddings.py (this script) uses read_data with the new get_input_files function.
clustering.py uses dask_cudf.read_parquet.
extract_dedup_data.py reads the files deep in SemanticClusterLevelDedup.compute_semantic_match_dfs, which eventually calls cudf.read_parquet.
We already have noted that our file operations are not easy to work with, and our future plans are only going to get harder as we introduce more ways of reading in files.
The way I want users (and us) to read in files (right now) is this:
DocumentDataset.read_* whenever you know the datatype at the time of writing the script.
read_data whenever you don't. We should eventually make a similar function directly in DocumentDataset, but that's beside the point.
read_data should be the way to go within the CLI scripts. get_remaining_files or get_all_files_paths_under can be helpers for that function if needed (I'm not a fan of having two helpers in the first place either, but again, beside the point). I'd rather not have a new helper method like this in the mix too. In this case, perhaps we could merge this function with the get_remaining_files function. See my comment below for more on that.
Furthermore, we shouldn't need to be working around our file operations. If we feel that we need to do that, we should modify them instead to fit our use case. I know we're on a crunch right now, but anything you can do to get us closer to the ideal case I mentioned above would be great.
"
and
" I agree with the spirit of having consistent IO format but we wont be able to do it till we address https://github.com/NVIDIA/NeMo-Curator/issues/50, like
compute_embeddings.py (this script) uses read_data
with the new get_input_files
function.
Agreed, merging with read_data
.
clustering.py uses dask_cudf.read_parquet
because we don't have a block-wise support which is important from performance. Once we fix it, I am happy to revisit this.
SemanticClusterLevelDedup.compute_semantic_match_dfs
calls cudf.read_parquet
, Unfortunately there is no straightforward way for this. We should pick the tool as needed especially for complex workflows so I think we are stuck there.
For now, I will link https://github.com/NVIDIA/NeMo-Curator/issues/50 here and merge get_remaining_files
. I hope that's a good middle path.
"
From #77:
"
Do you think we should move away from input_meta in favor of a keyword like dtype (like Pandas' and cuDF's read_json) and having the user configure prune_columns themselves?
I'm generally in favor of overhauling the IO helpers in the current setup for something better. When we tackle https://github.com/NVIDIA/NeMo-Curator/issues/50, I'll share more thoughts there, but moving toward encouraging users to use the read_xyz APIs is easier. We can then have a common helper that, based on the file type, directs to the relevant read_xyz API, rather than the other way around where read_json goes to a common read method that handles different formats.
Regarding prune_columns specifically: this change is important in newer versions of RAPIDS because many public datasets like rpv1 do not have consistent metadata across all their files. If we do not prune columns to just ID & Text, cuDF will now fail with inconsistent metadata errors.
"
Another TODO: Support for .json.gz.
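One possible angle on that TODO (just a sketch, not a committed design): Dask's read_json already understands gzipped input via its compression keyword, so a kwargs passthrough would largely cover this case.

```python
import dask.dataframe as dd

# Keep blocksize unset: gzip files cannot be split into blocks.
df = dd.read_json("data/*.json.gz", lines=True, compression="gzip")
```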
Right now, DocumentDataset has a couple of read_* functions: (1) (2) (3)
It would be good if these functions could support Dask's read_json and read_parquet parameters (there is no read_pickle function in Dask, but we can perhaps look to Pandas for this). In addition to this, we can restructure our to_* functions as well.
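For example, the read_* wrappers could look something like this. These are hypothetical signatures for illustration, not the current DocumentDataset API; the read_pickle fallback mirrors the Pandas suggestion above.

```python
import dask.dataframe as dd
import pandas as pd
from dask import delayed

def read_parquet(input_files, **kwargs):
    # e.g. columns=[...], filters=[...], index=False
    return dd.read_parquet(input_files, **kwargs)

def read_json(input_files, **kwargs):
    # e.g. dtype={...}, lines=True, compression="gzip"
    return dd.read_json(input_files, **kwargs)

def read_pickle(input_files, **kwargs):
    # Dask has no read_pickle, so fall back to pandas.read_pickle per file.
    parts = [delayed(pd.read_pickle)(f, **kwargs) for f in input_files]
    return dd.from_delayed(parts)
```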