NVIDIA / NeMo-Curator

Scalable toolkit for data curation
Apache License 2.0

Fix bug #105 #111

Open nicoleeeluo opened 2 weeks ago

nicoleeeluo commented 2 weeks ago

Description

This PR fixes the bug reported in #105 by updating the code to the latest version.

miguelusque commented 2 weeks ago

Hi @nicoleeeluo ,

Thank you for your fix.

Instead of passing None to input_meta in the notebook, I would change io_utils.py so that input_meta defaults to None, as follows:

def get_text_ddf_from_json_path_with_blocksize(input_data_paths, num_files, blocksize, id_column, text_column, input_meta=None):

That would fix not only this notebook, but any other potential case.
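
To illustrate why a default value helps, here is a minimal sketch. It only borrows the signature proposed above: the function body is a simplified stand-in, not the actual NeMo-Curator implementation, and the file name, blocksize string, and dictionary passed as input_meta are assumptions made up for the example.

```python
def get_text_ddf_from_json_path_with_blocksize(
    input_data_paths, num_files, blocksize, id_column, text_column, input_meta=None
):
    # Stand-in body (assumption): the real function reads JSON files into a
    # Dask DataFrame. Here we only echo the arguments to show the default.
    return {
        "id_column": id_column,
        "text_column": text_column,
        "input_meta": input_meta,
    }


# A caller that omits input_meta works without any change on its side.
without_meta = get_text_ddf_from_json_path_with_blocksize(
    ["data.jsonl"], 1, "64MB", "id", "text"
)

# A caller that needs input_meta can still pass it explicitly.
# The dictionary below is a hypothetical value chosen for illustration.
with_meta = get_text_ddf_from_json_path_with_blocksize(
    ["data.jsonl"], 1, "64MB", "id", "text", input_meta={"id": "str", "text": "str"}
)
```

With the default in place, callers such as the tutorial notebook would not need to pass None themselves.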

Looking forward to your thoughts on it. Thanks!

ryantwolf commented 1 week ago

I agree with @miguelusque's suggestion; it should allow you to keep your tutorial unchanged!

nicoleeeluo commented 5 days ago

@miguelusque @ryantwolf Changing io_utils.py sounds good to me! If the change to io_utils.py has been made, I will close this PR.