NVIDIA / NeMo-Curator

Scalable toolkit for data curation
Apache License 2.0
327 stars, 32 forks

Allow multiple filenames per partition when separating by metadata #99

Closed ayushdg closed 13 hours ago

ayushdg commented 3 weeks ago

Description

Closes #89

Adds an option to handle partitions that contain more than one filename when writing with write_to_filename=True. This typically happens when files are read in with files_per_partition > 1, since each partition then holds rows from several input files.

Usage

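from nemo_curator.datasets import DocumentDataset
from nemo_curator.utils.distributed_utils import write_to_disk

dataset = DocumentDataset.read_json(path, files_per_partition=5, include_filename=True)
write_to_disk(dataset.df, output_file_dir=path, write_to_filename=True)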
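
To illustrate the idea behind the change, here is a minimal pandas-only sketch of splitting a single partition by filename. It is not the code from this PR: the helper name `write_partition_by_filename`, the `filename` column name, and the choice to drop that column before writing are all assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd


def write_partition_by_filename(partition, output_dir):
    """Hypothetical helper: write one JSONL file per distinct `filename` value."""
    os.makedirs(output_dir, exist_ok=True)
    written = []
    # A partition read with files_per_partition > 1 can mix rows from
    # several source files, so group the rows by their source filename first.
    for name, group in partition.groupby("filename", sort=False):
        out_path = os.path.join(output_dir, os.path.basename(name))
        # Dropping the helper column is a choice of this sketch, not
        # necessarily what the library does.
        group.drop(columns="filename").to_json(out_path, orient="records", lines=True)
        written.append(out_path)
    return written


# One partition holding rows that came from two different input files.
partition = pd.DataFrame(
    {
        "text": ["a", "b", "c"],
        "filename": ["x.jsonl", "y.jsonl", "x.jsonl"],
    }
)
out_dir = tempfile.mkdtemp()
paths = write_partition_by_filename(partition, out_dir)
```

With a Dask DataFrame, a function like this would presumably be applied to each partition, so every output file is named after the input file its rows came from, regardless of how many files were packed into the partition.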

Checklist