NVIDIA / NeMo-Curator

Scalable data preprocessing and curation toolkit for LLMs
Apache License 2.0

[FEA] Add --meta parameter to explicitly specify the jsonl field dtypes #63

Status: Closed (closed by miguelusque 4 months ago)

miguelusque commented 4 months ago

Is your feature request related to a problem? Please describe.

When reading jsonl files with Dask, the dataframe data types are inferred unless they are explicitly specified.

Inferring the data types can cause several problems, including incorrectly inferred types, degraded performance, and increased memory usage.

I think we could mitigate those issues by adding a --meta parameter that receives a dictionary mapping each jsonl field to its data type.

That parameter would be optional, and would be similar to the meta argument of dask.dataframe.read_json, documented here: https://docs.dask.org/en/latest/generated/dask.dataframe.read_json.html.
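As a rough illustration of the proposal, here is a minimal sketch of how an optional --meta flag could be parsed into a dictionary of dtypes. It uses only the Python standard library. The helper names parse_meta and build_parser, the JSON-object input format, and the example fields id and text are assumptions made for this sketch; they are not taken from NeMo-Curator or from the linked PR.

```python
import argparse
import json


def parse_meta(value):
    """Parse a JSON object such as '{"id": "int64"}' into a dict of dtype names.

    Hypothetical helper: the JSON input format is an assumption of this sketch.
    """
    try:
        meta = json.loads(value)
    except json.JSONDecodeError as exc:
        raise argparse.ArgumentTypeError(f"--meta must be valid JSON: {exc}")
    # Require a mapping of field name -> dtype string, e.g. {"text": "str"}.
    if not isinstance(meta, dict) or not all(isinstance(v, str) for v in meta.values()):
        raise argparse.ArgumentTypeError("--meta must map field names to dtype strings")
    return meta


def build_parser():
    """Build a parser with an optional --meta flag (hypothetical CLI)."""
    parser = argparse.ArgumentParser(description="Hypothetical jsonl reader options")
    parser.add_argument(
        "--meta",
        type=parse_meta,
        default=None,  # None means: keep the current behaviour and infer dtypes
        help='JSON object mapping jsonl fields to dtypes, e.g. \'{"id": "int64"}\'',
    )
    return parser


# With the flag: the dtypes are available as a plain dict.
args = build_parser().parse_args(["--meta", '{"id": "int64", "text": "str"}'])

# Without the flag: meta stays None, so callers can fall back to inference.
omitted = build_parser().parse_args([])
```

Because the flag defaults to None, existing invocations would keep inferring types, and only callers that pass --meta would get explicit dtypes. How the resulting dictionary is forwarded to the jsonl reader is left out of this sketch.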

miguelusque commented 4 months ago

I will work on this feature.

miguelusque commented 4 months ago

PR https://github.com/NVIDIA/NeMo-Curator/pull/75.