NVIDIA / NeMo-Curator

Scalable toolkit for data curation
Apache License 2.0
327 stars 32 forks

[WIP][Experiment] Use multi-threading for multi-file json reads #94

Open rjzamora opened 4 weeks ago

rjzamora commented 4 weeks ago

Experimental change to improve IO performance when multiple JSON files are mapped to each Dask DataFrame partition.

Context: I was originally exploring a similar optimization to improve remote-storage performance, and found a significant perf bump for local storage as well.
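
For readers unfamiliar with the idea, here is a minimal stdlib-only sketch of reading the files of one partition with a thread pool. It is not the code in this PR: the function names `read_jsonl_file` and `read_partition`, the `max_workers` parameter, and the assumption that each file is in JSON Lines format are all hypothetical and chosen for illustration. In a dataframe setting each file would presumably be parsed into a dataframe and concatenated instead of flattened into a list.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def read_jsonl_file(path):
    """Parse one JSON Lines file into a list of records (assumed format)."""
    with open(path, "r", encoding="utf-8") as f:
        # Skip blank lines; every other line is expected to hold one JSON value.
        return [json.loads(line) for line in f if line.strip()]


def read_partition(paths, max_workers=4):
    """Read every file mapped to one partition, overlapping file IO with threads.

    `max_workers` is an illustrative knob, not a setting taken from the PR.
    """
    paths = list(paths)
    if not paths:
        return []
    workers = min(max_workers, len(paths))
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # executor.map yields results in input order, so file order is preserved.
        per_file = list(executor.map(read_jsonl_file, paths))
    # Flatten the per-file record lists into a single partition.
    return [record for records in per_file for record in records]
```

Threads mainly help here because file and network reads release the GIL while waiting on IO. The JSON parsing itself in this pure-Python sketch still runs one thread at a time, so any speedup would depend on how much of the total time is spent waiting on storage.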

rjzamora commented 4 weeks ago

cc @ayushdg