NVIDIA / NeMo-Curator

Scalable data preprocessing and curation toolkit for LLMs
Apache License 2.0
478 stars 57 forks

[BUG] download process has memory leak during extraction to jsonl #38

Open zahramahani opened 5 months ago

zahramahani commented 5 months ago

Describe the bug

Whenever I run the download_common_crawl.py script in the examples folder, it starts extracting the data once the shards have been downloaded. During extraction, warnings appear saying that memory is not being freed, and after a while the process is killed.

The second problem is that extraction is very slow for me: 10 shards take about 1 hour and 45 minutes. Is there some extra configuration that I have missed?

Here is a screenshot of the problem:

Screenshot from 2024-04-19 00-17-05

Environment overview

Environment details