NVIDIA / NeMo-Curator

Scalable data preprocessing and curation toolkit for LLMs
Apache License 2.0
477 stars 57 forks

[FEA] Improve download and extract utility #80

Closed ryantwolf closed 3 months ago

ryantwolf commented 4 months ago

Errors when incorrectly using the download and extraction utilities are hard to debug, and the purpose of download and extraction can be unclear.

Describe the solution you'd like

For download_common_crawl, when the snapshot numbers are invalid, we get this error:

Traceback (most recent call last):
  File "/workspace/NeMo-Curator/examples/download_common_crawl.py", line 53, in <module>
    main(attach_args().parse_args())
  File "/workspace/NeMo-Curator/examples/download_common_crawl.py", line 35, in main
    common_crawl = download_common_crawl(
  File "/usr/local/lib/python3.10/dist-packages/nemo_curator/download/commoncrawl.py", line 342, in download_common_crawl
    dataset = download_and_extract(
  File "/usr/local/lib/python3.10/dist-packages/nemo_curator/download/doc_builder.py", line 185, in download_and_extract
    df = dd.from_map(
  File "/usr/local/lib/python3.10/dist-packages/dask/dataframe/io/io.py", line 992, in from_map
    raise ValueError("All `iterables` must have a non-zero length")
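
The last frame suggests that the list handed to dd.from_map was empty, presumably because the invalid snapshot range matched no URLs. As a rough sketch of the kind of early check that would make this easier to debug (ensure_urls_found and its parameters are hypothetical names for illustration, not existing NeMo Curator functions):

```python
from typing import List, Sequence


def ensure_urls_found(
    urls: Sequence[str], start_snapshot: str, end_snapshot: str
) -> List[str]:
    """Hypothetical helper: fail early with a readable message.

    Intended to run before the URL list is passed on to Dask, so the
    user sees which inputs produced nothing instead of a Dask-internal
    "non-zero length" error.
    """
    if len(urls) == 0:
        raise ValueError(
            f"No URLs found for snapshots {start_snapshot!r} to "
            f"{end_snapshot!r}. Check that both snapshots exist and that "
            "the start snapshot is not later than the end snapshot."
        )
    return list(urls)
```

With a check like this, the error would name the offending snapshot arguments rather than surfacing from inside dask/dataframe/io/io.py.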

We should do the following:

Furthermore, we should clarify the purpose of the download_and_extract utility. Users should not be under the impression that NeMo Curator does web crawling. We can optionally provide a simple download and extraction utility that operates on raw URLs.
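
To illustrate what a simple utility over raw URLs might look like, here is a minimal standard-library sketch. It only fetches the URLs it is given and never follows links, which is the distinction from web crawling. The names download, extract, and download_and_extract_raw are hypothetical, and the sketch does not use Dask or reflect NeMo Curator's actual download_and_extract interface.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path
from typing import Dict, Iterable, List
from urllib.parse import urlparse


def download(url: str, out_dir: Path) -> Path:
    """Fetch one URL into out_dir, skipping files that already exist."""
    out_dir.mkdir(parents=True, exist_ok=True)
    name = Path(urlparse(url).path).name
    if not name:
        raise ValueError(f"Cannot derive a file name from URL: {url!r}")
    target = out_dir / name
    if not target.exists():
        with urllib.request.urlopen(url) as response, open(target, "wb") as f:
            shutil.copyfileobj(response, f)
    return target


def extract(path: Path) -> str:
    """Return the file contents as text, decompressing .gz files."""
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rb") as f:
        return f.read().decode("utf-8", errors="replace")


def download_and_extract_raw(urls: Iterable[str], out_dir: Path) -> List[Dict[str, str]]:
    """Download each given URL and return one record per URL."""
    urls = list(urls)
    if not urls:
        # Fail with a clear message instead of an error deeper in the stack.
        raise ValueError("No URLs were provided; nothing to download.")
    return [{"url": u, "text": extract(download(u, out_dir))} for u in urls]
```

Usage would be along the lines of download_and_extract_raw(["https://example.com/data.txt.gz"], Path("downloads")), where the URL is a placeholder.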