[FEA] Improve download and extract utility

Errors when incorrectly using the download and extraction utilities are hard to debug, and the purpose of download and extraction can be unclear.

Describe the solution you'd like

For download_common_crawl, when the snapshot numbers are invalid, we get this error:

Traceback (most recent call last):
  File "/workspace/NeMo-Curator/examples/download_common_crawl.py", line 53, in <module>
    main(attach_args().parse_args())
  File "/workspace/NeMo-Curator/examples/download_common_crawl.py", line 35, in main
    common_crawl = download_common_crawl(
  File "/usr/local/lib/python3.10/dist-packages/nemo_curator/download/commoncrawl.py", line 342, in download_common_crawl
    dataset = download_and_extract(
  File "/usr/local/lib/python3.10/dist-packages/nemo_curator/download/doc_builder.py", line 185, in download_and_extract
    df = dd.from_map(
  File "/usr/local/lib/python3.10/dist-packages/dask/dataframe/io/io.py", line 992, in from_map
    raise ValueError("All `iterables` must have a non-zero length")

We should do the following:

Update the documentation to explicitly call out that the snapshot numbers must be valid.
Provide a link to the list of valid snapshots in the documentation.
Catch this error and re-raise it with a more informative message.
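
As a rough illustration of the third item, here is a minimal sketch of catching the dask error and re-raising it with context. The names InvalidSnapshotError, build_with_clear_error, and fake_from_map are hypothetical and not part of NeMo Curator; fake_from_map only stands in for dd.from_map so the example runs without dask.

```python
from typing import Any, Callable, Sequence


class InvalidSnapshotError(ValueError):
    """Hypothetical error: the snapshot range resolved to no files."""


def build_with_clear_error(
    urls: Sequence[str],
    start_snapshot: str,
    end_snapshot: str,
    build: Callable[[Sequence[str]], Any],
) -> Any:
    """Call `build(urls)` and translate the empty-input error from dask."""
    try:
        return build(urls)
    except ValueError as err:
        # Only translate the specific "empty iterables" failure; let any
        # other ValueError propagate unchanged.
        if "non-zero length" not in str(err):
            raise
        raise InvalidSnapshotError(
            f"No files were found for snapshots {start_snapshot!r} to "
            f"{end_snapshot!r}. Check that both are valid Common Crawl "
            "snapshot identifiers."
        ) from err


def fake_from_map(urls: Sequence[str]) -> list:
    """Stand-in for dd.from_map: same message as the traceback above."""
    if len(urls) == 0:
        raise ValueError("All `iterables` must have a non-zero length")
    return list(urls)
```

Checking that the URL list is non-empty before calling dd.from_map at all would avoid matching on the error message text, which may be the more robust choice.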

Furthermore, we should clarify the purpose of the download_and_extract utility. Users should not be under the impression that NeMo Curator does web crawling. We can optionally provide a simple download and extraction utility that operates on raw URLs.
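
One possible shape for such a raw-URL utility, sketched with the standard library only. The function name download_and_extract_url and its behavior (fetch one URL, gunzip it if it ends in .gz) are assumptions for illustration, not an existing NeMo Curator API. It does no crawling: it only fetches the exact URL it is given.

```python
import gzip
import shutil
import urllib.parse
import urllib.request
from pathlib import Path


def download_and_extract_url(url: str, output_dir: str) -> Path:
    """Download a single URL into output_dir and gunzip it if needed.

    Hypothetical helper; returns the path of the final (extracted) file.
    """
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Derive the local file name from the last path segment of the URL.
    name = Path(urllib.parse.urlparse(url).path).name
    if not name:
        raise ValueError(f"Cannot derive a file name from URL: {url!r}")

    downloaded = out_dir / name
    with urllib.request.urlopen(url) as response, open(downloaded, "wb") as fh:
        shutil.copyfileobj(response, fh)

    # Anything that is not gzip-compressed is returned as downloaded.
    if downloaded.suffix != ".gz":
        return downloaded

    extracted = downloaded.with_suffix("")
    with gzip.open(downloaded, "rb") as src, open(extracted, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return extracted
```

A real version would likely also need retries, other archive formats, and a way to plug in the existing extractor classes; those are left out here.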

NVIDIA / NeMo-Curator

[FEA] Improve download and extract utility #80