NVIDIA / NeMo-Curator

Scalable toolkit for data curation
Apache License 2.0

[REVIEW] Add Resiliparse option for text extraction #128

Open sarahyurick opened 4 days ago

sarahyurick commented 4 days ago

Duplicate of https://github.com/NVIDIA/NeMo-Curator/pull/90 with successful DCO check.

Right now, we only support Common Crawl text extraction with jusText. Resiliparse is known to be a faster text extraction algorithm which may also produce better tokens.

This PR adds optional support for the Resiliparse algorithm while still keeping jusText as the default.

sarahyurick commented 1 day ago

Hi @ryantwolf this is ready for another review.

The only question I have is what you think the best way to go about adding the unit tests is? Locally I'm testing it with download_common_crawl.py, but that seems a bit heavy for CI. Even a single snapshot with url_limit = 10 takes at least a couple of minutes to run.

Edit: Perhaps it may be sufficient to add examples of JusTextExtraction() and ResiliparseExtraction() to download_common_crawl.py?

ryantwolf commented 17 hours ago

> The only question I have is what you think the best way to go about adding the unit tests is?

Good question. I wouldn't add a unit test for the download_and_extract. I would just target each algorithm's extract_text method. You could draft like one or two simple html pages and make a unit test using each algorithm. Just something to make sure the behavior stays consistent.
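A minimal sketch of that kind of test, assuming each algorithm object exposes an extract_text method that takes an HTML string and returns a list of paragraph strings (the exact signature in NeMo Curator may differ). To keep the sketch runnable without Common Crawl or either extraction library, it uses a hypothetical stand-in class, StandInExtraction, built on the standard library. In the real tests, JusTextExtraction() or ResiliparseExtraction() would be passed to the same check.

```python
from html.parser import HTMLParser

# Tiny fixture page: two content paragraphs plus script/style noise.
SAMPLE_HTML = """<html>
<head><title>Test page</title><style>p { color: red; }</style></head>
<body>
<script>var tracking = 1;</script>
<p>The quick brown fox jumps over the lazy dog.</p>
<p>Text extraction should keep this sentence.</p>
</body>
</html>"""


class _ParagraphCollector(HTMLParser):
    """Collects the text found inside <p> elements only."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self.paragraphs[-1] += data


class StandInExtraction:
    """Hypothetical stand-in for an extraction algorithm object.

    Assumption: the real algorithm classes expose a comparable
    extract_text method; their actual signature may differ.
    """

    def extract_text(self, html):
        collector = _ParagraphCollector()
        collector.feed(html)
        collector.close()
        return [p.strip() for p in collector.paragraphs if p.strip()]


def check_extraction(algorithm):
    """Assertions a unit test could run against any algorithm object."""
    paragraphs = algorithm.extract_text(SAMPLE_HTML)
    text = "\n".join(paragraphs)
    # Content paragraphs must survive extraction.
    assert "quick brown fox" in text
    assert "keep this sentence" in text
    # Script and style contents must not leak into the output.
    assert "tracking" not in text
    assert "color: red" not in text
    return paragraphs


result = check_extraction(StandInExtraction())
```

Because the fixture is a short inline string, a check like this runs in milliseconds and needs no network access, which avoids the CI cost of running download_common_crawl.py.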

> Perhaps it may be sufficient to add examples of JusTextExtraction() and ResiliparseExtraction() to download_common_crawl.py?

It isn't a bad idea to showcase how users can change the algorithm. Do you mind updating the download.rst docs instead? I'm thinking that right after we explain what output_type="jsonl" does, you could add a snippet about the algorithm parameter. Or you could add another small code block like:

```python
from nemo_curator.download import (
    download_common_crawl,
    ResiliparseExtraction,
)

# Change the extraction algorithm
extraction_algorithm = ResiliparseExtraction()
common_crawl = download_common_crawl(
    "/extracted/output/folder",
    "2020-50",
    "2021-04",
    output_type="jsonl",
    algorithm=extraction_algorithm,
)
```

sarahyurick commented 16 hours ago

Thanks @ryantwolf ! Should be ready now.