Open sarahyurick opened 4 days ago
Hi @ryantwolf this is ready for another review.
The only question I have is what you think the best way to go about adding the unit tests is? Locally I'm testing it with download_common_crawl.py, but that seems a bit heavy for CI. Even doing a single snapshot with url_limit = 10 takes at least a couple of minutes to run.
Edit: Perhaps it may be sufficient to add examples of JusTextExtraction() and ResiliparseExtraction() to download_common_crawl.py?
The only question I have is what you think the best way to go about adding the unit tests is?
Good question. I wouldn't add a unit test for download_and_extract. I would just target each algorithm's extract_text method. You could draft one or two simple HTML pages and make a unit test using each algorithm. Just something to make sure the behavior stays consistent.
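As a rough sketch of the shape such a test could take: a small fixed HTML page plus an assertion on the extracted text. ToyParagraphExtraction below is a hypothetical stand-in built on the standard library so the snippet runs on its own. In the real tests you would swap in JusTextExtraction and ResiliparseExtraction, whose exact extract_text signature and return type are assumptions here and should be checked against the PR.

```python
from html.parser import HTMLParser

# A small fixed HTML page used as the test fixture.
SAMPLE_HTML = """
<html>
  <head><title>Sample</title><script>var x = 1;</script></head>
  <body>
    <nav>Home | About</nav>
    <p>The quick brown fox jumps over the lazy dog.</p>
    <p>A second paragraph of main content.</p>
  </body>
</html>
"""


class ToyParagraphExtraction(HTMLParser):
    """Hypothetical stand-in for an extraction algorithm: keeps only <p> text."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self._paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self._paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        # Text outside <p> (title, script, nav) is ignored.
        if self._in_paragraph:
            self._paragraphs[-1] += data

    def extract_text(self, html):
        # Reset state so the same instance can be reused across pages.
        self.reset()
        self._in_paragraph = False
        self._paragraphs = []
        self.feed(html)
        self.close()
        return [p.strip() for p in self._paragraphs if p.strip()]


def test_extract_text_is_stable():
    # Pin the exact output so a behavior change in the algorithm fails the test.
    result = ToyParagraphExtraction().extract_text(SAMPLE_HTML)
    assert result == [
        "The quick brown fox jumps over the lazy dog.",
        "A second paragraph of main content.",
    ]
```

Pinning the full expected output against a tiny fixture keeps the test fast enough for CI and avoids any network access, unlike running download_common_crawl.py.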
Perhaps it may be sufficient to add examples of JusTextExtraction() and ResiliparseExtraction() to download_common_crawl.py?
It isn't a bad idea to showcase how users can change the algorithm. Do you mind updating the download.rst docs instead? I'm thinking that right after we explain what output_type="jsonl" does, you could add a snippet about the algorithm parameter. Or, you could add another small code block like:
from nemo_curator.download import (
    download_common_crawl,
    ResiliparseExtraction,
)

# Change the extraction algorithm
extraction_algorithm = ResiliparseExtraction()
common_crawl = download_common_crawl(
    "/extracted/output/folder",
    "2020-50",
    "2021-04",
    output_type="jsonl",
    algorithm=extraction_algorithm,
)
Thanks @ryantwolf ! Should be ready now.
Duplicate of https://github.com/NVIDIA/NeMo-Curator/pull/90 with successful DCO check.
Right now, we only support Common Crawl text extraction with jusText. Resiliparse is known to be a faster text extraction algorithm which may also produce better tokens.
This PR adds optional support for the Resiliparse algorithm while still keeping jusText as the default.