While debugging a Curator pipeline, I was re-running the same stages multiple times. I was confused when FuzzyDedup succeeded on the first run but failed an assertion on every run thereafter:
Stage1: Starting Minhash + LSH computation
/opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/modules/fuzzy_dedup.py:175: UserWarning: Output path /fuzzy_cache/_minhashes.parquet already exists and will be overwritten
  warnings.warn(
/opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/modules/fuzzy_dedup.py:361: UserWarning: Output path /fuzzy_cache/_buckets.parquet already exists and will be overwritten
  warnings.warn(
Stage1: Minhash + LSH complete!
Stage2 (False Postive Check): Starting Map_Buckets
Stage2 (False Postive Check): Map_Buckets Complete!
Stage3 (False Postive Check): Shuffle docs
Traceback (most recent call last):
  File "/scripts/tinystories/main.py", line 300, in <module>
    main()
  File "/scripts/tinystories/main.py", line 296, in main
    run_curation_pipeline(args)
  File "/scripts/tinystories/main.py", line 263, in run_curation_pipeline
    dataset = curation_steps(dataset)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/modules/meta.py", line 22, in __call__
    dataset = module(dataset)
  File "/scripts/tinystories/main.py", line 210, in dedupe
    duplicates = fuzzy_dup(dataset=dataset)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/modules/fuzzy_dedup.py", line 501, in __call__
    self.jaccard_shuffle.shuffle_docs_on_buckets(
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/modules/fuzzy_dedup.py", line 844, in shuffle_docs_on_buckets
    self._batched_merge_and_write(
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/modules/fuzzy_dedup.py", line 895, in _batched_merge_and_write
    assert bucket_part_start_offset % parts_per_bucket_batch == 0
AssertionError
2024-05-27 18:20:54,973 - distributed.scheduler - WARNING - Removing worker 'ucx://127.0.0.1:52217' caused the cluster to lose already computed task(s), which will be recomputed elsewhere: {('get-partition-0-_get_output_part_ids_with_approx_equal_sum-e087b6dc0f16c1875378ff9957bef357', 0)} (stimulus_id='handle-worker-cleanup-1716834054.97311')
This happened because I hadn't cleaned out the cache_dir before re-running; deleting it manually first resolved the issue.
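For reference, this is roughly the cleanup I now run before each retry (a minimal sketch; /fuzzy_cache is the cache_dir from my run above, adjust to yours):

import shutil
from pathlib import Path

cache_dir = Path("/fuzzy_cache")  # same cache_dir passed to FuzzyDuplicatesConfig

# Remove leftover intermediates from a previous run (_minhashes.parquet,
# _buckets.parquet, ...), then recreate the empty directory.
if cache_dir.exists():
    shutil.rmtree(cache_dir)
cache_dir.mkdir(parents=True, exist_ok=True)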
It would be nice to check the cache_dir (and warn if it is non-empty) on the user's behalf when the FuzzyDuplicatesConfig gets instantiated.
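Something along these lines would have saved me the debugging session. This is only a rough sketch of the proposed check, not actual NeMo Curator code; I haven't looked at how FuzzyDuplicatesConfig is structured internally:

import os
import warnings

def _warn_on_nonempty_cache(cache_dir: str) -> None:
    # Proposed check: if the cache directory already holds files from a
    # previous run, stale intermediates can trip assertions later in the
    # pipeline, so surface that to the user up front.
    if os.path.isdir(cache_dir) and os.listdir(cache_dir):
        warnings.warn(
            f"cache_dir {cache_dir} is non-empty. Stale intermediates from a "
            "previous run can cause FuzzyDuplicates to fail; consider "
            "clearing it before running the pipeline."
        )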