NVIDIA / NeMo-Curator

Scalable data preprocessing and curation toolkit for LLMs
Apache License 2.0

[FEA] Raise a warning when creating FuzzyDuplicatesConfig with non-empty cache_dir #84

Open randerzander opened 4 months ago

randerzander commented 4 months ago

While debugging a Curator pipeline, I was re-running the same stages multiple times. I was confused when FuzzyDedup succeeded the first time but failed an assertion on every run thereafter:

Stage1: Starting Minhash + LSH computation
/opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/modules/fuzzy_dedup.py:175: UserWarning: Output path /fuzzy_cache/_minhashes.parquet already exists and will be overwritten
  warnings.warn(
/opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/modules/fuzzy_dedup.py:361: UserWarning: Output path /fuzzy_cache/_buckets.parquet already exists and will be overwritten
  warnings.warn(
Stage1: Minhash + LSH complete!
Stage2 (False Postive Check): Starting Map_Buckets
Stage2 (False Postive Check): Map_Buckets Complete!
Stage3 (False Postive Check): Shuffle docs
Traceback (most recent call last):
  File "/scripts/tinystories/main.py", line 300, in <module>
    main()
  File "/scripts/tinystories/main.py", line 296, in main
    run_curation_pipeline(args)
  File "/scripts/tinystories/main.py", line 263, in run_curation_pipeline
    dataset = curation_steps(dataset)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/modules/meta.py", line 22, in __call__
    dataset = module(dataset)
  File "/scripts/tinystories/main.py", line 210, in dedupe
    duplicates = fuzzy_dup(dataset=dataset)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/modules/fuzzy_dedup.py", line 501, in __call__
    self.jaccard_shuffle.shuffle_docs_on_buckets(
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/modules/fuzzy_dedup.py", line 844, in shuffle_docs_on_buckets
    self._batched_merge_and_write(
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/modules/fuzzy_dedup.py", line 895, in _batched_merge_and_write
    assert bucket_part_start_offset % parts_per_bucket_batch == 0
AssertionError
2024-05-27 18:20:54,973 - distributed.scheduler - WARNING - Removing worker 'ucx://127.0.0.1:52217' caused the cluster to lose already computed task(s), which will be recomputed elsewhere: {('get-partition-0-_get_output_part_ids_with_approx_equal_sum-e087b6dc0f16c1875378ff9957bef357', 0)} (stimulus_id='handle-worker-cleanup-1716834054.97311')

This happened because I hadn't cleaned out the cache_dir before re-running. Manually clearing it first resolved my issue.
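A minimal sketch of the manual workaround described above, clearing the cache directory between runs (the helper name `clear_cache_dir` is mine, not part of NeMo Curator):

```python
import shutil
from pathlib import Path

def clear_cache_dir(cache_dir: str) -> None:
    """Remove stale fuzzy-dedup intermediates (e.g. _minhashes.parquet,
    _buckets.parquet) and recreate an empty cache_dir before a re-run."""
    path = Path(cache_dir)
    if path.exists():
        shutil.rmtree(path)
    path.mkdir(parents=True, exist_ok=True)

# e.g. clear_cache_dir("/fuzzy_cache")  # the path from the logs above
```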

It would be nice to check cache_dir on the user's behalf (and warn if it's non-empty) when FuzzyDuplicatesConfig is instantiated.
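A rough sketch of what that check could look like (a hypothetical helper of my own; the actual FuzzyDuplicatesConfig initialization may differ):

```python
import os
import warnings

def warn_on_nonempty_cache_dir(cache_dir: str) -> None:
    # Hypothetical check that FuzzyDuplicatesConfig could run at init time:
    # stale intermediates from a previous run can trip assertions in later
    # stages, so surface a warning as early as possible.
    if cache_dir and os.path.isdir(cache_dir) and os.listdir(cache_dir):
        warnings.warn(
            f"cache_dir {cache_dir!r} is non-empty. Stale intermediate files "
            "from a previous fuzzy dedup run may cause later stages to fail; "
            "consider clearing it before re-running the pipeline."
        )
```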

randerzander commented 4 months ago

This is probably a dupe of #51, but it has some more explicit resolution info, so I'm leaving it here for now.