facebookresearch / cc_net

Tools to download and cleanup Common Crawl data

Running on local files #22

Closed sashavor closed 3 years ago

sashavor commented 3 years ago

Hi, is it possible to mine and analyze local WET files, without downloading from AWS?

Thanks!

sidsvash26 commented 3 years ago

I would also like to do this. I have a local Common Crawl raw.xz file that I want to run the cleaning script on. Any help with running the cleaning script on it would be appreciated.

gwenzek commented 3 years ago

The logic that opens a segment (a .warc.wet.gz file) is located at https://github.com/facebookresearch/cc_net/blob/495959803205a7f10680fe45f1625f76d3c406b8/cc_net/process_wet_file.py#L184-L192. As you can see, it always downloads from S3 unless the file is found in the cache_dir. You can set cache_dir in the config, but be careful: any segment not found there will be downloaded again.
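If you already have the segments locally, one option along these lines is to pre-seed cache_dir before running the pipeline. This is a minimal sketch, assuming the cache lookup matches files by the segment's basename; the paths are placeholders, and you should verify the naming convention against the open_segment code linked above.

```python
# Hypothetical sketch: copy (or symlink) local segments into cache_dir
# so the pipeline finds them there instead of downloading from S3.
# Assumes cache_dir entries are keyed by the segment basename, e.g.
# CC-MAIN-...-00000.warc.wet.gz -- check open_segment to confirm.
import shutil
from pathlib import Path

LOCAL_SEGMENTS = Path("/data/my_wet_files")  # your local .warc.wet.gz files
CACHE_DIR = Path("/data/cc_net_cache")       # same path as cache_dir in the config

CACHE_DIR.mkdir(parents=True, exist_ok=True)
for seg in LOCAL_SEGMENTS.glob("*.warc.wet.gz"):
    target = CACHE_DIR / seg.name
    if not target.exists():
        shutil.copy(seg, target)  # os.symlink also works and avoids copying
```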

Otherwise, you can modify open_segment to open the files you want. Does that help? Do you have more precise questions?
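A minimal sketch of the kind of modification meant here: have open_segment look in a local directory first and only fall back to the S3 download otherwise. The local directory and the fallback behavior below are assumptions for illustration; the real function at the linked lines handles the S3 path, so adapt this to that code rather than replacing it wholesale.

```python
# Hypothetical sketch: resolve a segment from a local directory before
# falling back to the original S3 download in process_wet_file.py.
import gzip
from pathlib import Path
from typing import Iterable

LOCAL_DIR = Path("/data/my_wet_files")  # where your local segments live (placeholder)

def open_segment(segment: str) -> Iterable[str]:
    local_file = LOCAL_DIR / Path(segment).name
    if local_file.exists():
        # Read the local .warc.wet.gz directly, yielding lines of text.
        with gzip.open(local_file, mode="rt", encoding="utf-8") as f:
            yield from f
        return
    # Fall back to the original behavior (download from S3); see the
    # linked open_segment for the real implementation.
    raise FileNotFoundError(f"{segment} not found locally; keep the S3 path here")
```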

sashavor commented 3 years ago

Yes, that helps! Thank you :)

rongjingyue423 commented 1 year ago

How did you solve it?