Closed sashavor closed 3 years ago
I would also like to do this. I have a local Common Crawl raw.xz
file that I want to run the cleaning script on. Any help with running the cleaning script on it would be appreciated.
The logic that opens a segment (a .warc.wet.gz file) is located at https://github.com/facebookresearch/cc_net/blob/495959803205a7f10680fe45f1625f76d3c406b8/cc_net/process_wet_file.py#L184-L192
As you can see, it always downloads from S3 unless the file is found in the cache_dir.
You can set cache_dir
in the config, but be careful: any segment not found there will be downloaded again.
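For example, a config override for the mining pipeline might look roughly like this (a minimal sketch; the exact field names and accepted values come from cc_net's Config dataclass and may differ in your version, so check the JSON files under config/ in the repo):

```json
{
  "dump": "2019-09",
  "cache_dir": "/data/cc_wet_cache"
}
```

With this, any segment whose .warc.wet.gz file is already present under /data/cc_wet_cache should be read from disk instead of being fetched from S3; segments missing from that directory will still be downloaded.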
Otherwise, you can modify open_segment
to open the files you want.
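A replacement along those lines could be sketched as follows. This is an assumption-laden illustration, not cc_net's actual code: the function name open_local_segment, the directory layout, and matching segments by file name alone are all choices made here for the example; the real open_segment in process_wet_file.py may take different arguments.

```python
import gzip
import io
from pathlib import Path


def open_local_segment(segment: str, local_dir: Path) -> io.TextIOWrapper:
    """Open a local WET segment (a .warc.wet.gz file) as a text stream,
    instead of downloading it from S3.

    `segment` is the segment identifier, e.g. a path ending in
    ``....warc.wet.gz``; here we match only on the file name inside
    `local_dir` (an assumption about how you stored the files).
    """
    path = local_dir / Path(segment).name
    if not path.exists():
        raise FileNotFoundError(f"segment not found locally: {path}")
    # WET files are gzip-compressed UTF-8 text, so decompress on the fly.
    return io.TextIOWrapper(gzip.open(path, "rb"), encoding="utf-8")
```

You could then monkeypatch or edit process_wet_file.py so that the segment-opening code calls this helper first and only falls back to the S3 download when the file is absent locally.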
Does that help you? Do you have more precise questions?
Yes, that helps! Thank you :)
How did you solve it?
Hi, is it possible to mine and analyze local WET files without downloading from AWS?
Thanks!