facebookresearch / cc_net

Tools to download and cleanup Common Crawl data
MIT License

I want to copy the output data of CC_net directly, what should I do? #30

Closed. mome1024 closed this issue 2 years ago.

mome1024 commented 2 years ago

I ran `python -m cc_net --config reproduce --dump 2019-09`.

The request to metadata='https://dl.fbaipublicfiles.com/cc_net/1.0.0' fails with a 403 error. Can I still copy your output?


I don't have enough CPU and GPU resources to complete the mining process. I want to copy the output data of cc_net directly. What should I do?
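(Editor's note, not part of the original thread: one way to tell whether the 403 affects only the bare metadata URL, which may simply have directory listing disabled, or the individual files cc_net actually fetches, is to issue a HEAD request against both. This sketch is not part of cc_net, and the per-snapshot file name below is only an illustrative guess, not a confirmed path from the release.)

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError

BASE = "https://dl.fbaipublicfiles.com/cc_net/1.0.0"

def head(url: str) -> int:
    """Return the HTTP status code from a HEAD request to `url`."""
    req = Request(url, method="HEAD")
    try:
        with urlopen(req, timeout=30) as resp:
            return resp.status
    except HTTPError as err:
        return err.code

if __name__ == "__main__":
    # The bare base URL may return 403 simply because directory listing is disabled.
    print("base URL:", head(BASE))
    # Hypothetical per-snapshot file name, for illustration only.
    print("sample file:", head(f"{BASE}/2019-09/en_head_0000.json.gz"))
```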

gwenzek commented 2 years ago

So there are legal reasons that prevent us from distributing the Common Crawl data directly. You need to download it from them.

What the reproduce script does is download the data from Common Crawl, keep only the highest-quality documents, and add to them the metadata we computed (language and a perplexity score). So it doesn't need a lot of CPU; you're more limited by network bandwidth and disk write speed. No GPU was used to create cc_net, and none is required to download it.
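(Editor's note, not part of the original thread: below is a rough conceptual sketch, not the actual cc_net code, of that merge step. Re-downloaded Common Crawl documents are kept only if they appear in the released metadata, which then supplies the language tag and perplexity score. The digest-keyed dictionary format here is an assumption for illustration, not the real on-disk schema.)

```python
from typing import Dict, Iterable, Iterator

def merge_metadata(
    documents: Iterable[dict],   # raw documents pulled from Common Crawl
    metadata: Dict[str, dict],   # precomputed metadata, keyed by document digest (assumed)
) -> Iterator[dict]:
    """Keep only documents present in the metadata and enrich them with it."""
    for doc in documents:
        meta = metadata.get(doc["digest"])
        if meta is None:
            # Document did not pass the quality filter: drop it.
            continue
        doc["language"] = meta["language"]
        doc["perplexity"] = meta["perplexity"]
        yield doc

# Toy usage example: only the first document survives the merge.
docs = [
    {"digest": "sha1:AAA", "raw_content": "High quality text ..."},
    {"digest": "sha1:BBB", "raw_content": "Spammy text ..."},
]
meta = {"sha1:AAA": {"language": "en", "perplexity": 312.5}}
print(list(merge_metadata(docs, meta)))
```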