Closed mome1024 closed 2 years ago
So there are legal reasons that prevent us from distributing the Commoncrawl data directly. You need to download it from them.
What the reproduce script does is download from Commoncrawl, keep only the highest-quality documents, and add the metadata we computed (language and a perplexity score). So it doesn't need a lot of CPU; you're more limited by network bandwidth and disk write speed. No GPU was used to create cc_net, nor is any required to download it.
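As a rough illustration of the "keep only the highest-quality documents" step: cc_net stores one JSON document per line with computed metadata such as language and a language-model perplexity score (lower perplexity means closer to Wikipedia-like text). The field names and threshold below are assumptions for illustration, not the pipeline's exact implementation:

```python
import json

def keep_high_quality(lines, language="en", max_perplexity=500.0):
    """Hedged sketch: keep documents in the target language whose
    perplexity is below a threshold (the "head" of the quality split).
    Field names ("language", "perplexity") are assumed for illustration."""
    kept = []
    for line in lines:
        doc = json.loads(line)
        if doc.get("language") != language:
            continue  # wrong language
        if doc.get("perplexity", float("inf")) > max_perplexity:
            continue  # too noisy: high perplexity under the reference LM
        kept.append(doc)
    return kept

docs = [
    json.dumps({"language": "en", "perplexity": 120.0, "raw_content": "clean text"}),
    json.dumps({"language": "en", "perplexity": 2400.0, "raw_content": "noisy text"}),
    json.dumps({"language": "fr", "perplexity": 90.0, "raw_content": "texte"}),
]
print(len(keep_high_quality(docs)))  # → 1
```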
Run "python -m cc_net --config reproduce --dump 2019-09"
I get a 403 error — can I still copy your output? metadata='https://dl.fbaipublicfiles.com/cc_net/1.0.0'
I don't have enough CPU and GPU resources to complete the mining process. I want to copy the output data of cc_net directly; what should I do?