facebookresearch / cc_net

Tools to download and cleanup Common Crawl data
MIT License

Cannot download the precomputed files #7

Closed yinfeiy-g closed 4 years ago

yinfeiy-g commented 4 years ago

Hi

I am trying to reproduce the results from your paper. However, after downloading the Common Crawl data from AWS, access to the precomputed files seems to fail.

Did you change the location of the precomputed files?

The error messages look like this:

/.local/lib/python3.7/site-packages/cc_net/jsonql.py:1141: 
UserWarning: Swallowed error HTTPSConnectionPool(host='dl.fbaipublicfiles.com', port=443): Max retries exceeded with url: 
/cc_net/2019-09/en_head_0017.json.gz (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7ff4d87913d0>: 
Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')) while downloading https://dl.fbaipublicfiles.com/cc_net/2019-09/en_head_0017.json.gz (2 out of 3)

gwenzek commented 4 years ago

I can't reproduce this. I think it's a temporary issue. I don't think this AWS bucket has very high quotas.

yinfeiy-g commented 4 years ago

OK, after reducing the parallelism it is working now. Thanks.
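
For readers hitting the same "Temporary failure in name resolution" warnings, here is a minimal sketch of the general idea of capping parallelism and retrying transient failures. It is not cc_net's own download code: `fetch_with_retry`, `download_all`, and the `fetch` callable are hypothetical names used only for illustration, and the sketch assumes network errors surface as `OSError` (which covers `urllib.error.URLError`).

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_with_retry(fetch, url, retries=3, backoff=1.0):
    """Call fetch(url), retrying with exponential backoff on OSError.

    `fetch` is a caller-supplied function (an assumption of this sketch).
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except OSError:
            # Give up and re-raise once the last attempt has failed.
            if attempt == retries - 1:
                raise
            time.sleep(backoff * 2 ** attempt)


def download_all(fetch, urls, max_workers=4, retries=3, backoff=1.0):
    """Fetch every URL with at most `max_workers` requests in flight."""
    def task(url):
        return fetch_with_retry(fetch, url, retries, backoff)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of `urls` in its results.
        return list(pool.map(task, urls))
```

A small `max_workers` keeps the number of simultaneous DNS lookups and connections low. For example, `fetch` could be `lambda url: urllib.request.urlopen(url).read()`.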

To reproduce the results, how much disk space is required?

yinfeiy-g commented 4 years ago

Ping.

I have a machine with 1000 GB of disk space, but I still ran out of disk space. How much is needed here?

gwenzek commented 4 years ago

Hi, if I remember correctly you will need something like 10TB with the current version of the code. I'm currently working on a new release of code and data that will limit the disk usage to what you actually extract, 2TB for all languages.
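
Given those rough figures, a quick pre-flight check can save a long run from failing partway through. This is a minimal sketch using only the standard library. The helper name `has_enough_space` is hypothetical, and the 10 TB and 2 TB constants are just the approximate numbers quoted in this thread, not exact requirements.

```python
import shutil

TB = 10 ** 12  # decimal terabyte, in bytes

# Approximate figures quoted in this thread (assumptions, not exact values).
CURRENT_VERSION_ESTIMATE = 10 * TB
STREAMING_VERSION_ESTIMATE = 2 * TB


def has_enough_space(path, required_bytes):
    """Return True if the filesystem holding `path` has enough free space."""
    return shutil.disk_usage(path).free >= required_bytes


if __name__ == "__main__":
    free_tb = shutil.disk_usage(".").free / TB
    print(f"free space here: {free_tb:.2f} TB")
    print("enough for ~10 TB:", has_enough_space(".", CURRENT_VERSION_ESTIMATE))
    print("enough for ~2 TB:", has_enough_space(".", STREAMING_VERSION_ESTIMATE))
```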

leogao2 commented 4 years ago

Is there a more exact estimate of the current disk space requirements?

gwenzek commented 4 years ago

I have a new version in the dev branch that will do streaming and cut the disk space requirements to only what you keep, so ~2TB. I'll update the issue once I merge this into master.
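
To illustrate what "streaming" buys in general, the sketch below reads a gzip-compressed stream record by record and writes out only the records that pass a filter, so the full input never has to sit on disk. It is not the dev branch implementation. It assumes one JSON object per line, and `iter_records`, `keep_matching`, and the `language` field in the usage note are hypothetical names for illustration.

```python
import gzip
import io
import json


def iter_records(compressed_stream):
    """Yield one decoded JSON object per line of a gzip-compressed stream."""
    with gzip.GzipFile(fileobj=compressed_stream) as gz:
        for line in io.TextIOWrapper(gz, encoding="utf-8"):
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def keep_matching(compressed_stream, out_path, predicate):
    """Write only records accepted by `predicate`; return how many were kept."""
    kept = 0
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        for record in iter_records(compressed_stream):
            if predicate(record):
                out.write(json.dumps(record) + "\n")
                kept += 1
    return kept
```

For example, `keep_matching(stream, "en.json.gz", lambda r: r.get("language") == "en")` would store only the matching records. `stream` can be any binary file-like object, such as an open file or an HTTP response body.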

gwenzek commented 4 years ago

This has been merged in #13.