Closed by yinfeiy-g 4 years ago
Can't reproduce. I think it's a temporary issue? I don't think this AWS bucket has very high quotas.
OK, I reduced the parallelism and it is working now. Thanks.
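For anyone else hitting the same throttling, here is a minimal sketch of what "reducing the parallelism" can look like: cap the number of concurrent downloads and retry with backoff. This is not code from the repo. `fetch_all`, `fetch_with_retry`, and the `fetch` callable are hypothetical names, and you would pass in whatever function actually downloads one file.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_with_retry(fetch, url, retries=3, base_delay=1.0):
    """Call fetch(url); on failure, wait and retry with exponential backoff.

    `fetch` is a hypothetical user-supplied download function.
    `retries` is the total number of attempts and should be >= 1.
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** attempt)


def fetch_all(fetch, urls, max_workers=4, retries=3, base_delay=1.0):
    """Download all urls with at most `max_workers` requests in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of `urls` in the results.
        return list(
            pool.map(
                lambda u: fetch_with_retry(fetch, u, retries, base_delay),
                urls,
            )
        )
```

Lowering `max_workers` is the knob that matters if the bucket is rate limiting you. The backoff only helps with occasional transient failures.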
To reproduce the results, how much disk space is required?
Ping.
I have a machine with 1000 GB of disk space, but I still ran out of disk space. I'm wondering how much is needed here.
Hi, if I remember correctly, you will need something like 10TB with the current version of the code. I'm currently working on a new release of the code and data that will limit disk usage to what you actually extract: 2TB for all languages.
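If it helps, a quick way to check whether a machine has enough free space before starting a run is sketched below. It is a generic snippet using only the standard library, not part of the repo. `has_enough_space` is a hypothetical helper, and the 10TB and 2TB figures are just the rough estimates quoted in this thread.

```python
import shutil

TB = 10 ** 12  # decimal terabyte, in bytes


def has_enough_space(path, required_bytes):
    """Return True if the filesystem holding `path` has enough free bytes."""
    free = shutil.disk_usage(path).free
    return free >= required_bytes


if __name__ == "__main__":
    # Rough estimates from this thread, not exact requirements.
    for label, needed in [("current code", 10 * TB), ("streaming version", 2 * TB)]:
        ok = has_enough_space(".", needed)
        print(f"{label}: need ~{needed // TB}TB, enough free space: {ok}")
```

Point `path` at the directory where the data will be written, since that may sit on a different disk than the working directory.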
Is there a more exact estimate of the current disk space requirements?
I have a new version in the dev branch that does streaming and cuts the disk space requirements to only what you keep, so ~2TB. I'll update the issue once I merge this into master.
This has been merged in #13.
Hi
I am trying to reproduce the results from your paper. However, after downloading the Common Crawl data from AWS, access to the precomputed files seems to fail.
Did you change the location of the precomputed files?
The error messages are shown below: