facebookresearch / cc_net

Tools to download and cleanup Common Crawl data
MIT License

Decrease RAM usage, investigate missing documents #6

Closed — gwenzek closed this issue 4 years ago

gwenzek commented 4 years ago

@gwenzek I've written a post with tips on how to recreate this on GCP. Basically, using an S3 or Google Cloud bucket and mounting it as a disk will save you a lot of storage fees.

_Originally posted by @theblackcat102 in https://github.com/facebookresearch/cc_net/issues/2#issuecomment-599174158_
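For readers landing here, a minimal sketch of that setup, assuming gcsfuse is installed on the VM; the bucket name and mount point below are placeholders, not anything defined by cc_net:

```python
# Mount a Google Cloud Storage bucket as a local directory so cc_net can
# write its large intermediate files there instead of to a persistent disk.
# Assumes gcsfuse is installed; bucket and mount point are placeholders.
import subprocess
from pathlib import Path

bucket = "my-cc-bucket"            # hypothetical bucket name
mount_point = Path("/mnt/cc_net")  # hypothetical mount point

mount_point.mkdir(parents=True, exist_ok=True)
# "gcsfuse BUCKET MOUNTPOINT" exposes the bucket as a regular directory.
subprocess.run(["gcsfuse", bucket, str(mount_point)], check=True)
```

The same idea works for S3 buckets via s3fs; reads and writes then go to object storage, which is typically much cheaper than keeping a large persistent disk attached.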

gwenzek commented 4 years ago

@theblackcat102 thanks for the blog post and your feedback. It's very interesting.

I think the current state of the "unminify" code is not satisfactory. I made the mistake of not sorting the documents in my snapshot in the same order as in the original CC snapshot. I thought I could work around this with RAM caching, but that's not the right approach. I have more ideas to decrease RAM usage that I'll try in the upcoming weeks.
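To make that direction concrete, here is a rough sketch of the grouping idea: stream each original WET file once instead of caching documents in RAM. The field names ("cc_segment", "digest") and the file layout are assumptions for illustration, not the actual cc_net format:

```python
# Sketch: group minified documents by the CC segment they came from, so the
# "unminify" step can read each WET file exactly once and only keep one
# segment's documents in memory at a time. Field names are illustrative.
import gzip
import json
from collections import defaultdict

def group_by_segment(minified_path):
    """Map cc_segment -> list of minified docs found in that segment."""
    by_segment = defaultdict(list)
    with gzip.open(minified_path, "rt") as f:
        for line in f:
            doc = json.loads(line)
            by_segment[doc["cc_segment"]].append(doc)
    return by_segment

# Each segment can then be unminified independently, which bounds RAM usage
# by the size of a single segment rather than the whole snapshot.
```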

> Total corpus size is 334.71 GB of compressed data, not a lot compared to the CCNet paper results

For English, for instance, I find about 6 times as much data as you do. This is weird: I have some integration tests which run on a small portion of the corpus and didn't observe significant document misses. I'll look into it.
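If someone wants to quantify misses on their own shard, a quick check along these lines works, assuming (hypothetically) that both the minified and the reconstructed files are gzipped JSON lines carrying a "digest" field; file names are placeholders:

```python
# Count how many minified documents were not recovered after unminification.
# File names and the "digest" field are placeholders for illustration.
import gzip
import json

def digests(path):
    with gzip.open(path, "rt") as f:
        return {json.loads(line)["digest"] for line in f}

expected = digests("minified.json.gz")
recovered = digests("reconstructed.json.gz")
print(f"missing {len(expected - recovered)} of {len(expected)} documents")
```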

gwenzek commented 4 years ago

I've merged #13, which reduces the disk usage of the "reproduce" code to only what remains at the end. Memory usage is also way down. Let me know if you observe more issues.

theblackcat102 commented 4 years ago

@gwenzek I was in a bit of a hurry last time because of the limited free GCP credits, so I will revisit this sometime later. Reduced memory is certainly an improvement, since I had to resort to a large disk swap to save costs. Thanks!