@theblackcat102 thanks for the blog post and your feedback. It's very interesting.
I think the current state of the "unminify" code is not satisfying. I made the mistake of not sorting the documents in my snapshot in the same order as in the original CC snapshot. I thought I could work around this with RAM caching, but that's not the right approach. I have more ideas for decreasing RAM usage that I'll try in the upcoming weeks.
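Roughly, the fix would be to re-sort the minified documents back into CC snapshot order so the unminify step can stream through both files in lockstep. A minimal sketch is below; the `cc_segment` and `line_ids` field names are just placeholders for whatever the minified documents actually record, not necessarily the real schema:

```python
import json
from pathlib import Path


def sort_minified(minified_file: Path, out_file: Path) -> None:
    """Re-sort minified documents into original CC snapshot order.

    Assumes (placeholder names) that each JSON line carries a
    `cc_segment` and `line_ids` pointing back to its location in the
    WET files. Once the order matches the snapshot, unminification can
    read the segments sequentially instead of caching them in RAM.
    """
    docs = [json.loads(line) for line in minified_file.open()]
    # Order by segment first, then by position inside that segment.
    docs.sort(key=lambda d: (d["cc_segment"], min(d["line_ids"])))
    with out_file.open("w") as o:
        for doc in docs:
            o.write(json.dumps(doc) + "\n")
```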
> Total corpus size is 334.71 GB compressed, not a lot compared to the CCNet paper's results
For English, for instance, I get about 6x as much data as you do. This is weird; I have some integration tests which run on a small portion of the corpus and didn't observe significant document misses. I'll look into it.
I've merged #13, which reduces the disk usage of the "reproduce" code to only what remains at the end. Memory usage is also way down. Let me know if you observe more issues.
@gwenzek Since I was in a bit of a hurry last time because of the limited free GCP credits, I will revisit this sometime later. The reduced memory usage is certainly an improvement, since I had to resort to a large disk swap to save cost. Thanks!
@gwenzek I have written a post with tips on how to recreate this on GCP. Basically, using an S3 or Google Cloud Storage bucket and mounting it as a disk will save you a lot in storage fees.
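For reference, the mount step can be scripted roughly like this. This is only a sketch: it assumes gcsfuse is installed on the VM and the VM's service account has access to the bucket, and the bucket and mount-point names are placeholders:

```python
import subprocess
from pathlib import Path

BUCKET = "my-ccnet-data"          # placeholder bucket name
MOUNT_POINT = Path("/mnt/ccnet")  # placeholder mount point


def mount_bucket() -> None:
    """Mount a GCS bucket as a local directory via gcsfuse.

    The pipeline can then write its intermediate and final files into
    MOUNT_POINT, so they live in the bucket instead of an expensive
    persistent disk.
    """
    MOUNT_POINT.mkdir(parents=True, exist_ok=True)
    subprocess.run(["gcsfuse", BUCKET, str(MOUNT_POINT)], check=True)


def unmount_bucket() -> None:
    """Unmount the bucket (Linux FUSE unmount)."""
    subprocess.run(["fusermount", "-u", str(MOUNT_POINT)], check=True)


if __name__ == "__main__":
    mount_bucket()
    # ... run the cc_net pipeline with its data dir pointed at MOUNT_POINT ...
    unmount_bucket()
```

Reads and writes through a FUSE mount are slower than a local SSD, so in my experience this trade-off mostly makes sense for large files that are only scanned sequentially.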
_Originally posted by @theblackcat102 in https://github.com/facebookresearch/cc_net/issues/2#issuecomment-599174158_