Common crawl scraping limited + extremely slow

facebookresearch / seamless_communication

Foundational Models for State-of-the-Art Speech and Text Translation


Hello team, I'm trying to download all the audio and text data associated with the eng-frA split of the Seamless data. My issue is with the text data.

When I run the wet_lines script, I get back only 2353 entries before Common Crawl starts returning 503 SlowDown responses. I get the same error when I wget a single CC URL directly. From some searching online, it seems that Common Crawl throttles clients once their request rate gets too high.

The eng-frA.tsv file contains 2.8M unique CC URLs, so this makes it impossible to get anywhere close to all the data. Furthermore, it took ~3 hours to get those 2353 entries, so even without the 503 errors this would be far too slow to fetch everything.

Any advice on what to do? Thanks!
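For reference, here is a minimal sketch of the kind of retry wrapper that is usually suggested for 503 SlowDown responses: exponential backoff with jitter. It is not part of the wet_lines script and not an official Common Crawl recommendation; `fetch_with_backoff`, `backoff_delays`, and all the default values are hypothetical names and numbers chosen for illustration. It only spaces out retries and does not fix the overall throughput problem.

```python
import random
import time
import urllib.error
import urllib.request


def backoff_delays(max_retries=6, base=1.0, cap=60.0):
    """Yield capped exponential delays: base, 2*base, 4*base, ... up to cap."""
    for attempt in range(max_retries):
        yield min(cap, base * 2 ** attempt)


def fetch_with_backoff(url, fetch=None, sleep=time.sleep,
                       max_retries=6, base=1.0, cap=60.0):
    """Fetch url, retrying on HTTP 503 with exponential backoff plus jitter.

    `fetch` and `sleep` are injectable so the retry logic can be exercised
    without touching the network (an assumption made for this sketch).
    """
    if fetch is None:
        # Default fetcher: plain urllib GET that returns the response body.
        fetch = lambda u: urllib.request.urlopen(u, timeout=60).read()

    for delay in backoff_delays(max_retries, base, cap):
        try:
            return fetch(url)
        except urllib.error.HTTPError as err:
            if err.code != 503:
                raise  # only throttling responses are retried
            # Random jitter keeps parallel workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))

    # Final attempt; a remaining 503 propagates to the caller.
    return fetch(url)
```

Each URL would then go through `fetch_with_backoff(url)` instead of a bare request, with the defaults tuned to whatever rate the server tolerates.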


Common crawl scraping limited + extremely slow #205