facebookresearch / seamless_communication

Foundational Models for State-of-the-Art Speech and Text Translation

Common crawl scraping limited + extremely slow #205

Open nrocketmann opened 1 year ago

nrocketmann commented 1 year ago

Hello team, I'm trying to download all the audio and text data associated with the eng-frA split of the Seamless data. My issue is with the text data. When I run the wet_lines script, I get back only 2353 entries before Common Crawl starts returning 503 SlowDown responses. I get the same error when I wget one of the CC URLs directly. After some searching online, it seems that Common Crawl rate-limits requests after a while. With 2.8M unique CC URLs in the eng-frA.tsv file, this makes it impossible to get anywhere close to all the data. Furthermore, it took ~3 hours to get those 2353 entries, so even without the 503 errors this is too slow to fetch all the required data. Any advice on what to do? Thanks!
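For anyone hitting the same 503 SlowDown responses, a common way to cope with server-side throttling is to retry with capped exponential backoff and jitter. The sketch below is only an illustration using the Python standard library. It is not how the wet_lines script works, and the names `backoff_delays` and `fetch_with_backoff` and their default parameters are hypothetical. It also does nothing about the overall download time; it only avoids hammering the server while throttled.

```python
import random
import time
import urllib.error
import urllib.request


def backoff_delays(max_retries=6, base=1.0, cap=60.0):
    """Yield capped exponential delays: base, 2*base, 4*base, ... up to cap."""
    for attempt in range(max_retries):
        yield min(cap, base * 2 ** attempt)


def fetch_with_backoff(url, opener=urllib.request.urlopen, sleep=time.sleep,
                       max_retries=6, base=1.0, cap=60.0):
    """Fetch url, retrying only on HTTP 503 with jittered backoff.

    `opener` and `sleep` are injectable so the retry logic can be
    exercised without network access (an assumption of this sketch).
    """
    for delay in backoff_delays(max_retries, base, cap):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            if err.code != 503:
                raise  # other HTTP errors are not throttling; surface them
            # Jitter spreads retries out so parallel workers do not sync up.
            sleep(delay * random.uniform(0.5, 1.0))
    raise RuntimeError(f"gave up on {url} after {max_retries} attempts")
```

Each attempt that receives a 503 sleeps for a randomized fraction of the next delay before trying again, and the function gives up after `max_retries` attempts.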

gwenzek commented 10 months ago

Unfortunately we don't have a good solution for this. Meta doesn't want to redistribute a subset of CC for legal reasons. You can download it directly from CC, but as you noticed, it's pretty slow because the individual lines are spread out across a gigantic dataset.
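Because the wanted lines are scattered, one thing that can reduce the number of requests is to make sure each remote file is fetched at most once, however many lines are needed from it. A minimal sketch, assuming each row of the TSV can be reduced to a `(file_url, line_key)` pair. That layout is an assumption, not something confirmed in this thread, and `group_by_file` is a hypothetical helper, not part of the repository:

```python
from collections import defaultdict


def group_by_file(entries):
    """Map each file URL to the list of line keys wanted from that file.

    `entries` is an iterable of (file_url, line_key) pairs; the pair
    layout is an assumption made for this sketch.
    """
    wanted = defaultdict(list)
    for file_url, line_key in entries:
        wanted[file_url].append(line_key)
    return dict(wanted)


# Illustrative placeholder values only, not real Common Crawl paths.
rows = [("file-a", "k1"), ("file-b", "k2"), ("file-a", "k3")]
plan = group_by_file(rows)
```

With a plan like this, a downloader would loop over the keys of `plan`, fetch each file once, and extract all of its wanted lines in a single pass. If the script already does this, there is nothing to gain here.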

Maybe someone will patiently download it and republish it in a simpler format, but it's not going to be Meta.