Open nrocketmann opened 1 year ago
Unfortunately we don't have a good solution for this. Meta doesn't want to redistribute a subset of CC for legal reasons. You can download it directly from CC, but as you noticed, it's pretty slow because the individual lines are spread out across a gigantic dataset.
Maybe someone will patiently download it and republish it in a simpler format, but it's not going to be Meta.
Hello team, I'm trying to download all the audio and text data associated with the eng-frA split of the Seamless data. My issue is with the text data. When I run the wet_lines script, after getting back only 2353 entries, I start getting 503 SlowDown responses from Common Crawl. I get the same error when I try to just wget one of the CC urls. After some searching online, it seems that Common Crawl limits your request rate at some point. Considering there are 2.8M unique CC urls in the eng-frA.tsv file, this makes it impossible to get anywhere close to all the data. Furthermore, it took ~3 hours to get those 2353 entries, so even without the 503 errors, this is too slow to get all the required data. Any advice on what to do? Thanks!
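For anyone hitting the same 503 SlowDown responses, here is a minimal sketch of retrying a single URL with exponential backoff, using only the Python standard library. The names fetch_with_backoff and backoff_delays are hypothetical and not part of the wet_lines script, and the delay values are assumptions rather than documented Common Crawl limits.

```python
import random
import time
import urllib.error
import urllib.request


def backoff_delays(max_retries=5, base_delay=1.0, max_delay=60.0):
    """Yield capped exponential delays: base, 2*base, 4*base, ..."""
    for attempt in range(max_retries):
        yield min(max_delay, base_delay * (2 ** attempt))


def fetch_with_backoff(url, opener=urllib.request.urlopen, sleep=time.sleep,
                       max_retries=5, base_delay=1.0, max_delay=60.0):
    """Fetch url, retrying on HTTP 503 with exponential backoff and jitter.

    The default delays are illustrative assumptions, not Common Crawl policy.
    """
    for delay in backoff_delays(max_retries, base_delay, max_delay):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            if err.code != 503:
                raise  # only 503 is treated as retryable in this sketch
            # Up to 10% random jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    # Final attempt after the last wait; any error propagates to the caller.
    with opener(url) as response:
        return response.read()
```

This only helps with the rate-limit errors, not with overall throughput: at the reported rate of 2353 entries in about 3 hours, 2.8M URLs would take roughly 3,570 hours, or about 149 days.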