EleutherAI / the-pile


WebText 2: high quality webpages scraped from Reddit links #6

Closed StellaAthena closed 3 years ago

StellaAthena commented 4 years ago

Priority: medium

researcher2 commented 4 years ago

All of 2011 is scraped and saved using lm_dataformat. Scraping just finished 01/2012, which came out to 570 MB. The scraper now uses a timeout of 60 seconds, so it's basically set and forget.

StellaAthena commented 4 years ago

@researcher2 how long do you expect the scraping to take?

researcher2 commented 4 years ago

Based on current trends, weeks.

I am now running an extra instance on Hetzner, but the machine is maxed out. The estimate to complete 2012 is 1-2 more days, and similar for 2013. 2014 and 2015 will likely take 2-3 days each, 2016 3-4 days, 2017 5-7 days, 2018 7-9 days, and 2019 11-13 days.

bmk has offered to run the scraper on multiple servers. He is starting with 2014-2016.

researcher2 commented 3 years ago

2012 will be done in 3.5 hours.

researcher2 commented 3 years ago

I have merged in the Reddit metadata and filtered on a combined submission score of 3 or greater. I'm currently generating minhashes for the remaining content on 3 different boxes; this should take a few days. I have reduced the number of hash functions used by MinHash to 10, in line with the OpenAI implementation. After that I will set up Cassandra on the Eleuther Hetzner box to perform the MinHash LSH dedupe.
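To illustrate the idea (the actual pipeline presumably uses a library such as datasketch with a Cassandra backend), here is a self-contained sketch of MinHash with 10 hash functions plus LSH banding; the seeding scheme and banding parameters below are illustrative assumptions, not the pipeline's real configuration:

```python
import hashlib
import struct

NUM_PERM = 10  # matches the 10 hash functions mentioned above

def _hash(token, seed):
    # 64-bit hash of the token, varied by seed to simulate independent hash functions
    digest = hashlib.md5(f"{seed}:{token}".encode()).digest()
    return struct.unpack("<Q", digest[:8])[0]

def minhash_signature(tokens, num_perm=NUM_PERM):
    # Signature slot s = minimum of hash function s over the document's token set
    return tuple(min(_hash(t, s) for t in set(tokens)) for s in range(num_perm))

def estimated_jaccard(sig_a, sig_b):
    # Fraction of matching slots is an unbiased estimate of Jaccard similarity
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def lsh_buckets(signatures, bands=5, rows=2):
    # Split each 10-slot signature into 5 bands of 2 rows; documents sharing
    # any full band land in the same bucket and become dedupe candidates
    buckets = {}
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, sig[b * rows:(b + 1) * rows])
            buckets.setdefault(key, set()).add(doc_id)
    return buckets
```

With only 10 hash functions the Jaccard estimate is coarse, but for dedupe the goal is just to flag near-identical documents cheaply, so a small signature trades accuracy for speed and storage.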