StellaAthena closed this 3 years ago
All of 2011 is scraped and saved using lm_dataformat. Scraping of 01/2012 just finished and came out to 570 MB. The scraper now uses a 60-second timeout, so it is basically set and forget.
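For anyone following along, here is a rough sketch of what a 60-second per-request timeout can look like. It is not the actual scraper code: `fetch`, `TIMEOUT_SECONDS`, and the injectable `opener` parameter are hypothetical names used for illustration, and it uses only the standard library.

```python
import socket
import urllib.error
import urllib.request

# Assumption: 60 seconds, matching the timeout mentioned in this thread.
TIMEOUT_SECONDS = 60


def fetch(url, timeout=TIMEOUT_SECONDS, opener=urllib.request.urlopen):
    """Return the response body as bytes, or None if the request fails.

    `opener` is a hypothetical hook so the function can be exercised
    without network access; by default it is urllib.request.urlopen.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, socket.timeout, TimeoutError, ValueError):
        # Dead hosts, slow servers, and malformed URLs are skipped so that
        # one bad link cannot stall an unattended run.
        return None
```

Returning `None` rather than raising lets the calling loop move on to the next URL, which is what makes a long scrape "set and forget". A real scraper would probably also log the failures.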
@researcher2 how long do you expect the scraping to take?
Based off current trends, weeks.
I am now running an extra instance on Hetzner but the machine is maxed out. Estimate to complete 2012 is 1-2 more days. Similar for 2013. 2014 and 2015 likely 2-3 days each. 2016 3-4 days. 2017 5-7 days. 2018 7-9 days. 2019 11-13 days.
bmk has offered to run the scraper on multiple servers. He is starting with 2014-2016.
2012 will be done in 3.5 hours.
I have merged in the Reddit metadata and filtered on a combined submission score of 3 or greater. I am currently generating minhashes for the remaining content on 3 different boxes, which should take a few days. I have reduced the number of hash functions used by minhash to 10, in line with the OpenAI implementation. After that I will set up Cassandra on the Eleuther Hetzner machine to perform the minhash LSH dedupe.
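To make the minhash step concrete, here is a minimal standard-library sketch of 10-function minhash signatures plus LSH banding. It is an illustration only, not the pipeline's code. The 5-word shingles, the salted BLAKE2b hashing, the 5 bands of 2 rows, and all function names are assumptions, and an in-memory dict stands in for the Cassandra-backed lookup.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 10  # the hash-function count mentioned above


def shingles(text, n=5):
    """Set of n-word shingles; n=5 is an assumption for illustration."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(seed, shingle):
    # Salting BLAKE2b with the seed gives one hash function per seed.
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text, num_hashes=NUM_HASHES):
    """Minimum hash value over all shingles, once per hash function."""
    doc_shingles = shingles(text)
    return tuple(
        min(_hash(seed, s) for s in doc_shingles) for seed in range(num_hashes)
    )


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(signatures, bands=5):
    """LSH banding: documents sharing any whole band become candidates.

    `signatures` maps document id -> signature. The band count is a
    hypothetical choice, not a value taken from this thread.
    """
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for band in range(bands):
            key = (band, sig[band * rows:(band + 1) * rows])
            buckets[key].add(doc_id)
    pairs = set()
    for doc_ids in buckets.values():
        pairs.update(combinations(sorted(doc_ids), 2))
    return pairs
```

With only 10 hash functions the similarity estimate is coarse (steps of 0.1), so candidate pairs would normally be checked against a threshold before anything is dropped.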
Priority: medium