Closed zcorpan closed 11 years ago
For the December 12 data set: the list of top sites is the Alexa Top 1,000,000 Sites (updated daily), http://s3.amazonaws.com/alexa-static/top-1m.csv.zip. I took the first 50,000 URLs from that list, then used the HTTrack website copier (http://www.httrack.com/) to capture the HTML files of the home pages for each URL. The initial pass was somewhat affected by redirects, so I went through the error log and collected a second list of URLs from the captured pages that had resulted in "page has moved" files. Any URLs that redirected again in the second pass I removed, ending up with approximately 35,000 home pages.
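A rough sketch of the URL-list step described above, for anyone wanting to reproduce it. The `rank,domain` layout of the Alexa CSV is an assumption based on the public file, and the HTTrack flags shown in the comment are illustrative, not the exact invocation used:

```shell
# Stand-in for top-1m.csv (the real file, fetched from the URL above,
# has 1,000,000 rows in "rank,domain" form -- an assumed layout).
printf '1,example.com\n2,example.org\n3,example.net\n' > top-1m.csv

# Keep the first N entries (50,000 in the real run; 2 here for the demo)
# and prefix each domain with a scheme so HTTrack can fetch it.
head -n 2 top-1m.csv | cut -d, -f2 | sed 's|^|http://|' > urls.txt
cat urls.txt

# The capture step would then be something along these lines
# (illustrative flags; depth 1 keeps it to home pages only):
#   httrack --list urls.txt --depth=1 -O ./capture
```

Running this prints `http://example.com` and `http://example.org`, which is the shape of list HTTrack was fed.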
@stevefaulkner, if possible, could you please clean up the above a little bit and we can just add it to the front page of webdevdata.org.
@zcorpan would that be sufficient to close the bug?
yeah will do
Yes.
Documentation sucks. Scripts FTW :) https://github.com/yoavweiss/webdevdata.org/commit/33a7aaa439a26a797d20583973d993f3ca144937
heh. It's still nice for people who land on our site to know what the scripts are doing.
I was kidding... Documentation coming right up.
LGTM
Please document how the data was gathered: where the list of "top X sites" comes from, which downloads were included and which were rejected, whether redirects are followed, what request headers are used, and so forth. Without knowing what the data is and how it is biased, it is hard to reason about it or draw conclusions from it.