Webdevdata / webdevdata.org

Website for reports, etc.

Document methodology #5

Closed zcorpan closed 11 years ago

zcorpan commented 11 years ago

Please document how the data was gathered, where the list of "top X sites" comes from, which downloads were included and which were rejected, if redirects are followed, what request headers are used, and so forth. Without knowing what the data is and how it is biased it is hard to reason about and draw conclusions from it.

stevefaulkner commented 11 years ago

For the December 12 data set: the list of top sites is the Alexa Top 1,000,000 Sites (updated daily), http://s3.amazonaws.com/alexa-static/top-1m.csv.zip. I took the first 50,000 URLs from that list and used HTTrack website copier (http://www.httrack.com/) to capture the HTML files of the home pages for each URL in the list. The initial pass was somewhat affected by redirects, so I went through the error log and collected a second list of URLs from the captured pages that had resulted in "page has moved" files. Any that redirected again in the second pass I removed, ending up with approx. 35,000 home pages.
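The two-pass selection described above can be sketched roughly as follows. This is only an illustration, not the actual process: the capture itself was done with HTTrack, which is replaced here by a hypothetical `fetch` callable so the logic can be followed without network access. It also assumes the Alexa CSV has `rank,site` rows and that home pages are requested as `http://<site>/`; the function names and sample data are made up for the example.

```python
import csv
import io


def first_n_sites(csv_text, n):
    """Return the first n site names from CSV text with 'rank,site' rows (assumed layout)."""
    sites = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        sites.append(row[1].strip())
        if len(sites) == n:
            break
    return sites


def capture_home_pages(sites, fetch):
    """Two-pass capture.

    fetch(url) is a hypothetical stand-in for the HTTrack capture and
    returns ("ok", html), ("moved", new_url), or ("error", None).
    """
    captured = {}
    moved = []
    # First pass: capture each home page, noting "page has moved" results.
    for site in sites:
        url = "http://" + site + "/"  # assumed URL form
        status, payload = fetch(url)
        if status == "ok":
            captured[url] = payload
        elif status == "moved":
            moved.append(payload)
    # Second pass: retry the redirect targets once; drop any that redirect again.
    for url in moved:
        status, payload = fetch(url)
        if status == "ok":
            captured[url] = payload
    return captured


# Made-up sample data standing in for top-1m.csv and the live web.
SAMPLE_CSV = "1,a.example\n2,b.example\n3,c.example\n4,d.example\n"
FAKE_WEB = {
    "http://a.example/": ("ok", "<html>a</html>"),
    "http://b.example/": ("moved", "http://www.b.example/"),
    "http://www.b.example/": ("ok", "<html>b</html>"),
    "http://c.example/": ("moved", "http://c2.example/"),
    "http://c2.example/": ("moved", "http://c3.example/"),
}


def fake_fetch(url):
    return FAKE_WEB.get(url, ("error", None))


sites = first_n_sites(SAMPLE_CSV, 3)
pages = capture_home_pages(sites, fake_fetch)
print(sorted(pages))
```

In this toy run, `c.example` redirects twice and is dropped, which mirrors how the set shrank from 50,000 URLs to roughly 35,000 home pages.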

marcoscaceres commented 11 years ago

@stevefaulkner, if possible, could you please clean up the above a little bit and we can just add it to the front page of webdevdata.org.

@zcorpan would that be sufficient to close the bug?

stevefaulkner commented 11 years ago

yeah will do

zcorpan commented 11 years ago

Yes.

yoavweiss commented 11 years ago

Documentation sucks. Scripts FTW :) https://github.com/yoavweiss/webdevdata.org/commit/33a7aaa439a26a797d20583973d993f3ca144937

marcoscaceres commented 11 years ago

heh. It's still nice for people who land on our site to know what the scripts are doing.

yoavweiss commented 11 years ago

I was kidding... Documentation coming right up.

On Apr 29, 2013 10:23 PM, "Marcos Caceres" wrote:

heh. It's still nice for people who land on our site to know what the scripts are doing.


yoavweiss commented 11 years ago

Should be resolved with https://github.com/yoavweiss/webdevdata.org/commit/6053bf758a17ebc51a5742a20e521d53fd7c2ab9 and https://github.com/yoavweiss/webdevdata.org/commit/88f5141c435301b5c480e771b13050609da9f8a3

zcorpan commented 11 years ago

LGTM