Provide a regular corpus of the top 10,000 Alexa sites

Webdevdata / webdevdata.org

Website for reports, etc.


Provide a regular corpus of the top 10,000 Alexa sites #23

Open oli opened 10 years ago

oli commented 10 years ago

I’ve noticed that coding standards in the top 10,000 corpus are wildly different from (and follow considerably more “best practices” than) those in the general “top million sites” corpus. It’d be good to maintain two sets of data: a “general web” sampling of the top million Alexa sites, plus as many as possible of the top 10,000 Alexa sites.

marcoscaceres commented 10 years ago

I'm wondering if having the top 100,000 would give us basically the same data. As you said, the 10,000 would give us, um, "best practice"... while the remaining 90,000 would give us "HTML as she is spoke" by everyone else.
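The split described here could be sketched roughly as follows. This is only an illustration: it assumes the ranked list arrives as `rank,domain` rows, and the function name `split_tiers`, the tier labels, and the sample domains are all made up for the example.

```python
import csv
import io


def split_tiers(rows, head=10_000, mid=100_000):
    """Bucket (rank, domain) rows into three rank tiers.

    The cut-offs follow the thread: the top 10,000, the next 90,000,
    and everything ranked below 100,000.
    """
    tiers = {"top_10k": [], "next_90k": [], "rest": []}
    for rank, domain in rows:
        rank = int(rank)
        if rank <= head:
            tiers["top_10k"].append(domain)
        elif rank <= mid:
            tiers["next_90k"].append(domain)
        else:
            tiers["rest"].append(domain)
    return tiers


# Tiny stand-in for a ranked list (hypothetical data, assumed "rank,domain" layout).
sample = io.StringIO(
    "1,example.com\n"
    "10000,example.org\n"
    "10001,example.net\n"
    "100001,example.edu\n"
)
tiers = split_tiers(csv.reader(sample))
```

Keeping the cut-offs as parameters makes it easy to try other boundaries (say, 5,000 or 50,000) without touching the rest of the pipeline.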

The question is whether those 90,000 sites are representative enough of the top 1 million, and whether the top 10K is overly representative. I'm not sure how we would measure that.
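One possible way to approach the measurement, offered as a rough sketch rather than an agreed method: record which markup features each crawled site uses, compute the share of sites using each feature within a tier, and look at the gap between tiers. The feature names, the sample data, and the helpers `usage_rates` and `rate_gaps` are hypothetical.

```python
from collections import Counter


def usage_rates(sites):
    """Return {feature: fraction of sites using it} for one tier.

    `sites` is a list with one collection of feature names per site.
    """
    if not sites:
        raise ValueError("need at least one site")
    counts = Counter()
    for features in sites:
        counts.update(set(features))  # count each feature once per site
    return {feature: n / len(sites) for feature, n in counts.items()}


def rate_gaps(tier_a, tier_b):
    """Absolute difference in per-feature usage rate between two tiers."""
    a, b = usage_rates(tier_a), usage_rates(tier_b)
    return {f: abs(a.get(f, 0.0) - b.get(f, 0.0)) for f in a.keys() | b.keys()}


# Hypothetical per-site feature sets for two tiers.
top_10k = [{"doctype", "viewport"}, {"doctype"},
           {"doctype", "viewport"}, {"doctype", "viewport"}]
next_90k = [{"doctype"}, {"font"}, {"doctype", "font"}, {"viewport"}]

gaps = rate_gaps(top_10k, next_90k)
```

Small gaps between the 90,000 tier and a sample of the full million would suggest the smaller crawl is a reasonable stand-in; large gaps between the top 10K and everything else would support keeping it as a separate corpus. With real sample sizes, a significance test on each proportion would be a natural next step.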