Webdevdata / webdevdata.org

Website for reports, etc.

Provide (more) information on the corpus with each release #22

Closed. oli closed this issue 10 years ago.

oli commented 10 years ago

The Fetcher script repo states the October 2013 data set (780 MB, .7z file) is based on the Alexa top 1 million sites, and “Includes approx 78,000 HTML files”.


Regarding exact HTML file count, here’s what I found:

So 82,622 − 621 = 82,001 HTML files? However, a bunch of these files are errors or other content that doesn’t qualify as an HTML page. Here are some file counts based on tags:

HTH!
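For anyone repeating this kind of audit, a short pipeline can reproduce per-extension counts and a rough “real HTML” count. This is only a sketch: the corpus path and the `.txt`-suffixed file naming are assumptions based on the commands mentioned in this thread.

```shell
# Sketch (CORPUS path is an assumption): tally corpus files by extension.
CORPUS=${CORPUS:-./webdevdata}
find "$CORPUS" -type f | sed 's/.*\.//' | sort | uniq -c | sort -rn

# Count files that actually contain an opening <html> tag, since raw
# extension counts include error pages and other non-HTML content.
# -a treats binary files as text so grep doesn't skip them.
grep -rila '<html' "$CORPUS" --include='*.txt' | wc -l
```

Note the tag-based count is still approximate; a file can mention `<html` without being a valid page.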

yoavweiss commented 10 years ago

@oli We're in a phase of transition to a more professional crawling infrastructure. Once we've made that move, I think (hope) most of the issues you're pointing out will be irrelevant. So, on the one hand, I don't think it's worth investing time in fixing the current fetcher. OTOH, hang in there. Things will get better :)

oli commented 10 years ago

:+1: On further contemplation I’m not sure whether it would be good to prune the dreck, as it might still be of use for someone. However, it’d be great to have a way to quickly search only the HTML pages in the corpus…

marcoscaceres commented 10 years ago

@oli, the way I'm selecting only the HTML is by:

find ./ -name "*ml.txt" 

Don't know if that helps, but it's been ok for me so far.

Check out also: https://github.com/Webdevdata/webdevdata-tools

It has some useful tools that actually parse the HTML, hence avoiding some false positives that occur when grepping.
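When a full parser is overkill, one middle ground between a bare `find` and the webdevdata-tools parsers is to require an HTML marker near the start of each file. This is a sketch under the assumptions used elsewhere in this thread (files carry a `.txt` suffix, `CORPUS` points at the unpacked data set):

```shell
# Sketch: list only files that look like real HTML documents, i.e. have a
# doctype or <html tag within the first 512 bytes. This cuts out JSON error
# bodies and binary junk that a filename match alone would keep.
find "${CORPUS:-.}" -name '*ml.txt' -print0 |
while IFS= read -r -d '' f; do
  if head -c 512 "$f" | grep -qiE '<!doctype html|<html'; then
    printf '%s\n' "$f"
  fi
done
```

It's still a heuristic, so files with unusual preambles (long comments, BOMs plus whitespace) could slip through either way; the parsing tools remain the reliable option.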

ernesto-jimenez commented 10 years ago

I'm currently using find in the same way. I'm ignoring the other extensions, since I assume they're anomalies that will be fixed in future versions of the crawler.