[Closed] oli closed this issue 10 years ago
@oli We're in a phase of transition to a more professional crawling infrastructure. Once we complete it, I think (hope) most of the issues you're pointing out will be irrelevant. So, on the one hand, I don't think it's worth investing time in fixing the current fetcher. OTOH, hang in there. Things will get better :)
:+1: On further contemplation I’m not sure whether it would be good to prune the dreck, as it might still be of use for someone. However, it’d be great to have a way to quickly search only the HTML pages in the corpus…
@oli, the way I'm restricting to only the HTML files is:
find ./ -name "*ml.txt"
Don't know if that helps, but it's been ok for me so far.
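To actually search inside just those pages (rather than only list them), the same `find` can feed `grep`. A sketch, assuming the fetcher's `*ml.txt` naming convention; the `<video` query string is just an example, not from the thread:

```shell
# List only the HTML/XML pages in the corpus that contain a given tag.
# -l: print matching filenames only; -i: case-insensitive.
find . -name "*ml.txt" -exec grep -li '<video' {} +
```

Using `-exec ... {} +` batches many filenames into each `grep` invocation, which is much faster than `-exec ... \;` on a corpus of ~80,000 files.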
Also check out: https://github.com/Webdevdata/webdevdata-tools
It has some useful tools that actually parse the HTML, hence avoiding some false positives that occur when grepping.
I'm currently using find in the same way too. I'm ignoring the other extensions since I assumed those are anomalies to be fixed in future versions of the crawler.
The Fetcher script repo states the October 2013 data set (780 MB .7z file) is based on the top 1 million Alexa sites and "Includes approx 78,000 HTML files".
Regarding the exact HTML file count, here's what I found. Besides the .hdr.txt (header) files and the .html.txt files, the data set contains files with the following extensions: ascii, assembler, c, c++, data, empty, exported, gif, gzip, iso-8859, jpeg, little-endian, minix, non-iso, pascal, php, png, sendmail, smile, troff, utf-8, very, xml.
The .gzip files appear to be actual gzipped HTML files, so their content won't show up when queried. The 281 .empty files appear to all be empty, and the seven images appear to be images. So 82,622 - 621 = 82,001 HTML files? However, a bunch of these files are errors or other content that doesn't qualify as an HTML page. Here are some file counts based on tags:
<!doctype occurs on 76,585 pages
<div occurs on 77,807 pages
<html occurs on 79,072 pages
<body occurs on 79,176 pages ← what I decided to use

HTH!
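For what it's worth, per-tag counts like those can be gathered with the same `find`/`grep` pattern. A sketch under the `*ml.txt` naming assumption; `-a` forces grep to treat oddly encoded files as text so they aren't silently skipped as binary:

```shell
# Count how many corpus pages contain a <body> tag (case-insensitive).
find . -name "*ml.txt" -exec grep -lia '<body' {} + | wc -l
```

Swap `'<body'` for `'<!doctype'`, `'<div'`, or `'<html'` to reproduce the other counts.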