[Closed] oli closed this issue 10 years ago
@oli We're in a phase of transition to a more professional crawling infrastructure. Once we complete it, I think (hope) most of the issues you're pointing out will be irrelevant. So, on the one hand, I don't think it's worth investing time in fixing the current fetcher. OTOH, hang in there. Things will get better :)
:+1: On further contemplation I’m not sure whether it would be good to prune the dreck, as it might still be of use for someone. However, it’d be great to have a way to quickly search only the HTML pages in the corpus…
@oli, the way I'm restricting to only the HTML files is:
find ./ -name "*ml.txt"
Don't know if that helps, but it's been ok for me so far.
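To actually search inside just those pages (rather than only list them), the same `find` can feed `grep`. A sketch, assuming the fetcher's `*ml.txt` naming convention; the `<video` query string is just an example, not from the thread:

```shell
# List only the HTML/XML pages in the corpus that contain a given tag.
# -l: print matching filenames only; -i: case-insensitive.
find . -name "*ml.txt" -exec grep -li '<video' {} +
```

Using `-exec ... {} +` batches many filenames into each `grep` invocation, which is much faster than `-exec ... \;` on a corpus of ~80,000 files.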
Also check out: https://github.com/Webdevdata/webdevdata-tools
It has some useful tools that actually parse the HTML, hence avoiding some false positives that occur when grepping.
I'm currently using find in the same way too. I'm ignoring the other extensions since I assumed those are anomalies to be fixed in future versions of the crawler.
The Fetcher script repo states the October 2013 data set (780 MB .7z file) is based on the top 1 million Alexa sites and "Includes approx 78,000 HTML files".
Regarding the exact HTML file count, here's what I found. Besides the .hdr.txt (header) files and the .html.txt files, the data set contains files with the following extensions: ascii, assembler, c, c++, data, empty, exported, gif, gzip, iso-8859, jpeg, little-endian, minix, non-iso, pascal, php, png, sendmail, smile, troff, utf-8, very, xml.
The .gzip files appear to be actual gzipped HTML files, so their content won't show up when queried. The 281 .empty files appear to all be empty, and the seven images appear to be images. So 82,622 - 621 = 82,001 HTML files? However, a bunch of these files are errors or other content that doesn't qualify as an HTML page. Here are some file counts based on tags:
<!doctype occurs on 76,585 pages
<div occurs on 77,807 pages
<html occurs on 79,072 pages
<body occurs on 79,176 pages ← what I decided to use

HTH!
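For what it's worth, per-tag counts like those can be gathered with the same `find`/`grep` pattern. A sketch under the `*ml.txt` naming assumption; `-a` forces grep to treat oddly encoded files as text so they aren't silently skipped as binary:

```shell
# Count how many corpus pages contain a <body> tag (case-insensitive).
find . -name "*ml.txt" -exec grep -lia '<body' {} + | wc -l
```

Swap `'<body'` for `'<!doctype'`, `'<div'`, or `'<html'` to reproduce the other counts.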