Webdevdata / fetcher

Tool to download website data.
The Unlicense

Fetcher

Scripts used to fetch the HTML files from top Alexa sites.

Methodology

Usage

If you're on Linux or OS X, run ./getData.sh and you should be good to go. If you're on Windows, Cygwin may be your best bet.

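If you want to fetch pages other than Alexa's top sites, put the URLs in a file, one per line, and pipe them to downloadr.py, for example: cat urls.txt | xargs -I % -n 1 -P64 ./downloadr.py download % webdevdata.org-2013-12-06-200358/ Here xargs substitutes each URL for %, and -P64 runs up to 64 downloads in parallel.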
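For readers who would rather drive the downloads from Python than from xargs, the following is a minimal sketch of the same idea. It assumes only what the command above shows: that downloadr.py is invoked as ./downloadr.py download URL DIRECTORY. The helper names (build_command, read_urls, fetch_all) and the injectable runner parameter are hypothetical and are not part of this repository.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(url, out_dir, script="./downloadr.py"):
    """Return the argv list for one download, mirroring the xargs call."""
    # Assumption: argument order is taken from the example command above.
    return [script, "download", url, out_dir]


def read_urls(path):
    """Read one URL per line, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def fetch_all(urls, out_dir, workers=64, runner=subprocess.call):
    """Run one downloadr.py process per URL, at most `workers` at a time.

    Returns a list of (url, exit_code) pairs in input order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Threads are enough here: each one just waits on a child process.
        codes = list(pool.map(lambda u: runner(build_command(u, out_dir)), urls))
    return list(zip(urls, codes))
```

A call such as fetch_all(read_urls("urls.txt"), "webdevdata.org-2013-12-06-200358/") would then correspond to the xargs pipeline, with the added benefit that the exit code of each download is collected.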

Dependencies

If you use virtualenv, you can install the required Python package locally:

Whenever you want to run this script, use:

If you use autoenv, the activation step is done automatically when you enter the directory.

Results

The resulting directory structure is:

The data files have an ".html.txt" extension, and the header files have an ".html.hdr.txt" extension.
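When post-processing the results, it can help to match each data file with its header file. The sketch below walks a results directory and pairs them by name. It assumes that a data file and its header file live in the same directory and share everything before the extension (for example, a hypothetical example.html.txt next to example.html.hdr.txt); the README does not state this explicitly, and pair_results is an illustrative name, not part of the repository.

```python
import os

DATA_SUFFIX = ".html.txt"
HEADER_SUFFIX = ".html.hdr.txt"


def pair_results(root):
    """Walk `root` and map each data file path to its header path (or None)."""
    pairs = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        names = set(filenames)
        for name in sorted(names):
            # Header files end in ".hdr.txt", so they never match DATA_SUFFIX.
            if not name.endswith(DATA_SUFFIX):
                continue
            # Assumption: the header file shares the data file's base name.
            header = name[: -len(DATA_SUFFIX)] + HEADER_SUFFIX
            data_path = os.path.join(dirpath, name)
            if header in names:
                pairs[data_path] = os.path.join(dirpath, header)
            else:
                pairs[data_path] = None
    return pairs
```

Data files that map to None have no matching header file, which may point to an interrupted download.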

Queries

A Java-based script is available to get statistics on HTML tags and attributes with CSS-like queries.

See the Queries on WebDevData wiki.