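webcrystal is an archiving HTTP proxy and web service, together with an on-disk format for the archives it writes.
webcrystal is intended as a tool for archiving websites. It is also intended as a convenient base on which to write HTTP-based and browser-based web scrapers.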
webcrystal uses urllib3. To install:
pip3 install webcrystal
To start the proxy run a command like:
webcrystal.py 9227 xkcd.wbcr http://xkcd.com/
Then you can visit http://localhost:9227/ to have the same effect as visiting http://xkcd.com/ directly, except that all requests are archived in xkcd.wbcr/.
When you access an HTTP resource through the webcrystal proxy for the first time, it will be fetched from the origin HTTP server and archived locally. All subsequent requests for the same resource will be returned from the archive.
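The archive-on-first-access behavior described above can be modeled with a toy sketch. This is not webcrystal's implementation: the in-memory dictionary and the names `fetch_through_archive` and `fetch_origin` are hypothetical and exist only to illustrate the idea.

```python
from typing import Callable, Dict


def fetch_through_archive(
    url: str,
    archive: Dict[str, bytes],
    fetch_origin: Callable[[str], bytes],
) -> bytes:
    """Return the body for url, fetching and archiving it on first access."""
    if url not in archive:
        # First access: fetch from the origin and store the result.
        archive[url] = fetch_origin(url)
    # Every later access is answered from the archive.
    return archive[url]
```

Calling the function twice with the same URL invokes `fetch_origin` only once; the second call is served from `archive`.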
To start the webcrystal proxy:
webcrystal.py [--help] [--quiet] <port> <archive_dirpath> [<default_origin_domain>]
To stop the proxy press ^C or send a SIGINT signal to it.
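If you want to drive the proxy from a script, the following is a minimal sketch that builds the command line shown above and stops the process with SIGINT. The helper names (`build_command`, `start_proxy`, `stop_proxy`) are hypothetical, and the sketch assumes that `webcrystal.py` is on your PATH and that you are on a POSIX system where SIGINT can be sent to a child process.

```python
import signal
import subprocess
from typing import List, Optional


def build_command(
    port: int,
    archive_dirpath: str,
    default_origin_domain: Optional[str] = None,
    quiet: bool = False,
) -> List[str]:
    """Build the argument list for the webcrystal.py command line."""
    command = ["webcrystal.py"]
    if quiet:
        command.append("--quiet")
    command += [str(port), archive_dirpath]
    if default_origin_domain is not None:
        command.append(default_origin_domain)
    return command


def start_proxy(port: int, archive_dirpath: str, **options) -> subprocess.Popen:
    """Start the proxy as a child process (assumes webcrystal.py is on PATH)."""
    return subprocess.Popen(build_command(port, archive_dirpath, **options))


def stop_proxy(process: subprocess.Popen, timeout: float = 10.0) -> int:
    """Send SIGINT, the scripted equivalent of pressing ^C, and wait for exit."""
    process.send_signal(signal.SIGINT)
    return process.wait(timeout=timeout)
```

For example, `build_command(9227, "xkcd.wbcr", "http://xkcd.com/")` yields the same arguments as the xkcd command shown earlier.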
To see the full usage information, run:
webcrystal.py --help
This outputs:
usage: webcrystal.py [-h] [-q] port archive_dirpath [default_origin_domain]

An archiving HTTP proxy and web service.

positional arguments:
  port                  Port on which to run the HTTP proxy. Suggest 9227
                        (WBCR).
  archive_dirpath       Path to the archive directory. Usually has .wbcr
                        extension.
  default_origin_domain
                        Default HTTP domain which the HTTP proxy will redirect
                        to if no URL is specified.

optional arguments:
  -h, --help            Show this help message and exit.
  -q, --quiet           Suppresses all output.
The HTTP API is the primary API for interacting with the webcrystal proxy.
While the proxy is running, it responds to the following HTTP endpoints.
Notice that GET is an accepted method for all endpoints, so that they can be easily requested using a regular web browser. Browser accessibility is convenient for manual inspection and browser-based website scrapers.
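Most of the endpoints below embed the origin URL in the request path as `http[s]/__PATH__`. The following sketch builds such a URL from an origin URL. It assumes that `__PATH__` consists of the origin host followed by the path and query string, and that the proxy listens on localhost port 9227; the function name `endpoint_url` is hypothetical and not part of webcrystal.

```python
from urllib.parse import urlsplit


def endpoint_url(
    origin_url: str,
    prefix: str = "_",
    proxy_base: str = "http://localhost:9227",
) -> str:
    """Map an origin URL to a proxy endpoint URL.

    Assumption: __PATH__ is the origin host plus path and query string.
    prefix selects the endpoint: "_", "_raw", "_refresh", or "_delete".
    """
    parts = urlsplit(origin_url)
    if parts.scheme not in ("http", "https"):
        raise ValueError("only http and https URLs are supported")
    rest = parts.netloc + (parts.path or "/")
    if parts.query:
        rest += "?" + parts.query
    return "{}/{}/{}/{}".format(proxy_base, prefix, parts.scheme, rest)
```

Under these assumptions, `endpoint_url("http://xkcd.com/1645/")` gives `http://localhost:9227/_/http/xkcd.com/1645/`, and passing `prefix="_raw"` addresses the raw endpoint for the same resource.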
GET,HEAD /
Redirects to the home page of the default origin domain if it was specified at the CLI. Returns:
GET,HEAD /_/http[s]/__PATH__
If in online mode (the default):
a Cache-Control=no-cache request header is specified, or
a Pragma=no-cache request header is specified.
If in offline mode:
POST,GET /_online
Switches the proxy to online mode.
POST,GET /_offline
Switches the proxy to offline mode.
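As a rough illustration, the mode can be switched from a script with a POST request to `/_online` or `/_offline`. The helper names below are hypothetical, and the sketch assumes the proxy is listening on localhost port 9227.

```python
import urllib.request


def mode_request(
    online: bool,
    proxy_base: str = "http://localhost:9227",
) -> urllib.request.Request:
    """Build a POST request for the /_online or /_offline endpoint."""
    path = "/_online" if online else "/_offline"
    # An empty body is sufficient; only the path selects the mode.
    return urllib.request.Request(proxy_base + path, data=b"", method="POST")


def set_online(online: bool, proxy_base: str = "http://localhost:9227") -> int:
    """Send the request to a running proxy and return the HTTP status code."""
    with urllib.request.urlopen(mode_request(online, proxy_base)) as response:
        return response.status
```

Because GET is also accepted, visiting the same URL in a browser has the same effect.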
GET,HEAD /_raw/http[s]/__PATH__
Returns the specified resource from the archive if it is already archived. Nothing about the resource will be rewritten including any URLs that appear in HTTP headers or content. The intent is that the returned resource be as close to the original response from the origin server as is practical.
If the resource is not in the archive, returns:
POST,GET /_refresh/http[s]/__PATH__
Refetches the specified URL from the origin server using the same request headers as the last time it was fetched. Returns:
POST,GET /_delete/http[s]/__PATH__
Deletes the specified URL in the archive. Returns:
When the proxy is started with a command like:
webcrystal.py 9227 website.wbcr
It creates an archive in the directory website.wbcr/ in the following format:
website.wbcr/index.txt
Lists the URL of each archived HTTP resource, one per line, with lines terminated by a newline (\n).
Example:
http://xkcd.com/
http://xkcd.com/s/b0dcca.css
http://xkcd.com/1645/
The preceding example archive contains 3 HTTP resources, numbered #1, #2, and #3.
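The numbering can be reproduced with a short sketch that parses index.txt. It assumes that the URL on line N is resource #N, as in the example above, and that the file is UTF-8 text; the function names are hypothetical.

```python
from pathlib import Path
from typing import Dict


def parse_index(text: str) -> Dict[int, str]:
    """Map resource numbers (1-based line numbers) to archived URLs."""
    return {
        number: url
        for number, url in enumerate(text.split("\n"), start=1)
        if url  # skip the empty entry after the final newline
    }


def read_index(archive_dirpath: str) -> Dict[int, str]:
    """Read and parse index.txt from an archive directory."""
    # Assumption: index.txt is encoded as UTF-8.
    text = Path(archive_dirpath, "index.txt").read_text(encoding="utf-8")
    return parse_index(text)
```

For the example index above, `parse_index` maps 1 to `http://xkcd.com/`, 2 to the CSS file, and 3 to `http://xkcd.com/1645/`.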
website.wbcr/1.request_headers.json
Example:
{"Accept-Language": "en-us", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Host": "xkcd.com", "Accept-Encoding": "gzip, deflate", "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/601.4.4 (KHTML, like Gecko) Version/9.0.3 Safari/601.4.4"}
website.wbcr/1.response_headers.json
Example:
{"Cache-Control": "public", "Connection": "keep-alive", "Accept-Ranges": "bytes", "X-Cache-Hits": "0", "Date": "Tue, 15 Mar 2016 04:37:05 GMT", "Age": "0", "X-Served-By": "cache-sjc3628-SJC", "Content-Type": "text/html", "Server": "lighttpd/1.4.28", "X-Status-Code": "404", "X-Cache": "MISS", "Content-Length": "345", "X-Timer": "S1458016625.375814,VS0,VE148", "Via": "1.1 varnish"}
website.wbcr/1.response_body.dat
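The three files for a resource can be read back with a sketch like the following. It relies only on the file names shown above; the `ArchivedResource` type and the function names are hypothetical, and the JSON files are assumed to be UTF-8 encoded.

```python
import json
from pathlib import Path
from typing import Dict, NamedTuple


class ArchivedResource(NamedTuple):
    request_headers: Dict[str, str]
    response_headers: Dict[str, str]
    response_body: bytes


def resource_paths(archive_dirpath: str, number: int) -> Dict[str, Path]:
    """Return the paths of the three files that describe resource #number."""
    base = Path(archive_dirpath)
    return {
        "request_headers": base / "{}.request_headers.json".format(number),
        "response_headers": base / "{}.response_headers.json".format(number),
        "response_body": base / "{}.response_body.dat".format(number),
    }


def load_resource(archive_dirpath: str, number: int) -> ArchivedResource:
    """Load the headers (JSON) and the body (raw bytes) of one resource."""
    paths = resource_paths(archive_dirpath, number)
    # Assumption: the header files are UTF-8 encoded JSON objects.
    request_headers = json.loads(paths["request_headers"].read_text(encoding="utf-8"))
    response_headers = json.loads(paths["response_headers"].read_text(encoding="utf-8"))
    return ArchivedResource(
        request_headers, response_headers, paths["response_body"].read_bytes()
    )
```

With the example archive, `load_resource("website.wbcr", 1)` would return the headers shown above along with the body bytes.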
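To install the development dependencies:
pip3 install -r dev-requirements.txt
To run the tests:
make test
To gather code coverage metrics and view the report:
make coverage
open htmlcov/index.html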
To build and upload a release with setup.py:
python3 setup.py sdist bdist_wheel upload
This code is provided under the MIT License. See LICENSE file for details.