A library for fast web crawling in Clojure.
Start a crawl that fetches at most 10000 pages:
(require '[clojure.java.io :as io]
         '[ramper.instance :as instance])
(instance/start (io/file "seed.txt") (io/file "store-dir") {:max-urls 10000})
It's best to benchmark the machine(s) you are using for a crawl against a local proxy server, so that bandwidth is not the limiting factor when measuring throughput. You can pass a :proxy-url
option in the http-opts
map at initialization time.
(instance/start seed-file store-dir
                {:max-urls 100000
                 :nb-fetchers 32
                 :nb-parsers 10
                 :http-opts {:proxy-url "http://localhost:8080"}})
If you are using a different library for fetching (via the http-get
option), you need to make sure that
this function also routes its requests through a proxy, either via http-opts
or by other means.
To run the crawler against a local graph server, we can use BUbiNG. The following starts a server on port 8080 with 100 million sites, an average page degree of 50, an average depth of 3 and 0.01% of sites being broken.
java -cp bubing-0.9.15.jar:bubing-0.9.15-deps/* -Xmx4G -server it.unimi.di.law.bubing.test.NamedGraphServerHttpProxy -s 100000000 -d 50 -m 3 -t 1000 -D .0001 -A1000 -
The precompiled jars can be found at http://law.di.unimi.it/software/index.php?path=download/.
There is also a babashka script that downloads all of these dependencies and launches the proxy server with the above configuration:
./download_bubing
Ramper comes with a number of options to customize your crawl. These are:
fetch-filter
A filter that is applied to every url before it goes through the sieve. Say you
want to fetch only urls that contain clojure
in their name and use the https scheme.
(require '[ramper.customization :as custom])

(defn clojure-url? [url]
  (clojure.string/index-of url "clojure"))

(instance/start seed-file store-dir {:fetch-filter (every-pred custom/https? clojure-url?)})
schedule-filter
A filter that is applied to every url before the resource gets fetched (just after the sieve). For example, say you want to fetch only a limited number of urls per domain.
(require '[ramper.url :as url])

(defn max-per-domain-filter [max-per-domain]
  (let [domain-to-count (atom {})]
    (fn [url]
      (let [base (url/base url)]
        (when (< (get @domain-to-count base 0) max-per-domain)
          (swap! domain-to-count update base (fnil inc 0))
          true)))))
(instance/start seed-file store-dir {:schedule-filter (max-per-domain-filter 100)})
A max-per-domain-filter
is also provided by the ramper.customization namespace.
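Assuming the built-in version keeps the same name and arity as the hand-rolled one above (an assumption; check the ramper.customization namespace for the exact signature), it can be plugged in directly:

(require '[ramper.customization :as custom])

(instance/start seed-file store-dir {:schedule-filter (custom/max-per-domain-filter 100)})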
store-filter
A filter that is applied before a response is stored. Suppose you want to store only sites that contain the word "clojure".
(require '[clojure.string :as str]
         '[ramper.html-parser :as html])

(defn contains-clojure? [resp]
  (some-> resp :body html/html->text str/lower-case (str/index-of "clojure")))

(instance/start seed-file store-dir {:store-filter contains-clojure?})
follow-filter
In the same vein as above, suppose you only want to continue following the links of a page when it contains the word "clojure".
(instance/start seed-file store-dir {:follow-filter contains-clojure?})
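The two filters can also be combined for a simple focused crawl, storing only matching pages and following links only from them:

(instance/start seed-file store-dir {:store-filter contains-clojure?
                                     :follow-filter contains-clojure?})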
By default the robots.txt standard is followed, meaning the "robots.txt" file is downloaded before any
other content of a host is fetched, and its rules are adhered to. "nofollow" attributes are respected.
Currently robots meta tags of the form <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
are ignored.
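If you need NOINDEX handling despite this, one workaround is to approximate it yourself with a store-filter. The noindex? helper below is hypothetical (not part of ramper) and uses a crude regex check on the raw body; a minimal sketch:

(require '[clojure.string :as str])

;; Hypothetical helper: returns true when the page body contains a robots
;; meta tag whose content mentions NOINDEX (case-insensitive regex check).
(defn noindex? [resp]
  (when-let [body (:body resp)]
    (boolean (re-find #"(?i)<meta\s+name=[\"']robots[\"'][^>]*noindex" body))))

;; Store only pages that do not ask to be excluded from indexing.
(instance/start seed-file store-dir {:store-filter (complement noindex?)})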
When developing, you need to build the Java files once before jacking in:
clojure -T:build java
The tests can be run with:
clojure -X:test
If you want to run specific tests, pass additional arguments to the -X
invocation. See cognitect.test-runner for the options that select which tests to invoke.
clojure -X:test :nses [ramper.workers.parsing-thread-test]
For Java Mission Control to work correctly, you need to set:
echo 1 | sudo tee /proc/sys/kernel/perf_event_paranoid
Distributed under the MIT License. See LICENSE.