dakrone / itsy

A threaded web-spider written in Clojure
181 stars 30 forks source link

Itsy

A threaded web spider, written in Clojure.

Usage

In your project.clj:

[itsy "0.1.1"]

In your project:

(ns myns.foo
  (:require [itsy.core :refer :all]))

(defn my-handler [{:keys [url body]}]
  (println url "has a count of" (count body)))

(def c (crawl {;; initial URL to start crawling at (required)
               :url "http://aoeu.com"
               ;; handler to use for each page crawled (required)
               :handler my-handler
               ;; number of threads to use for crawling, (optional,
               ;; defaults to 5)
               :workers 10
               ;; number of urls to spider before crawling stops, note
               ;; that workers must still be stopped after crawling
               ;; stops. May be set to -1 to specify no limit.
               ;; (optional, defaults to 100)
               :url-limit 100
               ;; function to use to extract urls from a page, a
               ;; function that takes one argument, the body of a page.
               ;; (optional, defaults to itsy's extract-all)
               :url-extractor extract-all
               ;; http options for clj-http, (optional, defaults to
               ;; {:socket-timeout 10000 :conn-timeout 10000 :insecure? true})
               :http-opts {}
               ;; specifies whether to limit crawling to a single
               ;; domain. If false, does not limit domain, if true,
               ;; limits to the same domain as the original :url, if set
               ;; to a string, limits crawling to the hostname of the
               ;; given url
               :host-limit false
               ;; polite crawlers obey robots.txt directives
               ;; by default this crawler is polite
               :polite? true}))

;; ... crawling ensues ...

(thread-status c)
;; returns a map of thread-id to Thread.State:
{33 #<State RUNNABLE>, 34 #<State RUNNABLE>, 35 #<State RUNNABLE>,
 36 #<State RUNNABLE>, 37 #<State RUNNABLE>, 38 #<State RUNNABLE>,
 39 #<State RUNNABLE>, 40 #<State RUNNABLE>, 41 #<State RUNNABLE>,
 42 #<State RUNNABLE>}

(add-worker c)
;; adds an additional thread worker to the pool

(remove-worker c)
;; removes a worker from the pool

(stop-workers c)
;; stop-workers will return a collection of all threads it failed to
;; stop (it should be able to stop all threads unless something goes
;; very wrong)

Upon completion, c will contain state that allows you to see what happened:

(clojure.pprint/pprint (:state c))
;; URLs still in the queue
{:url-queue #<LinkedBlockingQueue []>,
;; URLs that were seen/queued
 :url-count #<Atom@67d6b87e: 2>,
 ;; running worker threads (will contain thread objects while crawling)
 :running-workers #<Ref@decdc7b: []>,
 ;; canaries for running worker threads
 :worker-canaries #<Ref@397f1661: {}>,
 ;; a map of URL to times seen/extracted from the body of a page
 :seen-urls
 #<Atom@469657c4:
   {"http://www.phpbb.com" 1,
    "http://pagead2.googlesyndication.com/pagead/show_ads.js" 2,
    "http://www.subBlue.com/" 1,
    "http://www.phpbb.com/" 1,
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" 1,
    "http://www.w3.org/1999/xhtml" 1,
    "http://forums.asdf.com" 1,
    "http://www.google.com/images/poweredby_transparent/poweredby_000000.gif" 1,
    "http://asdf.com" 1,
    "http://www.google.com/cse/api/branding.css" 1,
    "http://www.google.com/cse" 1}>}

Features

Included handlers

Itsy includes handlers for common actions, either to be used, or examples for writing your own.

Text file handler

The text file handler stores web pages in text files. It uses the html->str method in itsy.extract to convert HTML documents to plain text (which in turn uses Tika to extract HTML to plain text).

Usage:

(ns bar
  (:require [itsy.core :refer :all]
            [itsy.handlers.textfiles :refer :all]))

;; The directory will be created when the handler is created if it
;; doesn't already exist
(def txt-handler (make-textfile-handler {:directory "/mnt/data" :extension ".txt"}))

(def c (crawl {:url "http://example.com" :handler txt-handler}))

;; then look in the /mnt/data directory

ElasticSearch handler

The elasticsearch handler stores documents with the following mapping:

{:id {:type "string"
      :index "not_analyzed"
      :store "yes"}
 :url {:type "string"
       :index "not_analyzed"
       :store "yes"}
 :body {:type "string"
        :store "yes"}}

Usage:

(ns foo
  (:require [itsy.core :refer :all]
            [itsy.handlers.elasticsearch :refer :all]))

;; These are the default settings
(def index-settings {:settings
                     {:index
                      {:number_of_shards 2
                       :number_of_replicas 0}}})

;; If the ES index doesn't exist, make-es-handler will create it when called.
(def es-handler (make-es-handler {:es-url "http://localhost:9200/"
                                  :es-index "crawl"
                                  :es-type "page"
                                  :es-index-settings index-settings
                                  :http-opts {}}))

(def c (crawl {:url "http://example.com" :handler es-handler}))

;; ... crawling and indexing ensues ...

Todo

License

Copyright © 2012 Lee Hinman

Distributed under the Eclipse Public License, the same as Clojure.