Trogluddite / loombreaker

Tools for building Topic-Specific Web Indexes (CS-480 Capstone)

configure Nutch to crawl a specific set of Wikipedia articles (elements) #43

Open Trogluddite opened 9 months ago

Trogluddite commented 9 months ago

To limit the crawl's scope and increase the relevance of our results, we want to search across a limited, curated subset of the web.

The Wikipedia articles on the chemical elements seem like a good place to start.
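
One way to enforce that scope during the crawl is Nutch's URL filtering. Below is a minimal sketch of conf/regex-urlfilter.txt (the stock filter file under $NUTCH_HOME) that accepts only English Wikipedia article pages and rejects everything else; the exact patterns are an assumption for illustration, not what the repo currently ships:

# conf/regex-urlfilter.txt (sketch; rules are checked top-down, first match wins)
# skip URLs with characters that usually indicate dynamic or session pages
-[?*!@=]
# accept only English Wikipedia article pages
+^https?://en\.wikipedia\.org/wiki/
# reject everything else
-.

With -. as the final rule, outlinks off en.wikipedia.org are never fetched, and since the process below runs only a single generate/fetch round, the crawl stays on the seeded pages.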

Trogluddite commented 9 months ago

I've completed this manually, using a process like this:

# setup
export NUTCH_HOME=/home/mpappas/apache-nutch-1.19/runtime/local
cd $NUTCH_HOME
vim urls/seed.txt  # added 118 Wikipedia pages (chemical elements)
export JAVA_HOME='/usr/lib/jvm/default-java'

# run in screen because the crawl will take a few hours
screen

# do the work
bin/nutch inject crawl/crawldb urls               # load the seed URLs into the crawl database
bin/nutch generate crawl/crawldb crawl/segments   # create a fetch list in a new segment
s1=`ls -d crawl/segments/2* | tail -1`            # capture the path of the newest segment
bin/nutch fetch $s1                               # download the pages on the fetch list
bin/nutch parse $s1                               # parse the fetched content
bin/nutch updatedb crawl/crawldb $s1              # fold parse results back into the crawldb
bin/nutch invertlinks crawl/linkdb -dir crawl/segments   # build the link database of inbound links
bin/nutch index crawl/crawldb -linkdb crawl/linkdb $s1 -filter -normalize -deleteGone   # send documents to the configured index back end
bin/nutch dedup crawl/crawldb -group domain       # mark duplicate documents in the crawldb, grouped by domain
bin/nutch clean crawl/crawldb                     # purge gone/duplicate documents from the index
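
For reproducibility, it might be worth generating urls/seed.txt rather than hand-editing it. A hypothetical sketch, assuming a file elements.txt (one article title per line, e.g. Hydrogen, Helium) that is not part of the repo:

# hypothetical: elements.txt holds one Wikipedia article title per line
while read -r element; do
  echo "https://en.wikipedia.org/wiki/${element}"
done < elements.txt > urls/seed.txt
wc -l urls/seed.txt   # sanity check: expect 118 lines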