Trogluddite opened 9 months ago
I've completed this manually, using a process like this:
# setup
export NUTCH_HOME=/home/mpappas/apache-nutch-1.19/runtime/local
cd $NUTCH_HOME
vim urls/seed.txt #added 118 Wikipedia pages (chemical elements)
export JAVA_HOME='/usr/lib/jvm/default-java'
# run in screen because the crawl will take a few hours
screen
# do the work
bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
s1=$(ls -d crawl/segments/2* | tail -1) # most recent segment
bin/nutch fetch $s1
bin/nutch parse $s1
bin/nutch updatedb crawl/crawldb $s1
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ $s1 -filter -normalize -deleteGone
bin/nutch dedup crawl/crawldb -group domain
bin/nutch clean crawl/crawldb/
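The steps above can be wrapped into a reusable script that repeats the generate/fetch/parse/updatedb cycle for several rounds, which is the usual Nutch pattern for growing a crawl beyond the seed pages. This is only a sketch: the `DRY_RUN` and `ROUNDS` variables are my additions (not part of Nutch), and by default the script just prints the commands it would run; set `DRY_RUN=0` inside `$NUTCH_HOME` to actually execute them.

```shell
#!/usr/bin/env bash
# Sketch: the single-segment sequence above, generalized to N rounds.
# Assumes the $NUTCH_HOME layout from the setup step. By default
# (DRY_RUN=1) it only echoes the commands; DRY_RUN=0 runs them.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only (safe default)
ROUNDS="${ROUNDS:-2}"     # number of generate/fetch/parse/updatedb rounds

nutch() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "bin/nutch $*"
  else
    bin/nutch "$@"
  fi
}

nutch inject crawl/crawldb urls
for i in $(seq 1 "$ROUNDS"); do
  nutch generate crawl/crawldb crawl/segments
  if [ "$DRY_RUN" = "1" ]; then
    s="crawl/segments/DRY_RUN_$i"              # placeholder segment name
  else
    s=$(ls -d crawl/segments/2* | tail -1)     # most recent segment
  fi
  nutch fetch "$s"
  nutch parse "$s"
  nutch updatedb crawl/crawldb "$s"
done
nutch invertlinks crawl/linkdb -dir crawl/segments
nutch index crawl/crawldb/ -linkdb crawl/linkdb/ "$s" -filter -normalize -deleteGone
nutch dedup crawl/crawldb -group domain
nutch clean crawl/crawldb/
```

With `ROUNDS=1` and `DRY_RUN=1` this reproduces exactly the command sequence listed above, so it can be sanity-checked before committing to a multi-hour crawl in `screen`.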
To limit the scope and increase the relevance of our results, we want to search across a restricted subset of data; the Wikipedia pages for the chemical elements seem like a good place to start.
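One way to enforce that scope during the crawl itself is Nutch's URL filtering: `conf/regex-urlfilter.txt` is evaluated top to bottom, and the first matching `+` (accept) or `-` (reject) rule decides a URL's fate. The specific patterns below are an assumption on my part, sketched to keep the fetcher on English Wikipedia article pages (and away from `Special:`, `File:`, and query-string URLs) while rejecting everything else:

```
# conf/regex-urlfilter.txt (sketch) -- first matching rule wins
# skip non-http schemes
-^(file|ftp|mailto):
# accept en.wikipedia.org article pages without namespaces or queries
+^https?://en\.wikipedia\.org/wiki/[^:?]+$
# reject everything else
-.
```

Since `bin/nutch index` above is already run with `-filter`, tightening these rules also keeps previously fetched out-of-scope pages out of the index.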