Trogluddite / loombreaker

Tools for building Topic-Specific Web Indexes (CS-480 Capstone)

configure Nutch to crawl a specific set of Wikipedia articles (elements) #43

Open Trogluddite opened 9 months ago

Trogluddite commented 9 months ago

To limit the crawl's scope and increase the relevance of our results, we want to search across a limited, curated subset of the web.

The Wikipedia articles on the chemical elements seem like a good place to start.
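
One way to enforce that scope during the crawl is Nutch's URL filtering. Below is a minimal sketch of conf/regex-urlfilter.txt (the stock filter file under $NUTCH_HOME) that accepts only English Wikipedia article pages and rejects everything else; the exact patterns are an assumption for illustration, not what the repo currently ships:

# conf/regex-urlfilter.txt (sketch; rules are checked top-down, first match wins)
# skip URLs with characters that usually indicate dynamic or session pages
-[?*!@=]
# accept only English Wikipedia article pages
+^https?://en\.wikipedia\.org/wiki/
# reject everything else
-.

With -. as the final rule, outlinks off en.wikipedia.org are never fetched, and since the process below runs only a single generate/fetch round, the crawl stays on the seeded pages.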

Trogluddite commented 9 months ago

I've completed this manually, using a process like this:

# setup
export NUTCH_HOME=/home/mpappas/apache-nutch-1.19/runtime/local
cd $NUTCH_HOME
vim urls/seed.txt  # added 118 Wikipedia pages (chemical elements)
export JAVA_HOME='/usr/lib/jvm/default-java'

# run in screen because the crawl will take a few hours
screen

# do the work
bin/nutch inject crawl/crawldb urls               # load the seed URLs into the crawl database
bin/nutch generate crawl/crawldb crawl/segments   # create a fetch list in a new segment
s1=`ls -d crawl/segments/2* | tail -1`            # capture the path of the newest segment
bin/nutch fetch $s1                               # download the pages on the fetch list
bin/nutch parse $s1                               # parse the fetched content
bin/nutch updatedb crawl/crawldb $s1              # fold parse results back into the crawldb
bin/nutch invertlinks crawl/linkdb -dir crawl/segments   # build the link database of inbound links
bin/nutch index crawl/crawldb -linkdb crawl/linkdb $s1 -filter -normalize -deleteGone   # send documents to the configured index back end
bin/nutch dedup crawl/crawldb -group domain       # mark duplicate documents in the crawldb, grouped by domain
bin/nutch clean crawl/crawldb                     # purge gone/duplicate documents from the index
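
For reproducibility, it might be worth generating urls/seed.txt rather than hand-editing it. A hypothetical sketch, assuming a file elements.txt (one article title per line, e.g. Hydrogen, Helium) that is not part of the repo:

# hypothetical: elements.txt holds one Wikipedia article title per line
while read -r element; do
  echo "https://en.wikipedia.org/wiki/${element}"
done < elements.txt > urls/seed.txt
wc -l urls/seed.txt   # sanity check: expect 118 lines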