Trogluddite / loombreaker

Tools for building Topic-Specific Web Indexes (CS-480 Capstone)
MIT License

Install & Configure Nutch #23

Open Trogluddite opened 7 months ago

Trogluddite commented 7 months ago

Nutch is a web crawler; the goal here is to configure it to gather some documents from the web for storage in Solr.

We should take good notes about how to use Nutch, and any observations about how easy it will be to swap in a different scraping utility later on.
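One way to keep that swap cheap is to hide the scraper behind a small interface, so the indexing side never talks to Nutch directly. The following is only a rough sketch under assumptions: `Document`, `Scraper`, `StubScraper`, and `collect` are hypothetical names invented for illustration, not part of Nutch, Solr, or this repo, and the stub fetches nothing.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Protocol


@dataclass
class Document:
    """One fetched page, in the shape the indexing step would receive (hypothetical)."""
    url: str
    title: str
    body: str


class Scraper(Protocol):
    """Hypothetical interface: anything that turns seed URLs into documents."""

    def crawl(self, seeds: Iterable[str]) -> Iterator[Document]:
        ...


class StubScraper:
    """Placeholder backend; a Nutch or Scrapy wrapper would take its place."""

    def crawl(self, seeds: Iterable[str]) -> Iterator[Document]:
        # No network access here: emit one empty document per seed URL.
        for url in seeds:
            yield Document(url=url, title="(stub)", body="")


def collect(scraper: Scraper, seeds: Iterable[str]) -> List[Document]:
    """Run whichever backend is passed in and gather its output."""
    return list(scraper.crawl(seeds))
```

With something like this, calling `collect(StubScraper(), ["https://example.org/a"])` would look the same no matter which scraper sits behind it, and our notes can focus on what each backend needs in order to produce a `Document`.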

We'll also use this tutorial as a starter guide: https://www.cs.toronto.edu/~muuo/blog/build-yourself-a-mini-search-engine/

WildfireGaming commented 7 months ago

Next, I need to build an in-depth understanding of how Nutch works.

Trogluddite commented 7 months ago

We might consider using a different scraper. I don't know how complex this is or what the learning curve is, but when I specced this out, I considered using Scrapy: https://scrapy.org/

The advantage is that it's Python-based, so we might have more flexibility in modifying and integrating it.

WildfireGaming commented 7 months ago

https://www.mail-archive.com/user@nutch.apache.org/msg16741.html