Open Trogluddite opened 7 months ago
The next thing I need is an in-depth understanding of how Nutch works.
We might consider using a different scraper. I don't know how complex this is or what the learning curve is, but when I specced this out, I considered using Scrapy: https://scrapy.org/
The advantage is that it's Python-based, so we might have more flexibility in modifying it and integrating it.
Nutch is a web crawler; the goal here is to configure it to gather some documents from the web, for storage in Solr.
We should take good notes about how to use Nutch, and any observations about how easy it will be to swap in a different scraping utility later on.
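One way to keep the scraper swappable is to have the rest of the pipeline depend on a small interface rather than on Nutch or Scrapy directly. The sketch below is only an illustration under assumptions: `Scraper`, `ScrapedDocument`, `InMemoryScraper`, and `collect_for_indexing` are hypothetical names, the field names are not a real Solr schema, and nothing here calls Nutch, Scrapy, or Solr.

```python
from dataclasses import dataclass, asdict
from html.parser import HTMLParser
from typing import Dict, Iterable, Iterator, List, Protocol


@dataclass
class ScrapedDocument:
    """Hypothetical record shape; the field names are illustrative only."""
    url: str
    title: str
    content: str


class Scraper(Protocol):
    """Anything that turns seed URLs into documents (a Nutch or Scrapy wrapper)."""

    def crawl(self, seed_urls: Iterable[str]) -> Iterable[ScrapedDocument]: ...


class _TitleAndTextParser(HTMLParser):
    """Collects the <title> text and all other non-blank text nodes."""

    def __init__(self) -> None:
        super().__init__()
        self.title_parts: List[str] = []
        self.text_parts: List[str] = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif data.strip():
            self.text_parts.append(data.strip())


def parse_html(url: str, html: str) -> ScrapedDocument:
    """Build a document from raw HTML using only the standard library."""
    parser = _TitleAndTextParser()
    parser.feed(html)
    parser.close()
    return ScrapedDocument(
        url=url,
        title="".join(parser.title_parts).strip(),
        content=" ".join(parser.text_parts),
    )


class InMemoryScraper:
    """Stand-in scraper that reads pages from a dict instead of the network."""

    def __init__(self, pages: Dict[str, str]) -> None:
        self.pages = pages

    def crawl(self, seed_urls: Iterable[str]) -> Iterator[ScrapedDocument]:
        for url in seed_urls:
            if url in self.pages:  # unknown URLs are skipped
                yield parse_html(url, self.pages[url])


def collect_for_indexing(scraper: Scraper, seed_urls: Iterable[str]) -> List[dict]:
    """Return plain dicts that a separate indexing step could send to a search index."""
    return [asdict(doc) for doc in scraper.crawl(seed_urls)]
```

With this shape, swapping scrapers would mean writing another class with a `crawl` method; `collect_for_indexing` and anything downstream of it would not need to change.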
We'll also use this tutorial as a starter guide: https://www.cs.toronto.edu/~muuo/blog/build-yourself-a-mini-search-engine/