USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0
411 stars 143 forks

CLI argument to focus the crawl to a specific domain #140

Closed (voltek62 closed 3 years ago)

voltek62 commented 6 years ago

New feature

Currently, to crawl a specific domain, I need to edit the regex file before every crawl.

My suggestion is to add a CLI argument that focuses the crawl on a specific domain.

E.g., to crawl only the domains that appear in the seed file: --specificdomain=TRUE
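To make the idea concrete, here is a minimal sketch of what such an option might do behind the scenes: derive the filter rules from the seed URLs instead of editing the regex file by hand. It is not Sparkler code. It assumes the regex file follows the Nutch-style convention (each line is `+` or `-` followed by a regex, and the first matching line decides), and the function names `seed_hosts`, `domain_filter_rules` and `is_allowed` are hypothetical.

```python
import re
from urllib.parse import urlsplit


def seed_hosts(seed_urls):
    """Return the unique hostnames found in the seed URLs, in first-seen order."""
    hosts = []
    for url in seed_urls:
        host = urlsplit(url.strip()).hostname  # lowercased by urlsplit
        if host and host not in hosts:
            hosts.append(host)
    return hosts


def domain_filter_rules(seed_urls):
    """Build filter lines in the assumed Nutch-style format.

    One '+' (accept) rule per seed host, also covering its subdomains,
    followed by a final '-.' rule that rejects every other URL.
    """
    rules = []
    for host in seed_hosts(seed_urls):
        rules.append(
            r"+^https?://([a-z0-9-]+\.)*" + re.escape(host) + r"(:\d+)?(/|$)"
        )
    rules.append("-.")  # reject anything not accepted above
    return rules


def is_allowed(url, rules):
    """Apply the rules in order; the first rule whose regex matches decides."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: treat the URL as rejected


if __name__ == "__main__":
    # Print the generated lines so they can be pasted into a regex file.
    for line in domain_filter_rules(["http://example.com/start"]):
        print(line)
```

With rules generated this way, `is_allowed("https://blog.example.com/", rules)` is true while a look-alike host such as `example.com.evil.org` is rejected, because the pattern requires the host to end right after the seed domain.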

Your project is amazing. I am preparing an R package that deploys Sparkler on OpenStack with a single line of code and makes the results easy to retrieve from R.

All the best

thammegowda commented 6 years ago

Thanks for creating this issue and providing all the details. Looks like a nice addition to the project. We will work on it.

chrismattmann commented 6 years ago

Thanks so much!

ravituduru commented 4 years ago

May I know how to edit this regex file for a specific domain?