USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0

Elasticsearch for Sparkler - Command Line Configuration #211

Closed: Kefaun2601 closed this issue 3 years ago

Kefaun2601 commented 3 years ago

We are setting up an Elasticsearch backend for Sparkler. This will serve as another pipeline for data persistence parallel to the existing Apache Solr connector.

Sparkler Committers, do you have any advice on how we should configure the command line options to allow the user to specify Solr or Elasticsearch on startup? Ideally, we would like to minimize friction with the already existing framework.

Thanks in advance!

thammegowda commented 3 years ago

Hi @Kefaun2601! This is a great idea.

Currently, Solr is the default, and its settings are read from the config file: https://github.com/USCDataScience/sparkler/blob/53f54746eb00d35a3f93fac0d3b8dbaa895d5755/sparkler-core/conf/sparkler-default.yaml#L18-L29

I suggest modifying the config to support Elasticsearch:

crawldb.backend: elasticsearch  # "solr" stays the default until the Elasticsearch backend is usable
# add any settings necessary for Elasticsearch
# if there are too many, create an `elasticsearch:` block in the config
elasticsearch:
  uri: xyz
  arg1: val1
  arg2: val2

I prefer adding this to the config instead of the CLI, for two reasons:

  1. Users don't need to specify this argument for every run. With a CLI option, they could make mistakes by specifying it on one run and forgetting it on another.
  2. Once a user chooses either Elasticsearch or Solr during setup, they can't change their mind at runtime; it is a firm choice to make and stick with. It's convenient to put such choices in the config, since we already have one.

If you feel the config adds friction, let's consider minimizing it by creating startup scripts that automate it.
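
As a rough illustration of the config-driven selection suggested above, here is a minimal Python sketch of reading `crawldb.backend` from an already-parsed config and falling back to Solr when the key is absent. It is not Sparkler's actual code (Sparkler runs on the JVM); the function name `resolve_backend` and the constants are hypothetical, and the config is assumed to have been loaded into a plain mapping by whatever YAML parser is in use.

```python
from typing import Any, Dict, Mapping, Tuple

# Assumption: per the suggestion above, Solr remains the default backend.
DEFAULT_BACKEND = "solr"
SUPPORTED_BACKENDS = ("solr", "elasticsearch")


def resolve_backend(config: Mapping[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Return (backend name, backend-specific settings) from a parsed config.

    `config` is the mapping produced by parsing the YAML file, so the flat key
    `crawldb.backend` and the optional `elasticsearch:` block appear as
    top-level entries.
    """
    name = str(config.get("crawldb.backend", DEFAULT_BACKEND)).strip().lower()
    if name not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"Unsupported crawldb.backend {name!r}; "
            f"expected one of {SUPPORTED_BACKENDS}"
        )

    # Backend settings live in a block named after the backend, if present.
    settings = config.get(name) or {}
    if not isinstance(settings, Mapping):
        raise ValueError(f"The `{name}:` block must be a mapping of settings")
    return name, dict(settings)


# Example: the mapping a YAML parser would produce for the config above.
example_config = {
    "crawldb.backend": "elasticsearch",
    "elasticsearch": {"uri": "xyz", "arg1": "val1", "arg2": "val2"},
}
backend, backend_settings = resolve_backend(example_config)
```

Because the choice is resolved once from the config, every run uses the same backend without any extra command-line argument, which is the point of reason 1 above.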

Kefaun2601 commented 3 years ago

@thammegowda This is of great help! Our capstone team members will look into setting up Elasticsearch according to your advice above. Many thanks!