USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0

Writing Data to Elasticsearch Storage Engine #224

Open Kefaun2601 opened 3 years ago

Kefaun2601 commented 3 years ago

Task Description

This task is currently in progress to provide Elasticsearch as a backend storage engine option for Sparkler. It builds on the Factory Pattern outlined in Issue 218, where storage engine-specific implementation is abstracted behind a common interface.
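For context, a minimal sketch of what that abstraction could look like. Note this is an illustrative assumption, not Sparkler's actual API: only `commitCrawlDb()` is referenced later in this thread, while the trait name, `addDocuments()`, and the factory object are hypothetical.

```scala
// Sketch of the Issue 218 Factory Pattern (hypothetical names except
// commitCrawlDb, which is mentioned later in this thread).
trait StorageProxy {
  def addDocuments(docs: Seq[Map[String, Any]]): Unit // buffer docs for the backend
  def commitCrawlDb(): Unit                           // flush/commit the crawl database
}

class SolrProxy extends StorageProxy {
  def addDocuments(docs: Seq[Map[String, Any]]): Unit = { /* Solr-specific write */ }
  def commitCrawlDb(): Unit = { /* Solr hard commit */ }
}

class ElasticsearchProxy extends StorageProxy {
  def addDocuments(docs: Seq[Map[String, Any]]): Unit = { /* ES bulk index */ }
  def commitCrawlDb(): Unit = { /* ES index refresh */ }
}

object StorageProxyFactory {
  // Select a backend by name so the crawler never touches a concrete engine.
  def getProxy(backend: String): StorageProxy = backend.toLowerCase match {
    case "elasticsearch" => new ElasticsearchProxy()
    case _               => new SolrProxy() // Solr remains the default
  }
}
```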

To reach the final goal of writing Sparkler data into the Elasticsearch storage engine, the team envisions the following steps:

  1. Make sure the Elasticsearch storage engine is set up appropriately and ready to accept data
  2. Write simple data to Elasticsearch
     a. Perhaps a simple visualization to prove functionality
  3. Reorganize Sparkler data into a format conducive to Elasticsearch indexing
  4. Write data into Elasticsearch (see the sketch after this list)
  5. Visualize data in Elasticsearch (this will likely be brought up in a future issue)
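As a proof of concept for steps 2 through 4, one possible path is the elasticsearch-hadoop Spark connector, which can index an RDD of maps directly. A minimal sketch, assuming a local Elasticsearch on port 9200; the `sparkler-crawldb` index name and the document fields are illustrative, not Sparkler's actual schema. It requires the elasticsearch-spark artifact (matching the Spark and Scala versions) on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // elasticsearch-hadoop's Spark integration

object EsWriteSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("sparkler-es-sketch")
      .setMaster("local[*]")
      .set("es.nodes", "localhost") // assumption: a local ES cluster
      .set("es.port", "9200")
    val sc = new SparkContext(conf)

    // Step 3: reshape crawl data into flat key/value documents ES can index
    val docs = Seq(
      Map("url" -> "http://irds.usc.edu/sparkler/", "status" -> "FETCHED", "depth" -> 0),
      Map("url" -> "http://example.com/", "status" -> "UNFETCHED", "depth" -> 1)
    )

    // Steps 2 & 4: write the documents to a hypothetical "sparkler-crawldb" index
    sc.makeRDD(docs).saveToEs("sparkler-crawldb")

    sc.stop()
  }
}
```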

This is a WIP and updates will be posted here as we make progress.

slhsxcmy commented 3 years ago

@thammegowda @buggtb @lewismc We had a few questions about Crawler.scala while adding Elasticsearch:

  1. How does a deep crawl differ from a "normal" crawl? It looks like the deep crawl runs only when the -dc flag is enabled, while the normal crawl always runs. Is that correct?
  2. What does the FairFetcher class do? Do we need to understand it, given that FairFetcher is not specific to Solr?
  3. Why is "storageProxy.commitCrawlDb()" called before the crawl, after the deep crawl, and then again after the normal crawl?