USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0
412 stars 143 forks

Writing Data to Elasticsearch Storage Engine #225

Closed: slhsxcmy closed this 2 years ago

slhsxcmy commented 3 years ago

What changes were proposed in this pull request?

We have used the Factory pattern to extract the storage components (Solr and Elasticsearch) from the core Sparkler implementation. The Elasticsearch classes are currently placeholders, which we are now starting to implement. We are also testing that Solr still runs correctly through the factory.

We moved the Solr-related classes into sparkler-app/src/main/scala/edu/usc/irds/sparkler/storage/solr. This includes the original MemexDeepCrawlDbRDD and MemexCrawlDbRDD, which we renamed to SolrDeepRDD and SolrRDD to reflect that they are Solr-specific. Let us know if you think the new names stray from the classes' purpose and should be changed again.
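For readers unfamiliar with the pattern, the factory arrangement described above might look roughly like the sketch below. Every name in it (StorageProxy, InMemoryProxy, SolrProxySketch, ElasticsearchProxySketch, StorageFactory) is hypothetical and not taken from the Sparkler codebase, and the backends are in-memory stand-ins rather than real Solr or Elasticsearch clients.

```scala
import scala.collection.mutable

// Hypothetical interface: what the crawler would need from any storage engine.
trait StorageProxy {
  def name: String
  def write(id: String, fields: Map[String, String]): Unit
  def count: Int
}

// In-memory stand-in so the sketch runs without a Solr or Elasticsearch server.
abstract class InMemoryProxy(val name: String) extends StorageProxy {
  private val docs = mutable.Map.empty[String, Map[String, String]]

  // Writing the same id twice overwrites the earlier document.
  override def write(id: String, fields: Map[String, String]): Unit =
    docs.update(id, fields)

  override def count: Int = docs.size
}

// Placeholder backends; a real version would wrap the engine's client library.
final class SolrProxySketch extends InMemoryProxy("solr")
final class ElasticsearchProxySketch extends InMemoryProxy("elasticsearch")

// The factory is the only place that knows which concrete class to build.
object StorageFactory {
  def create(engine: String): StorageProxy = engine.trim.toLowerCase match {
    case "solr"                 => new SolrProxySketch
    case "elasticsearch" | "es" => new ElasticsearchProxySketch
    case other =>
      throw new IllegalArgumentException(s"Unknown storage engine: $other")
  }
}

object FactoryDemo {
  def main(args: Array[String]): Unit = {
    // The engine name would normally come from configuration (an assumption here).
    val storage = StorageFactory.create("elasticsearch")
    storage.write("http://example.com/", Map("status" -> "UNFETCHED"))
    println(s"${storage.name} holds ${storage.count} document(s)")
  }
}
```

With this shape, calling code depends only on the trait, so adding a new engine would mean adding one class and one case in the factory.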

Is this related to an already existing issue on sparkler?

#224

#229

Kefaun2601 commented 3 years ago

@lewismc

We’ve been duplicating the *RDD.scala files (in this directory) and modifying them into Elasticsearch variants. Just confirming, is this the correct approach?

NOTE: this is still very much a work in progress. We would just like to confirm that we're heading in the right direction and hear any suggestions you have. Thanks.

buggtb commented 2 years ago

I need ES support, and I need to merge this into the mainline because GitHub is giving dodgy merge instructions and I'd rather not lose it. So I'm going to merge this in, sync it with my mammoth mvn2sbt dev branch, clean up the integration, and then merge the whole lot back into master.