Config File and StorageProxy Abstractions

USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.

http://irds.usc.edu/sparkler/

Apache License 2.0


Config File and StorageProxy Abstractions #216

Kefaun2601 closed this 3 years ago

Kefaun2601 commented 3 years ago

What changes were proposed in this pull request?

Changes to sparkler-default.yaml configuration:

Replaced the single crawldb.uri setting with crawldb.backend, solr.uri, and elasticsearch.uri

Added StorageProxy and StorageProxyFactory:

WIP: abstracting the Solr implementation out into the SolrProxy wrapper
StorageProxyFactory returns the correct StorageProxy (SolrProxy or ElasticsearchProxy) based on the YAML config file
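
To illustrate the idea, here is a minimal sketch of how a factory could pick a backend from the three config keys named above. It is not the code from this PR: the StorageProxy methods, the fromConfig name, the backend values "solr" and "elasticsearch", and the example URIs are all assumptions made for the sketch.

```java
import java.util.Map;

// Hypothetical interface; the real StorageProxy may expose different methods.
interface StorageProxy {
    String backendName();
    String uri();
}

// Hypothetical wrapper around a Solr-backed crawl database.
final class SolrProxy implements StorageProxy {
    private final String uri;
    SolrProxy(String uri) { this.uri = uri; }
    public String backendName() { return "solr"; }
    public String uri() { return uri; }
}

// Hypothetical wrapper around an Elasticsearch-backed crawl database.
final class ElasticsearchProxy implements StorageProxy {
    private final String uri;
    ElasticsearchProxy(String uri) { this.uri = uri; }
    public String backendName() { return "elasticsearch"; }
    public String uri() { return uri; }
}

final class StorageProxyFactory {
    private StorageProxyFactory() { }

    // Reads crawldb.backend and returns the matching proxy,
    // built with that backend's own URI key.
    static StorageProxy fromConfig(Map<String, String> config) {
        String backend = config.get("crawldb.backend");
        if (backend == null) {
            throw new IllegalArgumentException("crawldb.backend is not set");
        }
        switch (backend) {
            case "solr":
                return new SolrProxy(requireUri(config, "solr.uri"));
            case "elasticsearch":
                return new ElasticsearchProxy(requireUri(config, "elasticsearch.uri"));
            default:
                throw new IllegalArgumentException("Unsupported crawldb.backend: " + backend);
        }
    }

    // Fails early when the selected backend has no URI configured.
    private static String requireUri(Map<String, String> config, String key) {
        String value = config.get(key);
        if (value == null || value.isEmpty()) {
            throw new IllegalArgumentException(key + " is not set");
        }
        return value;
    }
}

final class StorageProxyDemo {
    public static void main(String[] args) {
        // Assumed values; in Sparkler these would come from sparkler-default.yaml.
        Map<String, String> config = Map.of(
            "crawldb.backend", "solr",
            "solr.uri", "http://localhost:8983/solr/crawldb",
            "elasticsearch.uri", "http://localhost:9200");
        StorageProxy proxy = StorageProxyFactory.fromConfig(config);
        System.out.println(proxy.backendName() + " at " + proxy.uri());
    }
}
```

Keeping the backend choice in one key and each backend's URI in its own key means switching storage engines only requires changing crawldb.backend, while callers depend only on the StorageProxy interface.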

Is this related to an already existing issue on sparkler?
Related to #211 and #218.

Will it close an existing issue?
Does not close an issue yet.

How was this patch tested?

This patch was tested by running "mvn clean package" within the "sparkler-core" directory. The tests currently pass, and it builds successfully.

lewismc commented 3 years ago

This is looking better, @kyan2601. Thanks.

Kefaun2601 commented 3 years ago

Confirmed that I can run a crawl job and view the data on the Banana dashboard. I did not notice any regressions in functionality that may have arisen from the implemented Solr abstractions.

Commands run (as per the repo README):

# Start in the sparkler-core directory
cd sparkler-core

# Run script to start docker container and forward ports to host
bash ./bin/dockler.sh

# Inject seed urls
/data/sparkler/bin/sparkler.sh inject -id 1 -su 'http://www.bbc.com/news'

# Start the crawl job
/data/sparkler/bin/sparkler.sh crawl -id 1 -tn 100 -i 2

Access the Banana dashboard at http://localhost:8983/banana/ to see the data.

lewismc commented 3 years ago

Hi @Kefaun2601, this PR and branch have a conflict that must be resolved. Once that's done, please tag me and I will test it out. Thank you.

Kefaun2601 commented 3 years ago

@lewismc Resolved the merge conflicts. Could you please review it? Thanks!

Kefaun2601 commented 3 years ago

Added documentation for specifying the storage engine to use in the config file (sparkler-default.yaml): https://github.com/USCDataScience/sparkler/wiki/Specifying-CrawlDB-in-Config

Documentation on the StorageProxyFactory abstraction will follow.

Kefaun2601 commented 3 years ago

Added documentation for the StorageProxyFactory abstraction. Will build on this documentation as we expand the factory.

https://github.com/USCDataScience/sparkler/wiki/StorageProxyFactory-Abstraction

lewismc commented 3 years ago

Also confirmed that this DOES NOT introduce a regression. Thank you @felixloesing and @kyan2601