USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0
410 stars 143 forks

Is there any benchmark Sparkler versus Nutch? #186

Closed: MobinRanjbar closed 3 years ago

MobinRanjbar commented 4 years ago

Hi there,

Is there any performance benchmark comparing Sparkler with Nutch? What is the value of Spark in practice?

Best,

thammegowda commented 4 years ago

Hi, we don't have any publicly shareable benchmarks.

Besides, what exactly do you want to benchmark? FYI, much of the time spent by the crawler simply goes into the waits and delays we deliberately introduce between web page requests. Both Nutch and Sparkler are "fair crawlers", i.e. they don't bombard websites with too many requests even when they could (doing so would get the IPs blocked).
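To make the point about deliberate delays concrete, here is a minimal sketch of why a polite crawl's wall-clock time is bounded by the per-host delay rather than by the compute engine. The function name `min_crawl_seconds`, the example hosts, and the 5-second delay are all hypothetical and are not taken from Sparkler's or Nutch's code or defaults. It assumes hosts are fetched fully in parallel and ignores the download time itself.

```python
from collections import defaultdict
from urllib.parse import urlparse


def min_crawl_seconds(urls, delay_per_host=1.0):
    """Lower bound on wall-clock time for a polite crawl.

    Assumption (hypothetical model): each host receives at most one
    request every `delay_per_host` seconds, different hosts are fetched
    in parallel, and the fetch time itself is ignored.
    """
    # Count how many pages have to be fetched from each host.
    pages_per_host = defaultdict(int)
    for url in urls:
        pages_per_host[urlparse(url).netloc] += 1

    if not pages_per_host:
        return 0.0

    # The first request to a host needs no wait; every later request
    # waits one full delay. The busiest host sets the lower bound.
    return max((n - 1) * delay_per_host for n in pages_per_host.values())


# Hypothetical workload: 100 pages on one host, 10 pages on another.
urls = [f"https://a.example/p/{i}" for i in range(100)]
urls += [f"https://b.example/p/{i}" for i in range(10)]

print(min_crawl_seconds(urls, delay_per_host=5.0))  # prints 495.0
```

Under this model, adding more workers does not shorten the crawl once every host already has its own fetch slot, so a benchmark that only measures elapsed time would mostly measure the configured delay.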

If you have a precise definition of a benchmark (CPU/memory, disk/network I/O, etc.), please run it, publish it somewhere, and let us know. Benchmarks are best when they are unbiased, and they are more likely to be unbiased when a third party like you runs them.

Thanks

MobinRanjbar commented 4 years ago

Thanks @thammegowda for the useful information.

What I mean is: what did you want to achieve by using Spark? I would like to see something like this: https://dzone.com/articles/the-battle-of-the-crawlers-apache-nutch-vs-stormcr#:~:text=This%20Q%20and%20A%20should,crawler%20based%20on%20Apache%20Hadoop.&text=StormCrawler%2C%20on%20the%20other%20hand,and%20at%20the%20same%20time.