Closed MobinRanjbar closed 3 years ago
Hi, we don't have any publicly sharable benchmarks.
Besides, what exactly do you want to benchmark? FYI, much of the crawler's time simply goes into the waits and delays we deliberately introduce between web page requests. Both Nutch and Sparkler are "fair crawlers", i.e. they don't bombard websites with too many requests (doing so would get their IPs blocked), even though they could.
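To illustrate the "fair crawler" behavior described above, here is a minimal sketch of a polite fetch loop that pauses between requests to avoid hammering a site. The delay value, function names, and the pluggable `fetch` parameter are assumptions for illustration, not Sparkler's or Nutch's actual API or configuration:

```python
import time
import urllib.request

# Assumed per-request politeness delay; real crawlers make this configurable
# (often per-host, and sometimes read from robots.txt Crawl-delay).
FETCH_DELAY_SECONDS = 2.0

def polite_fetch(urls, delay=FETCH_DELAY_SECONDS, fetch=None):
    """Fetch each URL in order, sleeping `delay` seconds between requests.

    This sleep is the point made above: most of a polite crawl's wall-clock
    time is deliberate waiting, not CPU or I/O work.
    """
    if fetch is None:
        fetch = lambda u: urllib.request.urlopen(u).read()
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # politeness pause between consecutive requests
        results.append(fetch(url))
    return results
```

Real crawlers parallelize across many hosts while staying polite per host, which is one place a distributed engine like Spark can help.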
If you have a precise definition of a benchmark (CPU, memory, or disk/network I/O, etc.), please run it yourself, publish it somewhere, and update us. Benchmarks are best when unbiased, and they are unbiased when a third party like you performs them.
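As a starting point for the kind of measurement suggested above, here is a hedged sketch that wraps any workload and reports wall time, CPU time, and peak memory. `run_crawl` would be a stand-in for an actual Sparkler or Nutch crawl invocation; nothing here is a real crawler API, and `ru_maxrss` units vary by platform:

```python
import resource  # Unix-only standard library module
import time

def benchmark(workload):
    """Return (wall_seconds, cpu_seconds, peak_rss) for calling `workload()`.

    peak_rss comes from getrusage and is kilobytes on Linux but bytes on
    macOS, so report the platform alongside any published numbers.
    """
    wall0 = time.monotonic()
    cpu0 = time.process_time()
    workload()
    wall = time.monotonic() - wall0
    cpu = time.process_time() - cpu0
    peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return wall, cpu, peak_rss
```

For network I/O you would additionally need to count bytes at the fetch layer or sample OS counters, which this sketch does not attempt.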
Thanks
Thanks @thammegowda for the useful information.
I mean, what did you want to achieve by using Spark? I want to see something like this: https://dzone.com/articles/the-battle-of-the-crawlers-apache-nutch-vs-stormcr
Hi there,
Is there any performance benchmark comparing Sparkler and Nutch? What is the value of Spark in practice?
Best,