coherentdigital / coherencebot

Apache Nutch is an extensible and scalable web crawler
https://nutch.apache.org/
Apache License 2.0

Export runtime monitoring to a dashboard #5

Open PeterCiuffetti opened 3 years ago

PeterCiuffetti commented 3 years ago

CoherenceBot produces a console log, but it is filled with noisy Java logging statements.

Given that CoherenceBot will be running on multiple clusters, I'd like to have a way to go to one place and see if all the bots are still running, what phase of the loop they are in, how many new URLs came in with the last iteration, and other details like that.

As each bot will be responsible for a different set of seed URLs, I'd like to know how many domains it is handling, how many URLs it is managing, and how many PDFs it is exporting.

I'd also like to monitor the Hadoop file system on each cluster to see how full it is.
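As a rough sketch of the file system check, one option is to run `hdfs dfs -df` on each cluster and parse its output. The function names here (`parse_hdfs_df`, `hdfs_usage`) are made up for illustration, and the parser assumes the plain byte-count output (no `-h` flag) with a header line followed by a single data row.

```python
import subprocess


def parse_hdfs_df(output: str) -> dict:
    """Parse the plain (non -h) output of `hdfs dfs -df`.

    Assumes a header line followed by one data row:
    Filesystem  Size  Used  Available  Use%
    """
    lines = [ln for ln in output.splitlines() if ln.strip()]
    fields = lines[1].split()
    size, used, available = int(fields[1]), int(fields[2]), int(fields[3])
    return {
        "filesystem": fields[0],
        "size_bytes": size,
        "used_bytes": used,
        "available_bytes": available,
        # Recomputed from the byte counts instead of trusting the Use% column.
        "used_fraction": used / size if size else 0.0,
    }


def hdfs_usage(path: str = "/") -> dict:
    """Run `hdfs dfs -df` for the given path and return the parsed usage."""
    result = subprocess.run(
        ["hdfs", "dfs", "-df", path],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_hdfs_df(result.stdout)
```

Calling `hdfs_usage()` on a cluster node returns a small dict that could be sent to whatever dashboard ends up being used.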

PeterCiuffetti commented 3 years ago

There are systems that can help here. The EMR installations come with a number of installed apps that give visibility into the running processes.


You need an SSH tunnel to see them, though. I've found the ones I used bewilderingly complex, and I'm looking for something simpler. These apps show a job entry for literally every map-reduce step, and there can be thousands of these in a running crawl. I'm also not sure how, or if, they could be configured for multiple EMR instances. So something that greps the log for interesting rows and posts them to an analytics system might be better.
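A minimal sketch of the log-grepping idea could look like the following. Everything specific in it is an assumption: the regex patterns are placeholders (the real CoherenceBot log lines would need their own patterns), and the `bot_id` label, the endpoint passed to `post_event`, and the JSON shape are hypothetical, not the API of any particular analytics system.

```python
import json
import re
import urllib.request
from typing import Dict, Iterable, Iterator, Pattern

# Placeholder patterns only; replace with ones matching the actual log rows.
PATTERNS: Dict[str, Pattern[str]] = {
    "phase": re.compile(r"phase=(?P<value>\w+)"),
    "new_urls": re.compile(r"new_urls=(?P<value>\d+)"),
}


def extract_events(
    lines: Iterable[str],
    bot_id: str,
    patterns: Dict[str, Pattern[str]] = PATTERNS,
) -> Iterator[dict]:
    """Yield one event per pattern match and skip all other log lines."""
    for line in lines:
        for name, pattern in patterns.items():
            match = pattern.search(line)
            if match:
                yield {"bot": bot_id, "metric": name, "value": match.group("value")}


def post_event(endpoint: str, event: dict) -> int:
    """POST one event as JSON to a (hypothetical) collector endpoint."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Each bot would feed its console log through `extract_events` with its own `bot_id`, so a single dashboard could tell the clusters apart without showing every map-reduce step.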