coherentdigital / coherencebot

Apache Nutch is an extensible and scalable web crawler
https://nutch.apache.org/
Apache License 2.0

Export CoherenceBot crawl statistics to a dashboard #4

Open PeterCiuffetti opened 3 years ago

PeterCiuffetti commented 3 years ago

Nutch allows various inquiries into its database regarding the status of each crawled URL. These are currently Java classes, called from command-line scripts, that write reports to Hadoop's file system. So they are not very friendly, either to produce or to consume.

But I'd like to come up with a way to export data from this database about each crawled URL, aggregate it, and report it to a dashboard.

The top-level aggregation would be totals by collection or org, so this will also depend on the API that exposes org and collection data. I will have to map URLs stored in Nutch's database back to the org / collection they came from, even when a page is many links away from its seed URL.
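A minimal sketch of the mapping and aggregation step, assuming each URL can be attributed to a seed on the same host with the longest matching path prefix. The seed list, the `owner_of` and `totals_by_owner` names, and the `(url, status)` record shape are all hypothetical; the status strings follow Nutch's CrawlDb status names. Pages that the crawl reaches on a different host than any seed would land in an "unmapped" bucket under this rule, so a real implementation may need a different attribution scheme.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit

# Hypothetical seed list, as it might be returned by the org/collection API:
# (seed URL, org, collection)
SEEDS = [
    ("https://example.org/reports/", "org-a", "reports"),
    ("https://example.org/", "org-a", "main"),
    ("https://data.example.com/", "org-b", "datasets"),
]


def _host_and_path(url):
    """Split a URL into a lowercase host and a path (defaulting to '/')."""
    parts = urlsplit(url)
    return (parts.hostname or "").lower(), parts.path or "/"


def owner_of(url, seeds=SEEDS):
    """Return (org, collection) of the best-matching seed, or None.

    Assumption: a crawled URL belongs to the seed on the same host whose
    path is the longest prefix of the URL's path.
    """
    host, path = _host_and_path(url)
    best, best_len = None, -1
    for seed_url, org, collection in seeds:
        seed_host, seed_path = _host_and_path(seed_url)
        if host == seed_host and path.startswith(seed_path) and len(seed_path) > best_len:
            best, best_len = (org, collection), len(seed_path)
    return best


def totals_by_owner(records, seeds=SEEDS):
    """Count URLs per status for each (org, collection).

    `records` is an iterable of (url, status) pairs exported from the CrawlDb.
    """
    totals = defaultdict(Counter)
    for url, status in records:
        owner = owner_of(url, seeds) or ("unmapped", "unmapped")
        totals[owner][status] += 1
    return {owner: dict(counts) for owner, counts in totals.items()}
```

Calling `totals_by_owner` on the records exported after each run would give the per-org and per-collection totals that the dashboard's top level needs.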

It would report on

A nice-to-have would be the ability to drill down into a given org and see the actual URLs in a grid, with columns for status and dates for each phase of the crawl.
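The drill-down could be served from rows like the ones in this rough sketch. The `UrlRow` fields (in particular the per-phase date columns) and the `grid_for_org` helper are hypothetical names chosen for illustration; which dates Nutch can actually supply for each phase would need to be checked against what the export produces.

```python
from dataclasses import dataclass, asdict
from datetime import datetime
from typing import Optional


@dataclass
class UrlRow:
    """One grid row per crawled URL (hypothetical shape)."""
    url: str
    org: str
    status: str
    fetched_at: Optional[datetime] = None
    parsed_at: Optional[datetime] = None
    indexed_at: Optional[datetime] = None


DATE_COLUMNS = ("fetched_at", "parsed_at", "indexed_at")


def grid_for_org(rows, org):
    """Return the rows for one org as plain dicts, sorted by URL.

    Dates are rendered as ISO 8601 strings; phases that have not
    happened yet are rendered as empty cells.
    """
    selected = sorted((r for r in rows if r.org == org), key=lambda r: r.url)
    grid = []
    for row in selected:
        cells = asdict(row)
        del cells["org"]  # the org is implied by the drill-down itself
        for column in DATE_COLUMNS:
            value = cells[column]
            cells[column] = value.isoformat() if value else ""
        grid.append(cells)
    return grid
```

Returning plain dicts keeps the result easy to serialize as JSON for whatever grid widget the dashboard ends up using.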

It would run at least daily, or possibly as a final step at the end of each iteration of the loop that Nutch repeats, reporting on the new segment of URLs it just crawled.

I need recommendations for where to put this data.

PeterCiuffetti commented 3 years ago

This is probably 2 or 3 days of work.