Nutch allows various inquiries into its database regarding the status of each crawled URL. These are currently Java classes called from command-line scripts, and they produce reports that land in Hadoop's file system, so they are not very friendly either to produce or to consume.
But I'd like to come up with a way to export data from this database about each crawled URL, aggregate it, and report it to a dashboard.
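To make that concrete, here is a rough Java sketch of such an export, assuming the standard CrawlDb layout (MapFiles of Text URL -> CrawlDatum under crawldb/current/part-*). The class name, the tab-separated output, and the choice of fields are placeholders, not a committed design.

```java
import java.io.PrintWriter;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

/** Sketch: dump one row per URL (url, status, fetch time) from the CrawlDb. */
public class CrawlDbExport {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path crawlDb = new Path(args[0], "current");   // e.g. crawl/crawldb -> crawl/crawldb/current
    PrintWriter out = new PrintWriter(System.out);

    // Each part-* directory holds a MapFile whose "data" file is a SequenceFile
    // of Text (URL) -> CrawlDatum.
    for (FileStatus part : fs.listStatus(crawlDb)) {
      if (!part.getPath().getName().startsWith("part-")) {
        continue;  // skip markers like _SUCCESS
      }
      Path data = new Path(part.getPath(), "data");
      try (SequenceFile.Reader reader =
          new SequenceFile.Reader(conf, SequenceFile.Reader.file(data))) {
        Text url = new Text();
        CrawlDatum datum = new CrawlDatum();
        while (reader.next(url, datum)) {
          out.printf("%s\t%s\t%d%n",
              url, CrawlDatum.getStatusName(datum.getStatus()), datum.getFetchTime());
        }
      }
    }
    out.flush();
  }
}
```

The same rows could just as easily be written somewhere other than stdout; that choice is tied to the storage question at the end of this note.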
The top-level aggregation would be totals by collection or org, so this will also depend on the API that exposes org and collection data. I will have to map URLs stored in Nutch's database back to the org / collection they came from, even if the page depth is distant from the seed URL.
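One simple way to do that mapping, sketched below, assumes the org/collection API can hand back a host-to-org table derived from the seed URLs. The OrgMapper name is made up, and the host heuristic only covers pages that stay on a seed's host; off-host pages would need link provenance tracked some other way (for example, carried along in CrawlDatum metadata).

```java
import java.net.URL;
import java.util.Map;
import java.util.Optional;

/** Sketch: attribute a crawled URL to an org/collection by its host. */
public class OrgMapper {
  private final Map<String, String> hostToOrg;   // e.g. "www.example.org" -> "Example Org"

  public OrgMapper(Map<String, String> hostToOrg) {
    this.hostToOrg = hostToOrg;
  }

  public Optional<String> orgFor(String crawledUrl) {
    try {
      String host = new URL(crawledUrl).getHost().toLowerCase();
      return Optional.ofNullable(hostToOrg.get(host));
    } catch (Exception e) {
      return Optional.empty();  // malformed URL: leave it unattributed
    }
  }
}
```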
It would report on:
- URLs by date and by org
- URLs by status: successes and failures at each stage (fetch, parse, selection, export); see the aggregation sketch after this list
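For the aggregation itself, a minimal sketch that rolls the exported rows up into counts keyed by org, status, and fetch date (the class name is a placeholder):

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.HashMap;
import java.util.Map;

/** Sketch: roll exported rows up into counts keyed by org / status / fetch date. */
public class CrawlStatsAggregator {
  // key: "org|status|yyyy-MM-dd" -> count
  private final Map<String, Long> counts = new HashMap<>();

  public void add(String org, String status, long fetchTimeMillis) {
    LocalDate day = Instant.ofEpochMilli(fetchTimeMillis)
        .atZone(ZoneOffset.UTC).toLocalDate();
    counts.merge(org + "|" + status + "|" + day, 1L, Long::sum);
  }

  public Map<String, Long> totals() {
    return counts;
  }
}
```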
A nice-to-have would be the ability to drill down into a given org and see the actual URLs in a grid, with columns for status and dates for each phase of the crawl.
It would run at least daily, or possibly as a final step at the end of each iteration loop that Nutch repeats, reporting on the new segment of URLs it just crawled.
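If it runs per iteration, the report could be scoped to just the newest segment. Segment directories are named by timestamp, so picking the latest one is a simple sort; a sketch, assuming the usual crawl/segments layout (the segment's crawl_fetch subdirectory can then be read with the same SequenceFile pattern as the CrawlDb export above):

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Optional;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Sketch: locate the segment Nutch just finished (segment dirs are timestamp-named). */
public class LatestSegment {
  public static Optional<Path> find(Path segmentsDir) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    return Arrays.stream(fs.listStatus(segmentsDir))
        .filter(FileStatus::isDirectory)
        .map(FileStatus::getPath)
        .max(Comparator.comparing(Path::getName));  // names sort chronologically (yyyyMMddHHmmss)
  }
}
```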
Need recommendations on where to put this exported and aggregated data.