SPIKE: Investigate custom NiFi ReportingTask for delivering file/object metrics to monitoring service

The ReportingTask approach looks like the way to go. A reporting task can iterate over provenance event records, filter on processor type and on event type (e.g., FETCH or CONTENT_MODIFIED), and extract the FlowFile attributes from each event. This will let us manage "trigger" events for all Fetch events in one place.
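The filtering step described above can be sketched roughly as follows. This is a minimal illustration, not the actual task: `ProvEvent` and `FetchEventFilter` are hypothetical stand-ins so the example compiles without NiFi on the classpath. In a real ReportingTask the events would presumably come from `ReportingContext.getEventAccess().getProvenanceEvents(firstEventId, maxRecords)`, and the fields would map to `getComponentType()`, `getEventType()` and `getAttributes()` on `ProvenanceEventRecord`. The processor type is passed in as a parameter rather than assumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for NiFi's ProvenanceEventRecord (illustration only).
final class ProvEvent {
    final long id;
    final String componentType;            // processor type that emitted the event
    final String eventType;                // e.g. "FETCH", "CONTENT_MODIFIED"
    final Map<String, String> attributes;  // FlowFile attributes on the event

    ProvEvent(long id, String componentType, String eventType, Map<String, String> attributes) {
        this.id = id;
        this.componentType = componentType;
        this.eventType = eventType;
        this.attributes = attributes;
    }
}

// Hypothetical helper: keeps only events from one processor type and a set of event types.
final class FetchEventFilter {
    static List<Map<String, String>> select(List<ProvEvent> events,
                                            String processorType,
                                            Set<String> eventTypes) {
        List<Map<String, String>> selected = new ArrayList<>();
        for (ProvEvent event : events) {
            boolean typeMatches = processorType.equals(event.componentType);
            boolean eventMatches = eventTypes.contains(event.eventType);
            if (typeMatches && eventMatches) {
                // Hand the full attribute map to whatever builds the "trigger" message.
                selected.add(event.attributes);
            }
        }
        return selected;
    }
}
```

Keeping the filter a pure function of the event list should also make it easy to unit test separately from NiFi itself.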

I've identified a few things:

1) The ReportingTask can pull provenance events starting from a minimum "event id." This id can be kept in memory as a global variable. However, on NiFi stop/restart, this state is lost, which means we would have to pull from the very beginning of the provenance queue. One option would be to maintain this state elsewhere (in the database, in ZooKeeper), but that requires extra configuration and is not ideal. Instead, we are going to enforce an "at least once" policy on the reporting task, meaning the downstream system will be responsible for de-duplicating provenance event records that it has already seen.
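To make the "at least once" trade-off concrete, here is a small sketch under stated assumptions. `InMemoryCursor` and `DedupingSink` are hypothetical names, event ids are modelled as plain consecutive longs, and the downstream system is assumed to de-duplicate on event id alone (in a cluster the key would likely also need a node identifier). A restart is simulated by creating a fresh cursor.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical reporter side: the next event id lives only in memory.
final class InMemoryCursor {
    private long nextEventId = 0L; // back to 0 after every restart

    // Returns every id from the cursor up to latestId (inclusive) and advances the cursor.
    List<Long> poll(long latestId) {
        List<Long> batch = new ArrayList<>();
        for (long id = nextEventId; id <= latestId; id++) {
            batch.add(id);
        }
        nextEventId = Math.max(nextEventId, latestId + 1);
        return batch;
    }
}

// Hypothetical downstream side: drops event ids it has already accepted.
final class DedupingSink {
    private final Set<Long> seen = new HashSet<>();
    private final List<Long> accepted = new ArrayList<>();

    void receive(List<Long> eventIds) {
        for (Long id : eventIds) {
            // Set.add returns false for duplicates, so replayed events are ignored.
            if (seen.add(id)) {
                accepted.add(id);
            }
        }
    }

    List<Long> accepted() {
        return accepted;
    }
}
```

For example, if one cursor delivers ids 0 to 2, and a fresh cursor (standing in for a restart) then delivers ids 0 to 4, the sink ends up having accepted each of 0 to 4 exactly once. The `seen` set grows without bound in this sketch; a real downstream system would need some retention policy for it.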

2) ReportingTasks are set at the "top level" of the canvas. Flowlib does not currently support creating any elements on the top level, but this is something we can add.

3) We are establishing the requirement that all FlowFiles handled by the FetchS3Processor have two non-standard attributes:

orig_filename
workload_id
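One way the reporting task could guard this requirement is to check each event's attribute map before forwarding it. The sketch below is only an illustration: `RequiredAttributes` is a hypothetical helper, and treating an empty value as missing is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: reports which required attributes are absent from a FlowFile's attributes.
final class RequiredAttributes {
    static final List<String> REQUIRED = List.of("orig_filename", "workload_id");

    static List<String> missing(Map<String, String> attributes) {
        List<String> missing = new ArrayList<>();
        for (String name : REQUIRED) {
            String value = attributes.get(name);
            // Assumption: an empty value is as unusable as an absent one.
            if (value == null || value.isEmpty()) {
                missing.add(name);
            }
        }
        return missing;
    }
}
```

An empty result means the event carries both attributes; otherwise the returned names can be logged or reported so that misconfigured flows are easy to spot.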

We still have some discovery/testing to do on the performance of the ReportingTask, to make sure it doesn't impose a large burden on NiFi.

B23admin / b23-flowlib