Spark throughput

A particularly long-running transformation scenario revealed a potential inefficiency in Spark throughput, and possibly a striking one.

It appears that this transformation was run twice in the course of the Spark processing. This suggests that, as Spark evaluated the execution path for the DataFrames and RDDs, there was no breakpoint / barrier after the transformation, so downstream operations forced it to re-run quite early in the process. Because Spark evaluates lazily, each action recomputes any upstream steps whose results were not persisted.
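To illustrate the suspected mechanism, here is a minimal plain-Python sketch. It is a toy model of lazy evaluation, not Spark itself and not Combine's code: `LazyDataset`, `expensive_transform`, and `run_pipeline` are hypothetical names invented for this example. It counts how often a transformation runs when two downstream steps consume the same un-persisted result, and again when the result is held at a barrier. In Spark, the usual way to add such a barrier is `cache()` or `persist()` on the DataFrame or RDD, though whether that is the right fix here still needs to be confirmed.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated lineage (NOT the Spark API)."""

    def __init__(self, compute):
        self._compute = compute      # function that produces the records
        self._persisted = False
        self._cached = None

    def map(self, fn):
        # Building a new step does no work; it only records the lineage.
        parent = self
        return LazyDataset(lambda: [fn(x) for x in parent.collect()])

    def persist(self):
        # Mark this step as a barrier: compute once, then reuse the result.
        self._persisted = True
        return self

    def collect(self):
        # An "action": this is where the work actually happens.
        if self._persisted:
            if self._cached is None:
                self._cached = self._compute()
            return self._cached
        return self._compute()


calls = {"transform": 0}


def expensive_transform(record):
    # Hypothetical stand-in for the long-running transformation.
    calls["transform"] += 1
    return record.upper()


def run_pipeline(persist):
    """Run two downstream actions and return how often the transform ran."""
    calls["transform"] = 0
    source = LazyDataset(lambda: ["a", "b", "c"])
    transformed = source.map(expensive_transform)
    if persist:
        transformed.persist()
    transformed.collect()             # first action, e.g. writing records
    transformed.map(len).collect()    # second action, e.g. indexing
    return calls["transform"]


print(run_pipeline(persist=False))  # 6: each of the 3 records transformed twice
print(run_pipeline(persist=True))   # 3: each record transformed once
```

If the real job behaves like the un-persisted case, the transformation cost would roughly double for every additional action that depends on it.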

This is worth investigating, as it potentially affects all Spark-related activity, and fixing it could dramatically improve throughput (it would also explain why Harvests appear to harvest twice).

MI-DPLA / combine

Spark throughput #203