MI-DPLA / combine

Combine /kämˌbīn/ - Metadata Aggregator Platform
MIT License

Spark throughput #203

Closed ghukill closed 6 years ago

ghukill commented 6 years ago

A particularly long-running transformation scenario revealed a potential inefficiency -- perhaps a striking one -- in Spark throughput.

It appears that this transformation was run twice in the course of the Spark processing. This suggests that, as Spark evaluated the execution path for the DataFrames and RDDs, there was no breakpoint / barrier between the transformation and the later steps, so downstream operations forced the transformation to re-run from quite early in the process.
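To illustrate the suspected mechanism, here is a minimal sketch in plain Python. It is a toy model of lazy evaluation, not Spark's API and not Combine's code: `LazyDataset`, `expensive_transform`, and `run_job` are hypothetical names invented for this example. It shows how two actions on the same lazily defined dataset each re-run the upstream transformation unless a cache acts as a barrier. In PySpark the analogous barrier would be a call such as `DataFrame.cache()` or `RDD.persist()`, though whether that applies to this particular job is an assumption.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated RDD/DataFrame (not Spark's API)."""

    def __init__(self, compute):
        self._compute = compute    # zero-argument function that produces the rows
        self._use_cache = False    # set by cache()
        self._cached = None        # filled in by the first action after cache()

    def map(self, fn):
        # Transformations are lazy: record the lineage, run nothing yet.
        return LazyDataset(lambda: [fn(row) for row in self._rows()])

    def cache(self):
        # Mark this dataset as a barrier: compute once, then reuse the result.
        self._use_cache = True
        return self

    def _rows(self):
        if self._use_cache:
            if self._cached is None:
                self._cached = self._compute()
            return self._cached
        # No barrier: every caller walks the lineage again from the source.
        return self._compute()

    def count(self):
        return len(self._rows())

    def collect(self):
        return list(self._rows())


calls = {"transform": 0}


def expensive_transform(record):
    # Hypothetical stand-in for a slow per-record transformation.
    calls["transform"] += 1
    return record.upper()


def run_job(use_cache):
    """Run two actions on one transformed dataset; return how often the transform ran."""
    calls["transform"] = 0
    source = LazyDataset(lambda: ["a", "b", "c"])
    transformed = source.map(expensive_transform)
    if use_cache:
        transformed.cache()
    transformed.count()                            # action 1, e.g. counting records
    transformed.map(lambda r: r + "!").collect()   # action 2, e.g. a downstream step
    return calls["transform"]


print(run_job(use_cache=False))  # 6: each of the 3 records is transformed twice
print(run_job(use_cache=True))   # 3: the cached result is reused by the second action
```

If the job in question triggers more than one action on the transformed data (for example a count followed by a write), this kind of recomputation would be consistent with the transformation appearing to run twice.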

This is worth investigating, as it could be dramatically increasing the cost of all Spark-related activity (it would also explain why Harvests appear to harvest twice).

ghukill commented 6 years ago

Having trouble recreating this; closing for now.