MI-DPLA / combine

Combine /kämˌbīn/ - Metadata Aggregator Platform
MIT License

look to mapPartitions for performance increase #186

Closed ghukill closed 6 years ago

ghukill commented 6 years ago

Converting the XSLT transformations that use pyjxslt to rdd.mapPartitions (commit) gave a relatively substantial performance increase. Look to other areas where this could be applied:

Anything that is not row-specific can be moved out of the UDF and applied at the partition level.
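A minimal sketch of the pattern, not the actual Combine code: `ExpensiveTransformer` is a hypothetical stand-in for something costly to build (an XSLT processor, for example), and `run_on_partitions` only imitates in plain Python what Spark does when a function is passed to `rdd.mapPartitions`. The point is that the setup happens once per partition instead of once per record.

```python
from typing import Iterable, Iterator, List


class ExpensiveTransformer:
    """Hypothetical stand-in for an object that is costly to construct."""

    instances_created = 0  # class-level counter, for demonstration only

    def __init__(self) -> None:
        ExpensiveTransformer.instances_created += 1

    def transform(self, record: str) -> str:
        # Placeholder for the real per-record work.
        return record.upper()


def transform_partition(records: Iterable[str]) -> Iterator[str]:
    """Function shaped for rdd.mapPartitions: iterator in, iterator out."""
    # Setup that is not row-specific runs once per partition.
    transformer = ExpensiveTransformer()
    for record in records:
        # Only the row-specific work stays inside the loop.
        yield transformer.transform(record)


def run_on_partitions(partitions: List[List[str]]) -> List[List[str]]:
    """Local imitation of mapPartitions, so the sketch runs without Spark."""
    return [list(transform_partition(iter(p))) for p in partitions]


partitions = [["a", "b", "c"], ["d", "e"]]
results = run_on_partitions(partitions)
```

With an actual RDD, the same function would be passed as `rdd.mapPartitions(transform_partition)`. A per-record UDF that built the transformer itself would construct it five times for the data above, where the partition-level version constructs it twice.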

ghukill commented 6 years ago

Aside from instantiating the index mapper only once per partition rather than once per record, which has been implemented, I am not seeing much potential for improvement from mapPartitions in field mapping.

ghukill commented 6 years ago

Applied where applicable, with some definite performance gains (2-8x, depending).