look to mapPartitions for performance increase

MI-DPLA / combine

Combine /kämˌbīn/ - Metadata Aggregator Platform

MIT License


look to mapPartitions for performance increase #186

Closed by ghukill 6 years ago

ghukill commented 6 years ago

After converting the XSLT transformations that use pyjxslt to use rdd.mapPartitions (commit), and seeing a relatively substantial performance increase, look to other areas where this could be applied:

- validation UDFs in record_validation.py
- python transformation
  - move the temp ModuleType to the partition mapping
- field mapping for ES, for index_mapper_handle() that comes from GenericMapper
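For the python transformation item, a rough sketch of what moving the temp ModuleType to the partition mapping might look like. This is not the code in combine: `USER_CODE`, `python_record_transformation` and `transform_python_partition` are hypothetical names, and the partition is simulated with a plain iterator so the sketch runs without Spark.

```python
from types import ModuleType

# Hypothetical user-supplied transformation code (an assumption for this sketch)
USER_CODE = """
def python_record_transformation(record):
    return record.strip().lower()
"""


def transform_python_partition(rows, code=USER_CODE):
    """Iterator in, iterator out: the shape rdd.mapPartitions expects."""
    # Build the temporary module once per partition instead of once per row
    temp_module = ModuleType("temp_transform")
    exec(code, temp_module.__dict__)
    for row in rows:
        yield temp_module.python_record_transformation(row)


# Simulate a single partition locally
output = list(transform_python_partition(iter(["  Foo ", "BAR"])))
```

With a real RDD, the same function could presumably be passed as `rdd.mapPartitions(transform_python_partition)`.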

Anything that is not row-specific can be moved out of the UDF and applied at the partition level.
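To illustrate the general idea, a minimal sketch with a hypothetical `ExpensiveTransformer` standing in for any costly, non-row-specific setup (a compiled stylesheet, a validator, a mapper). The partitions are simulated as plain lists, and the instance counter is only there to show that setup runs once per partition rather than once per record.

```python
class ExpensiveTransformer:
    """Hypothetical stand-in for setup that does not depend on the row."""

    instances = 0  # counts how many times the setup cost is paid

    def __init__(self):
        ExpensiveTransformer.instances += 1

    def transform(self, record):
        return record.upper()


def transform_partition(rows):
    """Iterator in, iterator out: the shape rdd.mapPartitions expects."""
    transformer = ExpensiveTransformer()  # built once per partition
    for row in rows:
        yield transformer.transform(row)


# Simulate two partitions holding five records in total
partitions = [["a", "b", "c"], ["d", "e"]]
results = [list(transform_partition(iter(p))) for p in partitions]
setup_count = ExpensiveTransformer.instances
```

Five records are processed but the setup runs only twice, once for each partition. A per-row UDF doing the same setup would pay that cost five times.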

ghukill commented 6 years ago

Aside from instantiating the index mapper only once per partition, as opposed to once per record -- which has been implemented -- not seeing much potential for improvement from mapPartitions in field mapping.

ghukill commented 6 years ago

Applied where applicable, some definite performance gains (2-8x, depending).