Closed ghukill closed 6 years ago
Aside from instantiating index mapper only once per partition, as opposed to each record -- which has been implemented -- not seeing much potential for improvement from mapPartitions in field mapping.
Applied where applicable, with some definite performance gains (2-8x, depending on the case).
After converting the XSLT transformations that use pyjxslt to use rdd.mapPartitions (commit), which gave a relatively substantial performance increase, look to other areas where this could be applied:
- record_validation.py: ModuleType to the partition mapping
- index_mapper_handle() that comes from GenericMapper

Anything that is not row-specific can be removed from the UDF and applied at the partition level.
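The idea of moving non-row-specific setup to the partition level can be sketched roughly as below. This is only an illustration, not the project's code: `HypotheticalMapper` and `map_partition` are made-up names standing in for something like the index mapper, and the partitions are simulated with plain lists so the sketch runs without Spark.

```python
from typing import Iterable, Iterator


class HypotheticalMapper:
    """Hypothetical stand-in for a helper that is expensive to build."""

    instances_created = 0  # counts constructions, for illustration only

    def __init__(self) -> None:
        HypotheticalMapper.instances_created += 1

    def map_record(self, record: str) -> dict:
        # Placeholder per-record work
        return {"length": len(record), "upper": record.upper()}


def map_partition(records: Iterable[str]) -> Iterator[dict]:
    """Build the mapper once, then reuse it for every record in the partition."""
    mapper = HypotheticalMapper()  # partition-level setup, not row-level
    for record in records:
        yield mapper.map_record(record)


# Simulate two partitions locally (assumption: no Spark session here)
partitions = [["a", "bb"], ["ccc"]]
results = [list(map_partition(iter(p))) for p in partitions]
```

With PySpark, the same function would be passed as `rdd.mapPartitions(map_partition)`, so the mapper is constructed once per partition rather than once per row as it would be inside a UDF or `rdd.map`.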